1. Chen S, Keleş S. GEEES: inferring cell-specific gene-enhancer interactions from multi-modal single-cell data. Bioinformatics 2024; 40:btae638. [PMID: 39468737; PMCID: PMC11549018; DOI: 10.1093/bioinformatics/btae638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 10/17/2024] [Accepted: 10/25/2024] [Indexed: 10/30/2024] [Open Access]
Abstract
MOTIVATION Gene-enhancer interactions are central to transcriptional regulation. Current multi-modal single-cell datasets that profile transcriptome and chromatin accessibility simultaneously in a single cell are yielding opportunities to infer gene-enhancer associations in a cell type specific manner. Computational efforts for such multi-modal single-cell datasets thus far focused on methods for identification and refinement of cell types and trajectory construction. While initial attempts for inferring gene-enhancer interactions have emerged, these have not been evaluated against benchmark datasets that materialized from bulk genomic experiments. Furthermore, existing approaches are limited to inferring gene-enhancer associations at the level of grouped cells as opposed to individual cells, thereby ignoring regulatory heterogeneity among the cells. RESULTS We present a new approach, GEEES for "Gene EnhancEr IntEractions from Multi-modal Single Cell Data," for inferring gene-enhancer associations at the single-cell level using multi-modal single-cell transcriptome and chromatin accessibility data. We evaluated GEEES alongside several multivariate regression-based alternatives we devised and state-of-the-art methods using a large number of benchmark datasets, providing a comprehensive assessment of current approaches. This analysis revealed significant discrepancies between gold-standard interactions and gene-enhancer associations derived from multi-modal single-cell data. Notably, incorporating gene-enhancer distance into the analysis markedly improved performance across all methods, positioning GEEES as a leading approach in this domain. While the overall improvement in performance metrics by GEEES is modest, it provides enhanced cell representation learning which can be leveraged for more effective downstream analysis. 
Furthermore, our review of existing experimentally driven benchmark datasets uncovers their limited concordance, underscoring the necessity for new high-throughput experiments to validate gene-enhancer interactions inferred from single-cell data. AVAILABILITY AND IMPLEMENTATION https://github.com/keleslab/GEEES.
Affiliation(s)
- Shuyang Chen
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
- Sündüz Keleş
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, United States
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53706, United States

2. Anguita-Ruiz A, Amine I, Stratakis N, Maitre L, Julvez J, Urquiza J, Luo C, Nieuwenhuijsen M, Thomsen C, Grazuleviciene R, Heude B, McEachan R, Vafeiadi M, Chatzi L, Wright J, Yang TC, Slama R, Siroux V, Vrijheid M, Basagaña X. Beyond the single-outcome approach: A comparison of outcome-wide analysis methods for exposome research. Environment International 2023; 182:108344. [PMID: 38016387; DOI: 10.1016/j.envint.2023.108344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 10/16/2023] [Accepted: 11/20/2023] [Indexed: 11/30/2023]
Abstract
Outcome-wide analysis can offer several benefits, including increased power to detect weak signals and the ability to identify exposures with multiple effects on health, which may be good targets for preventive measures. Recently, advanced statistical multivariate techniques for outcome-wide analysis have been developed, but they have been rarely applied to exposome analysis. In this work, we provide an overview of a selection of methods that are well-suited for outcome-wide exposome analysis and are implemented in the R statistical software. Our work brings together six different methods presenting innovative solutions for typical problems arising from outcome-wide approaches in the context of the exposome, including dependencies among outcomes, high dimensionality, mixed-type outcomes, missing data records, and confounding effects. The identified methods can be grouped into four main categories: regularized multivariate regression techniques, multi-task learning approaches, dimensionality reduction approaches, and bayesian extensions of the multivariate regression framework. Here, we compare each technique presenting its main rationale, strengths, and limitations, and provide codes and guidelines for their application to exposome data. Additionally, we apply all selected methods to a real exposome dataset from the Human Early-Life Exposome (HELIX) project, demonstrating their suitability for exposome research. Although the choice of the best method will always depend on the challenges to be faced in each application, for an exposome-like analysis we find dimensionality reduction and bayesian methods such as reduced rank regression (RRR) or multivariate bayesian shrinkage priors (MBSP) particularly useful, given their ability to deal with critical issues such as collinearity, high-dimensionality, missing data or quantification of uncertainty.
Affiliation(s)
- Augusto Anguita-Ruiz
  - ISGlobal, 08003 Barcelona, Spain; CIBEROBN (CIBER Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III, 28029 Madrid, Spain
- Ines Amine
  - University Grenoble Alpes, Inserm U 1209, CNRS UMR 5309, Team of Environmental Epidemiology Applied to the Development and Respiratory Health, Institute for Advanced Biosciences, 38000 Grenoble, France
- Lea Maitre
  - ISGlobal, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), 28029 Madrid, Spain
- Jordi Julvez
  - ISGlobal, 08003 Barcelona, Spain; CIBEROBN (CIBER Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III, 28029 Madrid, Spain; Epidemiology and Environmental Health Joint Research Unit, Foundation for the Promotion of Health and Biomedical Research in the Valencian Region, FISABIO-Public Health, FISABIO-Universitat Jaume I-Universitat de València, Av. Catalunya 21, 46020 Valencia, Spain; Institut d'Investigació Sanitària Pere Virgili (IISPV), Clinical and Epidemiological Neuroscience Group (NeuroÈpia), 43204 Reus (Tarragona), Catalonia, Spain
- Chongliang Luo
  - Division of Public Health Sciences, Washington University School of Medicine in St. Louis, 600 S Taylor Ave, St. Louis, MO 63110, USA
- Mark Nieuwenhuijsen
  - ISGlobal, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), 28029 Madrid, Spain
- Cathrine Thomsen
  - Department of Food Safety, Norwegian Institute of Public Health (NIPH), Oslo, Norway
- Regina Grazuleviciene
  - Department of Environmental Science, Vytautas Magnus University, 44248 Kaunas, Lithuania
- Barbara Heude
  - Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAE, Center for Research in Epidemiology and StatisticS (CRESS), F-75004 Paris, France
- Rosemary McEachan
  - Bradford Institute for Health Research, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
- Marina Vafeiadi
  - Department of Social Medicine, School of Medicine, University of Crete, Heraklion, Crete, Greece
- Leda Chatzi
  - Department of Social Medicine, School of Medicine, University of Crete, Heraklion, Crete, Greece
- John Wright
  - Bradford Institute for Health Research, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
- Tiffany C Yang
  - Bradford Institute for Health Research, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
- Rémy Slama
  - University Grenoble Alpes, Inserm U 1209, CNRS UMR 5309, Team of Environmental Epidemiology Applied to the Development and Respiratory Health, Institute for Advanced Biosciences, 38000 Grenoble, France
- Valérie Siroux
  - University Grenoble Alpes, Inserm U 1209, CNRS UMR 5309, Team of Environmental Epidemiology Applied to the Development and Respiratory Health, Institute for Advanced Biosciences, 38000 Grenoble, France
- Martine Vrijheid
  - ISGlobal, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), 28029 Madrid, Spain
- Xavier Basagaña
  - ISGlobal, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain; CIBER Epidemiología y Salud Pública (CIBERESP), 28029 Madrid, Spain

3. Kim K, Jun TH, Ha BK, Wang S, Sun H. New statistical selection method for pleiotropic variants associated with both quantitative and qualitative traits. BMC Bioinformatics 2023; 24:381. [PMID: 37817069; PMCID: PMC10563219; DOI: 10.1186/s12859-023-05505-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 09/28/2023] [Indexed: 10/12/2023] [Open Access]
Abstract
BACKGROUND Identification of pleiotropic variants associated with multiple phenotypic traits has received increasing attention in genetic association studies. Overlapping genetic associations from multiple traits help to detect weak genetic associations missed by single-trait analyses. Many statistical methods were developed to identify pleiotropic variants with most of them being limited to quantitative traits when pleiotropic effects on both quantitative and qualitative traits have been observed. This is a statistically challenging problem because there does not exist an appropriate multivariate distribution to model both quantitative and qualitative data together. Alternatively, meta-analysis methods can be applied, which basically integrate summary statistics of individual variants associated with either a quantitative or a qualitative trait without accounting for correlations among genetic variants. RESULTS We propose a new statistical selection method based on a unified selection score quantifying how a genetic variant, i.e., a pleiotropic variant associates with both quantitative and qualitative traits. In our extensive simulation studies where various types of pleiotropic effects on both quantitative and qualitative traits were considered, we demonstrated that the proposed method outperforms the existing meta-analysis methods in terms of true positive selection. We also applied the proposed method to a peanut dataset with 6 quantitative and 2 qualitative traits, and a cowpea dataset with 2 quantitative and 6 qualitative traits. We were able to detect some potentially pleiotropic variants missed by the existing methods in both analyses. CONCLUSIONS The proposed method is able to locate pleiotropic variants associated with both quantitative and qualitative traits. It has been implemented into an R package 'UNISS', which can be downloaded from http://github.com/statpng/uniss.
Affiliation(s)
- Kipoong Kim
  - Department of Statistics, Pusan National University, 46241, Busan, Korea
- Tae-Hwan Jun
  - Department of Plant Bioscience, Pusan National University, 50463, Miryang, Korea
- Bo-Keun Ha
  - Department of Applied Plant Science, Chonnam National University, 61186, Gwangju, Korea
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, 10032, USA
- Hokeun Sun
  - Department of Statistics, Pusan National University, 46241, Busan, Korea

4. Guo W, Balakrishnan N, He M. Envelope-based sparse reduced-rank regression for multivariate linear model. J Multivariate Anal 2023. [DOI: 10.1016/j.jmva.2023.105159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]

5. Jiang X, Qiao L, De Leone R, Shen D. Joint selection of brain network nodes and edges for MCI identification. Computer Methods and Programs in Biomedicine 2022; 225:107082. [PMID: 36055040; DOI: 10.1016/j.cmpb.2022.107082] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/31/2022] [Revised: 07/20/2022] [Accepted: 08/22/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Functional brain graph (FBG), by describing the interactions between different brain regions, provides an effective representation of fMRI data for identifying mild cognitive impairment (MCI), an early stage of Alzheimer's Disease (AD). Prior to the identification task, selecting features from the estimated FBG is a necessary step for reducing computational cost, alleviating the risk of overfitting, and finding potential biomarkers of brain diseases. In practice, either node-based features (e.g., local clustering coefficients) or edge-based features (e.g., adjacency weights) are generally considered in current studies. Despite their popularity, these schemes can only capture one granularity (node or edge) of information in the FBG, which might be insufficient for the classification task and the interpretation of the classification result. METHODS To address this issue, in this paper, we propose to jointly select nodes and edges from the estimated FBGs. Specifically, we first assign the edges to different node groups. Then, sparse group least absolute shrinkage and selection operator (sgLASSO) is used to select groups (nodes) and edges in the groups towards a better classification performance. Such a technique enables us to simultaneously locate discriminative brain regions, as well as connections between these brain regions, making the classification results more interpretable. RESULTS Experimental results show that the proposed method achieves better classification performance than state-of-the-art methods. Moreover, by exploring brain network "features" that contributed most to MCI identification, we discover potential biomarkers for MCI diagnosis. CONCLUSION A novel method for jointly selecting nodes and edges from the estimated functional brain graphs (FBGs) is proposed.
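The sparse group lasso step mentioned in this abstract can be illustrated with a minimal sketch of the penalty's proximal operator, the building block of many proximal-gradient solvers. This is not the authors' implementation: the function name `prox_sparse_group_lasso`, the toy grouping of edges by node, and the penalty weights are all assumptions made for illustration, and group-size weights are omitted for brevity.

```python
import numpy as np

def prox_sparse_group_lasso(v, groups, lam_l1, lam_group, step=1.0):
    """Proximal operator of step * (lam_l1 * ||b||_1 + lam_group * sum_g ||b_g||_2).

    `groups` is a list of index lists that partition the coefficients
    (hypothetically: the edges attached to each node).
    """
    v = np.asarray(v, dtype=float)
    # Element-wise soft-thresholding: selects individual edges within groups.
    u = np.sign(v) * np.maximum(np.abs(v) - step * lam_l1, 0.0)
    out = np.zeros_like(u)
    for idx in groups:
        idx = np.asarray(idx)
        norm = np.linalg.norm(u[idx])
        # Group-wise soft-thresholding: drops a whole group (node) when its
        # remaining signal is weaker than the group penalty.
        if norm > step * lam_group:
            out[idx] = (1.0 - step * lam_group / norm) * u[idx]
    return out

# Hypothetical toy input: two node groups with three edge weights each.
v = np.array([3.0, -0.5, 4.0, 0.2, -0.3, 0.1])
groups = [[0, 1, 2], [3, 4, 5]]
shrunk = prox_sparse_group_lasso(v, groups, lam_l1=1.0, lam_group=1.0)
```

In this toy input the second group is zeroed out entirely, while the first group keeps only its two strong entries, which mirrors the joint node-and-edge selection the abstract describes.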
Affiliation(s)
- Xiao Jiang
  - School of Science and Technology, University of Camerino, Camerino, Italy; School of Mathematics Science, Liaocheng University, Liaocheng, China
- Lishan Qiao
  - School of Mathematics Science, Liaocheng University, Liaocheng, China; School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
- Renato De Leone
  - School of Science and Technology, University of Camerino, Camerino, Italy
- Dinggang Shen
  - School of Biomedical Engineering, ShanghaiTech University, Shanghai, China; Department of Research and Development, Shanghai United Imaging Intelligence Co. Ltd., Shanghai, China; Department of Artificial Intelligence, Korea University, Seoul, South Korea

6. Park S, Lee ER, Zhao H. Low-rank regression models for multiple binary responses and their applications to cancer cell-line encyclopedia data. J Am Stat Assoc 2022; 119:202-216. [PMID: 38481466; PMCID: PMC10928550; DOI: 10.1080/01621459.2022.2105704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 07/16/2022] [Indexed: 10/16/2022]
Abstract
In this paper, we study high-dimensional multivariate logistic regression models in which a common set of covariates is used to predict multiple binary outcomes simultaneously. Our work is primarily motivated from many biomedical studies with correlated multiple responses such as the cancer cell-line encyclopedia project. We assume that the underlying regression coefficient matrix is simultaneously low-rank and row-wise sparse. We propose an intuitively appealing selection and estimation framework based on marginal model likelihood, and we develop an efficient computational algorithm for inference. We establish a novel high-dimensional theory for this nonlinear multivariate regression. Our theory is general, allowing for potential correlations between the binary responses. We propose a new type of nuclear norm penalty using the smooth clipped absolute deviation, filling the gap in the related non-convex penalization literature. We theoretically demonstrate that the proposed approach improves estimation accuracy by considering multiple responses jointly through the proposed estimator when the underlying coefficient matrix is low-rank and row-wise sparse. In particular, we establish the non-asymptotic error bounds, and both rank and row support consistency of the proposed method. Moreover, we develop a consistent rule to simultaneously select the rank and row dimension of the coefficient matrix. Furthermore, we extend the proposed methods and theory to a joint Ising model, which accounts for the dependence relationships. In our analysis of both simulated data and the cancer cell line encyclopedia data, the proposed methods outperform the existing methods in better predicting responses.
Affiliation(s)
- Seyoung Park
  - Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
- Eun Ryung Lee
  - Department of Statistics, Sungkyunkwan University, Seoul, 03063, Korea
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, CT, 06511, USA

7. Ke H, Ren Z, Qi J, Chen S, Tseng GC, Ye Z, Ma T. High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression. Bioinformatics 2022; 38:4078-4087. [PMID: 35856716; PMCID: PMC9438953; DOI: 10.1093/bioinformatics/btac518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2022] [Revised: 06/29/2022] [Accepted: 07/19/2022] [Indexed: 12/24/2022] [Open Access]
Abstract
MOTIVATION The advancement of high-throughput technology characterizes a wide variety of epigenetic modifications and noncoding RNAs across the genome involved in disease pathogenesis via regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data make it challenging to identify the important regulators of genes. Conducting univariate test for each possible regulator-gene pair is subject to serious multiple comparison burden, and direct application of regularization methods to select regulator-gene pairs is computationally infeasible. Applying fast screening to reduce dimension first before regularization is more efficient and stable than applying regularization methods alone. RESULTS We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative that it reduces the dimension of both predictor and response, and screens at both node (regulators or genes) and edge (regulator-gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long noncoding RNA and microRNA regulation in Kidney cancer and DNA methylation regulation in Glioblastoma Multiforme illustrate the validity and advantage of our method. AVAILABILITY AND IMPLEMENTATION The R package, related source codes and real datasets used in this article are provided at https://github.com/kehongjie/rPCor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongjie Ke
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA
- Zhao Ren
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Jianfei Qi
  - Department of Biochemistry and Molecular Biology, University of Maryland, Baltimore, MD 21201, USA
- Shuo Chen
  - Department of Epidemiology & Public Health, University of Maryland, Baltimore, MD 21201, USA
- George C Tseng
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Zhenyao Ye
  - Department of Epidemiology & Public Health, University of Maryland, Baltimore, MD 21201, USA
- Tianzhou Ma
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, USA

8. Qian J, Tanigawa Y, Li R, Tibshirani R, Rivas MA, Hastie T. Large-scale multivariate sparse regression with applications to UK Biobank. Ann Appl Stat 2022; 16:1891-1918. [PMID: 36091495; PMCID: PMC9454085; DOI: 10.1214/21-aoas1575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes)-lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present multiSnpnet package, available at http://github.com/junyangq/multiSnpnet that works on top of PLINK2 files, which we anticipate to be a valuable tool for generating polygenic risk scores from human genetic studies.
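The reduced-rank idea in this abstract can be made concrete with a minimal sketch of classical, unpenalized reduced-rank regression with identity weighting. This is not the sparse alternating algorithm or the multiSnpnet package described above: the function name `reduced_rank_regression` and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def reduced_rank_regression(X, Y, rank):
    """Classical reduced-rank regression with identity weighting.

    Fits ordinary least squares, then projects the coefficients onto the
    top-`rank` right singular vectors of the fitted values.
    """
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V_r = Vt[:rank].T                      # (q, rank) leading response directions
    return B_ols @ V_r @ V_r.T             # (p, q) coefficients of rank <= `rank`

# Hypothetical simulated data: 10 features, 6 correlated outcomes, true rank 2.
rng = np.random.default_rng(0)
n, p, q, true_rank = 200, 10, 6, 2
B_true = rng.normal(size=(p, true_rank)) @ rng.normal(size=(true_rank, q))
X = rng.normal(size=(n, p))
Y = X @ B_true + 0.1 * rng.normal(size=(n, q))

B_hat = reduced_rank_regression(X, Y, rank=true_rank)
```

Constraining the rank lets the outcomes share a small number of latent directions, which is the "borrowing strength" the abstract refers to. With `rank` equal to the number of outcomes the estimate reduces to ordinary least squares.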
Affiliation(s)
- Ruilin Li
  - Institute for Computational and Mathematical Engineering, Stanford University
- Manuel A Rivas
  - Department of Biomedical Data Science, Stanford University

9. Sparse reduced-rank regression for simultaneous rank and variable selection via manifold optimization. Comput Stat 2022. [DOI: 10.1007/s00180-022-01216-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
We consider the problem of constructing a reduced-rank regression model whose coefficient parameter is represented as a singular value decomposition with sparse singular vectors. The traditional estimation procedure for the coefficient parameter often fails when the true rank of the parameter is high. To overcome this issue, we develop an estimation algorithm with rank and variable selection via sparse regularization and manifold optimization, which enables us to obtain an accurate estimation of the coefficient parameter even if the true rank of the coefficient parameter is high. Using sparse regularization, we can also select an optimal value of the rank. We conduct Monte Carlo experiments and a real data analysis to illustrate the effectiveness of our proposed method.

10. Xie S, McDonnell E, Wang Y. Conditional Gaussian graphical model for estimating personalized disease symptom networks. Stat Med 2022; 41:543-553. [PMID: 34866214; PMCID: PMC8792223; DOI: 10.1002/sim.9274] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/03/2020] [Revised: 10/13/2021] [Accepted: 11/15/2021] [Indexed: 11/10/2022]
Abstract
The co-occurrence of symptoms may result from the direct interactions between these symptoms and the symptoms can be treated as a system. In addition, subject-specific risk factors (eg, genetic variants, age) can also exert external influence on the system. In this work, we develop a covariate-dependent conditional Gaussian graphical model to obtain personalized symptom networks. The strengths of network connections are modeled as a function of covariates to capture the heterogeneity among individuals and subgroups of individuals. We assess the performance of our proposed method by simulation studies and an application to a large natural history study of Huntington's disease to investigate the networks of symptoms in multiple clinical domains (motor, cognitive, psychiatric) and identify important brain imaging biomarkers that are associated with the connections. We show that the symptoms in the same clinical domain interact more often with each other than cross domains and the psychiatric subnetwork is the densest network. We validate the findings using the subjects' symptom measurements at follow-up visits.
Affiliation(s)
- Shanghong Xie
  - School of Statistics and Center of Statistical Research, Southwestern University of Finance and Economics, Chengdu, China
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, U.S.A
- Erin McDonnell
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, U.S.A
- Yuanjia Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, U.S.A
  - Department of Psychiatry, Columbia University Medical Center, New York, NY, U.S.A

11. Gong Y, Chen Z. A sequential approach to feature selection in high-dimensional additive models. J Stat Plan Inference 2021. [DOI: 10.1016/j.jspi.2021.04.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]

12. Chen Y, Luo Z, Kong L. ℓ2,0-norm based selection and estimation for multivariate generalized linear models. J Multivariate Anal 2021. [DOI: 10.1016/j.jmva.2021.104782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]

13. Molstad AJ, Sun W, Hsu L. A covariance-enhanced approach to multi-tissue joint eQTL mapping with application to transcriptome-wide association studies. Ann Appl Stat 2021; 15:998-1016. [PMID: 34413922; DOI: 10.1214/20-aoas1432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
Transcriptome-wide association studies based on genetically predicted gene expression have the potential to identify novel regions associated with various complex traits. It has been shown that incorporating expression quantitative trait loci (eQTLs) corresponding to multiple tissue types can improve power for association studies involving complex etiology. In this article, we propose a new multivariate response linear regression model and method for predicting gene expression in multiple tissues simultaneously. Unlike existing methods for multi-tissue joint eQTL mapping, our approach incorporates tissue-tissue expression correlation, which allows us to more efficiently handle missing expression measurements and more accurately predict gene expression using a weighted summation of eQTL genotypes. We show through simulation studies that our approach performs better than the existing methods in many scenarios. We use our method to estimate eQTL weights for 29 tissues collected by GTEx, and show that our approach significantly improves expression prediction accuracy compared to competitors. Using our eQTL weights, we perform a multi-tissue-based S-MultiXcan [2] transcriptome-wide association study and show that our method leads to more discoveries in novel regions and more discoveries overall than the existing methods. Estimated eQTL weights and code for implementing the method are available for download online at github.com/ajmolstad/MTeQTLResults.

14. Rauschenberger A, Glaab E. Predicting correlated outcomes from molecular data. Bioinformatics 2021; 37:3889-3895. [PMID: 34358294; DOI: 10.1093/bioinformatics/btab576] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/01/2021] [Revised: 07/14/2021] [Accepted: 08/05/2021] [Indexed: 11/14/2022] [Open Access]
Abstract
MOTIVATION Multivariate (multi-target) regression has the potential to outperform univariate (single-target) regression at predicting correlated outcomes, which frequently occur in biomedical and clinical research. Here we implement multivariate lasso and ridge regression using stacked generalisation. RESULTS Our flexible approach leads to predictive and interpretable models in high-dimensional settings, with a single estimate for each input-output effect. In the simulation, we compare the predictive performance of several state-of-the-art methods for multivariate regression. In the application, we use clinical and genomic data to predict multiple motor and non-motor symptoms in Parkinson's disease patients. We conclude that stacked multivariate regression, with our adaptations, is a competitive method for predicting correlated outcomes. AVAILABILITY AND IMPLEMENTATION The R package joinet is available on GitHub (https://github.com/rauschenberger/joinet) and CRAN (https://cran.r-project.org/package=joinet). SUPPLEMENTARY INFORMATION Supplementary tables and figures are available at Bioinformatics online.
Affiliation(s)
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, 4362, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, 4362, Luxembourg

15. Zhang S, Hu X, Luo Z, Jiang Y, Sun Y, Ma S. Biomarker-guided heterogeneity analysis of genetic regulations via multivariate sparse fusion. Stat Med 2021; 40:3915-3936. [PMID: 33906263; PMCID: PMC8277716; DOI: 10.1002/sim.9006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2020] [Revised: 04/07/2021] [Accepted: 04/07/2021] [Indexed: 11/06/2022]
Abstract
Heterogeneity is a hallmark of many complex diseases. There are multiple ways of defining heterogeneity, among which the heterogeneity in genetic regulations, for example, gene expressions (GEs) by copy number variations (CNVs), and methylation, has been suggested but little investigated. Heterogeneity in genetic regulations can be linked with disease severity, progression, and other traits and is biologically important. However, the analysis can be very challenging with the high dimensionality of both sides of regulation as well as sparse and weak signals. In this article, we consider the scenario where subjects form unknown subgroups, and each subgroup has unique genetic regulation relationships. Further, such heterogeneity is "guided" by a known biomarker. We develop a multivariate sparse fusion (MSF) approach, which innovatively applies the penalized fusion technique to simultaneously determine the number and structure of subgroups and regulation relationships within each subgroup. An effective computational algorithm is developed, and extensive simulations are conducted. The analysis of heterogeneity in the GE-CNV regulations in melanoma and GE-methylation regulations in stomach cancer using the TCGA data leads to interesting findings.
Affiliation(s)
- Sanguo Zhang
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Xiaonan Hu
  - School of Mathematical Sciences, and Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Science, Beijing, China
- Ziye Luo
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale University, Connecticut, USA
|
16
|
Diaz-Ramirez LG, Lee SJ, Smith AK, Gan S, Boscardin WJ. A Novel Method for Identifying a Parsimonious and Accurate Predictive Model for Multiple Clinical Outcomes. Computer Methods and Programs in Biomedicine 2021; 204:106073. [PMID: 33831724 PMCID: PMC8098121 DOI: 10.1016/j.cmpb.2021.106073] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/01/2020] [Accepted: 03/22/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND AND OBJECTIVE Most methods for developing clinical prognostic models focus on identifying parsimonious and accurate models to predict a single outcome; however, patients and providers often want to predict multiple outcomes simultaneously. As an example, for older adults one is often interested in predicting nursing home admission as well as mortality. We propose and evaluate a novel predictor-selection computing method for multiple outcomes and provide the code for its implementation. METHODS Our proposed algorithm selected the best subset of common predictors based on the minimum average normalized Bayesian Information Criterion (BIC) across outcomes: the Best Average BIC (baBIC) method. We compared the predictive accuracy (Harrell's C-statistic) and parsimony (number of predictors) of the model obtained using the baBIC method with: 1) a subset of common predictors obtained from the union of optimal models for each outcome (Union method), 2) a subset obtained from the intersection of optimal models for each outcome (Intersection method), and 3) a model with no variable selection (Full method). We used case-study data from the Health and Retirement Study (HRS) to demonstrate our method and conducted a simulation study to investigate performance. RESULTS In the case-study data and simulations, the average Harrell's C-statistics across outcomes of the models obtained with the baBIC and Union methods were comparable. Despite the similar discrimination, the baBIC method produced more parsimonious models than the Union method. In contrast, the models selected with the Intersection method were the most parsimonious but had the worst predictive accuracy, and the opposite was true for the Full method. In the simulations, the baBIC method performed well by identifying many of the predictors selected in the baBIC model of the case-study data most of the time and excluding those not selected in the majority of the simulations.
CONCLUSIONS Our method identified a common subset of variables to predict multiple clinical outcomes with a better balance between parsimony and predictive accuracy than current methods.
Affiliation(s)
- L Grisell Diaz-Ramirez
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Sei J Lee
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Alexander K Smith
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- Siqi Gan
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
- W John Boscardin
  - Division of Geriatrics, University of California, San Francisco, 490 Illinois Street, Floor 08, Box 1265, San Francisco, CA 94143, United States; San Francisco Veterans Affairs (VA) Medical Center, 4150 Clement Street, 181G, San Francisco, CA 94121, United States.
|
17
|
Zhou Y, Song PXK, Wen X. Structural factor equation models for causal network construction via directed acyclic mixed graphs. Biometrics 2021; 77:573-586. [PMID: 32627167 PMCID: PMC8240035 DOI: 10.1111/biom.13322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/08/2018] [Accepted: 05/29/2020] [Indexed: 11/30/2022]
Abstract
Directed acyclic mixed graphs (DAMGs) provide a useful representation of network topology with both directed and undirected edges subject to the restriction of no directed cycles in the graph. This graphical framework may arise in many biomedical studies, for example, when a directed acyclic graph (DAG) of interest is contaminated with undirected edges induced by some unobserved confounding factors (eg, unmeasured environmental factors). Directed edges in a DAG are widely used to evaluate causal relationships among variables in a network, but detecting them is challenging when the underlying causality is obscured by some shared latent factors. The objective of this paper is to develop an effective structural equation model (SEM) method to extract reliable causal relationships from a DAMG. The proposed approach, termed structural factor equation model (SFEM), uses the SEM to capture the network topology of the DAG while accounting for the undirected edges in the graph with a factor analysis model. The latent factors in the SFEM enable the identification and removal of undirected edges, leading to a simpler and more interpretable causal network. The proposed method is evaluated and compared to existing methods through extensive simulation studies, and illustrated through the construction of gene regulatory networks related to breast cancer.
Affiliation(s)
- Yan Zhou
  - Gilead Sciences, Foster City, California
- Peter X.-K. Song
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
- Xiaoquan Wen
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI
|
18
|
Liu Y, Ye X, Zhan X, Yu CY, Zhang J, Huang K. TPQCI: A topology potential-based method to quantify functional influence of copy number variations. Methods 2021; 192:46-56. [PMID: 33894380 DOI: 10.1016/j.ymeth.2021.04.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/23/2020] [Revised: 04/18/2021] [Accepted: 04/19/2021] [Indexed: 12/21/2022] Open
Abstract
Copy number variation (CNV) is a major type of chromosomal structural variation that plays important roles in many diseases, including cancers. Due to genome instability, a large number of CNV events can be detected in diseases such as cancer. Therefore, it is important to identify the functionally important CNVs in diseases, which currently still poses a challenge in genomics. One of the critical steps in solving this problem is to define the influence of a CNV. In this paper, we provide a topology potential-based method, TPQCI, to quantify this kind of influence by integrating statistics, gene regulatory associations, and biological function information. We used this metric to detect functionally enriched genes on genomic segments with CNV in breast cancer and multiple myeloma and discovered biological functions influenced by CNV. Our results demonstrate that, by using our proposed TPQCI metric, we can detect disease-specific genes that are influenced by CNVs. Source code for TPQCI is available on GitHub (https://github.com/usos/TPQCI).
Affiliation(s)
- Yusong Liu
  - College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, Heilongjiang 150001, China; Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Xiufen Ye
  - College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin, Heilongjiang 150001, China
- Xiaohui Zhan
  - Indiana University School of Medicine, Indianapolis, IN 46202, USA; National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, Guangdong 518037, China; Department of Bioinformatics, School of Basic Medicine, Chongqing Medical University, Chongqing 400016, China
- Christina Y Yu
  - Indiana University School of Medicine, Indianapolis, IN 46202, USA; Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Jie Zhang
  - Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Kun Huang
  - Indiana University School of Medicine, Indianapolis, IN 46202, USA; Regenstrief Institute, Indianapolis, IN 46202, USA.
|
19
|
Mbebi AJ, Tong H, Nikoloski Z. L2,1-norm regularized multivariate regression model with applications to genomic prediction. Bioinformatics 2021; 37:2896-2904. [PMID: 33774677 PMCID: PMC8479665 DOI: 10.1093/bioinformatics/btab212] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/16/2020] [Revised: 03/16/2021] [Accepted: 03/26/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Genomic selection (GS) is currently deemed the most effective approach to speed up breeding of agricultural varieties. It has been recognized that consideration of multiple traits in GS can improve accuracy of prediction for traits of low heritability. However, since GS forgoes statistical testing with the idea of improving predictions, it does not facilitate mechanistic understanding of the contribution of particular single nucleotide polymorphisms (SNPs). RESULTS Here, we propose an L2,1-norm regularized multivariate regression model and devise a fast and efficient iterative optimization algorithm, called L2,1-joint, applicable in multi-trait GS. The usage of the L2,1-norm facilitates variable selection in a penalized multivariate regression that considers the relation between individuals, when the number of SNPs is much larger than the number of individuals. The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits. Our comparative analyses demonstrate that the proposed model is a favorable candidate compared to existing state-of-the-art approaches. Prediction and variable selection with datasets from Brassica napus, wheat and Arabidopsis thaliana diversity panels are conducted to further showcase the performance of the proposed model. AVAILABILITY AND IMPLEMENTATION The model is implemented in the R programming language and the code is freely available from https://github.com/alainmbebi/L21-norm-GS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Alain J Mbebi
  - Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany; Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, 14476 Potsdam-Golm, Germany
- Hao Tong
  - Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany; Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, 14476 Potsdam-Golm, Germany; Center for Plant Systems Biology and Biotechnology, Ruski 139, 4000 Tsentar, Plovdiv, Bulgaria
- Zoran Nikoloski
  - Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany; Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, 14476 Potsdam-Golm, Germany; Center for Plant Systems Biology and Biotechnology, Ruski 139, 4000 Tsentar, Plovdiv, Bulgaria. To whom correspondence should be addressed.
|
20
|
Molstad AJ, Weng G, Doss CR, Rothman AJ. An Explicit Mean-Covariance Parameterization for Multivariate Response Linear Regression. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1853551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
Affiliation(s)
- Aaron J. Molstad
  - Department of Statistics and Genetics Institute, University of Florida, Gainesville, FL
- Guangwei Weng
  - School of Statistics, University of Minnesota, Minneapolis, MN
- Charles R. Doss
  - School of Statistics, University of Minnesota, Minneapolis, MN
- Adam J. Rothman
  - School of Statistics, University of Minnesota, Minneapolis, MN
|
21
|
Zhang J, Oftadeh E. Multivariate variable selection by means of null-beamforming. Electron J Stat 2021. [DOI: 10.1214/21-ejs1859] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
Affiliation(s)
- Jian Zhang
  - School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, Kent CT2 7FS, U.K.
- Elaheh Oftadeh
  - School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, Kent CT2 7FS, U.K.
|
22
|
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is getting increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA.
|
23
|
Guo W, Balakrishnan N, Bian M. Reduced rank regression with matrix projections for high-dimensional multivariate linear regression model. Electron J Stat 2021. [DOI: 10.1214/21-ejs1895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Wenxing Guo
  - Department of Mathematics and Statistics, McMaster University, Hamilton, ON, L8S 4K1, Canada
- Mengjie Bian
  - Department of Mathematics and Statistics, McMaster University, Hamilton, ON, L8S 4K1, Canada
|
24
|
Mokhtaridoost M, Gönen M. An efficient framework to identify key miRNA-mRNA regulatory modules in cancer. Bioinformatics 2020; 36:i592-i600. [PMID: 33381822 DOI: 10.1093/bioinformatics/btaa798] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Micro-RNAs (miRNAs) are known as the important components of RNA silencing and post-transcriptional gene regulation, and they interact with messenger RNAs (mRNAs) either by degradation or by translational repression. miRNA alterations have a significant impact on the formation and progression of human cancers. Accordingly, it is important to establish computational methods with high predictive performance to identify cancer-specific miRNA-mRNA regulatory modules. RESULTS We presented a two-step framework to model miRNA-mRNA relationships and identify cancer-specific modules between miRNAs and mRNAs from their matched expression profiles of more than 9000 primary tumors. We first estimated the regulatory matrix between miRNA and mRNA expression profiles by solving multiple linear programming problems. We then formulated a unified regularized factor regression (RFR) model that simultaneously estimates the effective number of modules (i.e. latent factors) and extracts modules by decomposing regulatory matrix into two low-rank matrices. Our RFR model groups correlated miRNAs together and correlated mRNAs together, and also controls sparsity levels of both matrices. These attributes lead to interpretable results with high predictive performance. We applied our method on a very comprehensive data collection by including 32 TCGA cancer types. To find the biological relevance of our approach, we performed functional gene set enrichment and survival analyses. A large portion of the identified modules are significantly enriched in Hallmark, PID and KEGG pathways/gene sets. To validate the identified modules, we also performed literature validation as well as validation using experimentally supported miRTarBase database. AVAILABILITY AND IMPLEMENTATION Our implementation of proposed two-step RFR algorithm in R is available at https://github.com/MiladMokhtaridoost/2sRFR together with the scripts that replicate the reported experiments. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mehmet Gönen
  - Department of Industrial Engineering, College of Engineering, İstanbul 34450, Turkey; School of Medicine, Koç University, İstanbul 34450, Turkey; Department of Biomedical Engineering, School of Medicine, Oregon Health & Science University, Portland, OR 97239, USA
|
25
|
Cho SB. Set-Wise Differential Interaction Between Copy Number Alterations and Gene Expressions of Lower-Grade Glioma Reveals Prognosis-Associated Pathways. Entropy 2020; 22:e22121434. [PMID: 33353229 PMCID: PMC7765960 DOI: 10.3390/e22121434] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/22/2020] [Revised: 11/30/2020] [Accepted: 12/16/2020] [Indexed: 12/22/2022]
Abstract
The integrative analysis of copy number alteration (CNA) and gene expression (GE) is an essential part of cancer research considering the impact of CNAs on cancer progression and prognosis. In this research, an integrative analysis was performed with generalized differentially coexpressed gene sets (gdCoxS), which is a modification of dCoxS. In gdCoxS, set-wise interaction is measured using the correlation of sample-wise distances with Rényi’s relative entropy, which requires an estimation of sample density based on omics profiles. To capture correlations between the variables, multivariate density estimation with covariance was applied. In the simulation study, gdCoxS achieved higher power than dCoxS, which does not explicitly use the correlations in the density estimation. In the analysis of the lower-grade glioma data of The Cancer Genome Atlas program (TCGA-LGG), gdCoxS identified 577 pathway CNA-GE pairs that showed significant changes of interaction between the survival and non-survival groups, while other benchmark methods detected fewer such pathways. The biological implications of the significant pathways were consistent with previous reports on TCGA-LGG. Taken together, gdCoxS is a useful method for an integrative analysis of CNAs and GEs.
Affiliation(s)
- Seong Beom Cho
  - Department of Biomedical Informatics, College of Medicine, Gachon University, Seongnam-Daero 1342, Korea
|
26
|
Wang H, Wu Y, Fang R, Sa J, Li Z, Cao H, Cui Y. Time-Varying Gene Network Analysis of Human Prefrontal Cortex Development. Front Genet 2020; 11:574543. [PMID: 33304381 PMCID: PMC7701309 DOI: 10.3389/fgene.2020.574543] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/20/2020] [Accepted: 10/19/2020] [Indexed: 11/13/2022] Open
Abstract
The prefrontal cortex (PFC) constitutes a large part of the human central nervous system and is essential for the normal social affection and executive function of humans and other primates. Despite ongoing research in this region, the development of interactions between PFC genes over the lifespan is still unknown. To investigate the conversion of PFC gene interaction networks and further identify hub genes, we obtained time-series gene expression data of human PFC tissues from the Gene Expression Omnibus (GEO) database. A statistical model, loggle, was used to construct time-varying networks and several common network attributes were used to explore the development of PFC gene networks with age. Network similarity analysis showed that the development of human PFC is divided into three stages, namely, fast development period, deceleration to stationary period, and recession period. We identified some genes related to PFC development at these different stages, including genes involved in neuronal differentiation or synapse formation, genes involved in nerve impulse transmission, and genes involved in the development of myelin around neurons. Some of these genes are consistent with findings in previous reports. At the same time, we explored the development of several known KEGG pathways in PFC and corresponding hub genes. This study clarified the development trajectory of the interaction between PFC genes, and proposed a set of candidate genes related to PFC development, which helps further study of human brain development at the genomic level supplemental to regular anatomical analyses. The analytical process used in this study, involving the loggle model, similarity analysis, and central analysis, provides a comprehensive strategy to gain novel insights into the evolution and development of brain networks in other organisms.
Affiliation(s)
- Huihui Wang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yongqing Wu
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Ruiling Fang
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Jian Sa
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Zhi Li
  - Department of Hematology, Taiyuan Central Hospital of Shanxi Medical University, Taiyuan, China
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, United States
|
27
|
Huang S, Blatti C, Sinha S, Parameswaran A. Uncovering Effective Explanations for Interactive Genomic Data Analysis. Patterns 2020; 1:100093. [PMID: 33205133 PMCID: PMC7660438 DOI: 10.1016/j.patter.2020.100093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2020] [Revised: 07/13/2020] [Accepted: 08/05/2020] [Indexed: 10/25/2022]
|
28
|
Liu H, Sunil Rao J. Generalized finite mixture of multivariate regressions with applications to therapeutic biomarker identification. Stat Med 2020; 39:4301-4324. [DOI: 10.1002/sim.8726] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/21/2018] [Revised: 06/01/2020] [Accepted: 07/19/2020] [Indexed: 11/06/2022]
Affiliation(s)
- Hongmei Liu
  - Division of Biostatistics, University of Miami, Coral Gables, Florida, USA
- J. Sunil Rao
  - Division of Biostatistics, University of Miami, Coral Gables, Florida, USA
|
29
|
Feng Y, Xiao L, Chi EC. Sparse Single Index Models for Multivariate Responses. J Comput Graph Stat 2020; 30:115-124. [PMID: 34025100 DOI: 10.1080/10618600.2020.1779080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
Joint models are popular for analyzing data with multivariate responses. We propose a sparse multivariate single index model, where responses and predictors are linked by unspecified smooth functions and multiple matrix level penalties are employed to select predictors and induce low-rank structures across responses. An alternating direction method of multipliers (ADMM) based algorithm is proposed for model estimation. We demonstrate the effectiveness of proposed model in simulation studies and an application to a genetic association study.
Affiliation(s)
- Yuan Feng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Luo Xiao
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
- Eric C Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203
|
30
|
Hilafu H, Safo SE, Haine L. Sparse reduced-rank regression for integrating omics data. BMC Bioinformatics 2020; 21:283. [PMID: 32620072 PMCID: PMC7333421 DOI: 10.1186/s12859-020-03606-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/11/2020] [Accepted: 06/16/2020] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND The problem of assessing associations between multiple omics data, including genomics and metabolomics data, to identify biomarkers potentially predictive of complex diseases has garnered considerable research interest. A popular epidemiology approach is to consider the association of each of the predictors with each of the responses using a univariate linear regression model, and to select predictors that meet an a priori specified significance level. Although this approach is simple and intuitive, it tends to require a larger sample size, which is costly. It also assumes variables for each data type are independent, and thus ignores correlations that exist between variables both within each data type and across the data types. RESULTS We consider a multivariate linear regression model that relates multiple predictors with multiple responses, and identify multiple relevant predictors that are simultaneously associated with the responses. We assume the coefficient matrix of the responses on the predictors is both row-sparse and of low rank, and propose a group Dantzig type formulation to estimate the coefficient matrix. CONCLUSION Extensive simulations demonstrate the competitive performance of our proposed method when compared to existing methods in terms of estimation, prediction, and variable selection. We use the proposed method to integrate genomics and metabolomics data to identify genetic variants that are potentially predictive of atherosclerotic cardiovascular disease (ASCVD) beyond well-established risk factors. Our analysis shows some genetic variants that increase prediction of ASCVD beyond some well-established factors of ASCVD, and also suggests a potential utility of the identified genetic variants in explaining possible association between certain metabolites and ASCVD.
Affiliation(s)
- Haileab Hilafu
  - Department of Business Analytics and Statistics, University of Tennessee, Knoxville, 37996 TN USA
- Sandra E. Safo
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
- Lillian Haine
  - Division of Biostatistics, University of Minnesota, Minneapolis, 55455 MN USA
|
31
|
Alpay BA, Demetci P, Istrail S, Aguiar D. Combinatorial and statistical prediction of gene expression from haplotype sequence. Bioinformatics 2020; 36:i194-i202. [PMID: 32657373 PMCID: PMC7355230 DOI: 10.1093/bioinformatics/btaa318] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation. RESULTS In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods.
We show that variant and haplotype features selected by HAPLEXR are smaller in size than competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage dependent and intragenic epistatic effects when predicting expression. AVAILABILITY AND IMPLEMENTATION Source code and binaries are freely available at https://github.com/rapturous/HAPLEX. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Berk A Alpay
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| | - Pinar Demetci
- Department of Computer Science and Center for Computational Biology, Brown University, Providence, RI 02912, USA
| | - Sorin Istrail
- Department of Computer Science and Center for Computational Biology, Brown University, Providence, RI 02912, USA
| | - Derek Aguiar
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA
| |
|
32
|
Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data. Statistics in Biosciences 2020; 12:376-398. [PMID: 33796162 DOI: 10.1007/s12561-020-09284-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
The threshold regression model is an effective alternative to the Cox proportional hazards regression model when the proportional hazards assumption is not met. This paper considers variable selection for threshold regression. This model has separate regression functions for the initial health status and the speed of degradation in health. This flexibility is an important advantage when considering relevant risk factors for a complex time-to-event model where one needs to decide which variables should be included in the regression function for the initial health status, in the function for the speed of degradation in health, or in both functions. In this paper, we extend the broken adaptive ridge (BAR) method, originally designed for variable selection for one regression function, to simultaneous variable selection for both regression functions needed in the threshold regression model. We establish variable selection consistency of the proposed method and asymptotic normality of the estimator of non-zero regression coefficients. Simulation results show that our method outperformed threshold regression without variable selection and variable selection based on the Akaike information criterion. We apply the proposed method to data from an HIV drug adherence study in which electronic monitoring of drug intake is used to identify risk factors for non-adherence.
|
33
|
|
34
|
Kong D, An B, Zhang J, Zhu H. L2RM: Low-rank Linear Regression Models for High-dimensional Matrix Responses. J Am Stat Assoc 2020; 115:403-424. [PMID: 33408427 PMCID: PMC7781207 DOI: 10.1080/01621459.2018.1555092] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 07/25/2017] [Revised: 11/11/2018] [Accepted: 11/26/2018] [Indexed: 10/27/2022]
Abstract
The aim of this paper is to develop a low-rank linear regression model (L2RM) to correlate a high-dimensional response matrix with a high dimensional vector of covariates when coefficient matrices have low-rank structures. We propose a fast and efficient screening procedure based on the spectral norm of each coefficient matrix in order to deal with the case when the number of covariates is extremely large. We develop an efficient estimation procedure based on the trace norm regularization, which explicitly imposes the low rank structure of coefficient matrices. When both the dimension of response matrix and that of covariate vector diverge at the exponential order of the sample size, we investigate the sure independence screening property under some mild conditions. We also systematically investigate some theoretical properties of our estimation procedure including estimation consistency, rank consistency and non-asymptotic error bound under some mild conditions. We further establish a theoretical guarantee for the overall solution of our two-step screening and estimation procedure. We examine the finite-sample performance of our screening and estimation methods using simulations and a large-scale imaging genetic dataset collected by the Philadelphia Neurodevelopmental Cohort (PNC) study.
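As a rough illustration of why trace norm (nuclear norm) regularization produces low-rank estimates, the sketch below implements singular value soft-thresholding, the standard proximal operator of the nuclear norm. It is not the L2RM estimation procedure itself; the function name `svt`, the threshold `tau`, and the simulated matrices are assumptions made only for this example.

```python
import numpy as np

def svt(matrix, tau):
    """Singular value soft-thresholding: the proximal operator of tau * nuclear norm."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # shrink every singular value toward zero
    return (u * s_shrunk) @ vt           # rebuild the matrix from the shrunken spectrum

# Hypothetical data: a rank-2 signal plus small Gaussian noise.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
noisy = low_rank + 0.1 * rng.normal(size=(20, 15))

denoised = svt(noisy, tau=2.0)
print("rank before:", np.linalg.matrix_rank(noisy))
print("rank after: ", np.linalg.matrix_rank(denoised))
```

Singular values below `tau` are set exactly to zero, which is how a nuclear norm penalty can reduce the rank of a coefficient matrix rather than merely shrinking its entries.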
Affiliation(s)
- Dehan Kong
- Department of Statistical Sciences, University of Toronto
| | - Baiguo An
- School of Statistics, Capital University of Economics and Business
| | - Jingwen Zhang
- Department of Biostatistics, University of North Carolina at Chapel Hill
| | - Hongtu Zhu
- Department of Biostatistics, University of North Carolina at Chapel Hill
| |
|
35
|
Oda R, Yanagihara H. A fast and consistent variable selection method for high-dimensional multivariate linear regression with a large number of explanatory variables. Electron J Stat 2020. [DOI: 10.1214/20-ejs1701] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/19/2022]
|
36
|
Navon A, Rosset S. Capturing between-tasks covariance and similarities using multivariate linear mixed models. Electron J Stat 2020. [DOI: 10.1214/20-ejs1764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
37
|
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022] Open
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Courtney R Armour
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Chenxiao Hu
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Meng Mei
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Chuan Tian
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| | - Thomas J Sharpton
- Department of Statistics, Oregon State University, Corvallis, OR, United States
- Department of Microbiology, Oregon State University, Corvallis, OR, United States
| | - Yuan Jiang
- Department of Statistics, Oregon State University, Corvallis, OR, United States
| |
|
38
|
Fang K, Zhang X, Ma S, Zhang Q. Smooth and Locally Sparse Estimation for Multiple-Output Functional Linear Regression. J Stat Comput Sim 2019; 90:341-354. [PMID: 33012883 DOI: 10.1080/00949655.2019.1680676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Abstract
Functional data analysis has attracted substantial research interest, and the goal of functional sparsity is to produce a sparse estimate that assigns zero values over regions where the true underlying function is zero, i.e., where there is no relationship between the response variable and the predictor variable. In this paper, we consider a functional linear regression model that explicitly incorporates the interconnections among the responses. We propose a locally sparse (i.e., zero on some subregions) estimator, the multiple-smooth and locally sparse (m-SLoS) estimator, for coefficient functions based on the interconnections among the responses. This method combines the smooth and locally sparse (SLoS) estimator with a Laplacian quadratic penalty function: SLoS encourages local sparsity, and the Laplacian quadratic penalty promotes similar local sparsity among coefficient functions associated with the interconnections among the responses. Simulations show excellent numerical performance of the proposed method in terms of the estimation of coefficient functions, especially when the coefficient functions are the same for all multivariate responses. The practical merit of this modeling is demonstrated by one real application, and the prediction shows significant improvements.
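To make the Laplacian quadratic penalty mentioned in this abstract more concrete, here is a minimal sketch on scalar coefficients, one per response. It relies on the standard identity that, for a graph Laplacian L = D - A, the quadratic form βᵀLβ equals the sum over linked pairs of a_ij (β_i - β_j)². The three-response graph and the coefficient values are hypothetical, and the paper applies the idea to coefficient functions rather than scalars.

```python
import numpy as np

# Hypothetical similarity graph over three responses: 0-1 and 1-2 are linked.
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency  # L = D - A

def laplacian_penalty(beta, laplacian):
    """Quadratic form beta^T L beta."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ laplacian @ beta)

def pairwise_penalty(beta, adjacency):
    """Equivalent form: sum over linked pairs of a_ij * (beta_i - beta_j)^2."""
    beta = np.asarray(beta, dtype=float)
    diffs = beta[:, None] - beta[None, :]
    return float(0.5 * np.sum(adjacency * diffs ** 2))  # 0.5 because each pair is counted twice

beta = np.array([1.0, 1.0, 3.0])
penalty = laplacian_penalty(beta, laplacian)
print(penalty, pairwise_penalty(beta, adjacency))
```

The penalty is zero only when linked responses share the same coefficient, which is why it pulls the estimates for connected responses toward each other.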
Affiliation(s)
- Kuangnan Fang
- Department of Statistics, School of Economics, Xiamen University, China; Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China
| | - Xiaochen Zhang
- Department of Statistics, School of Economics, Xiamen University, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, USA
| | - Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, China; Key Laboratory of Econometrics, Ministry of Education, Xiamen University, China; The Wang Yanan Institute for Studies in Economics, Xiamen University, China
| |
|
39
|
Lu M. An embedded method for gene identification problems involving unwanted data heterogeneity. Hum Genomics 2019; 13:45. [PMID: 31639059 PMCID: PMC6805328 DOI: 10.1186/s40246-019-0228-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Modern applications such as bioinformatics, which collect data in various ways, can easily result in heterogeneous data. Traditional variable selection methods assume samples are independent and identically distributed, which is not suitable for these applications. Some existing statistical models capable of taking care of unwanted variation were developed for gene identification involving heterogeneous data, but they lack model predictability and suffer from variable redundancy. RESULTS By accounting for the unwanted heterogeneity effectively, our method has shown its superiority over several state-of-the-art methods, which is validated by the experimental results in both unsupervised and supervised gene identification problems. Moreover, we also applied our method to a pan-cancer study, where it identifies the most discriminative genes that best distinguish different cancer types. CONCLUSIONS This article provides an alternative gene identification method that can account for unwanted data heterogeneity. It is a promising method to provide new insights into the complex cancer biology and clues for understanding tumorigenesis and tumor progression.
Affiliation(s)
- Meng Lu
- Department of Information Management, Tianjin University, Tianjin, China
| |
|
40
|
Zhou S, Zhou J, Zhang B. Overlapping group lasso for high-dimensional generalized linear models. Commun Stat-Theor M 2019. [DOI: 10.1080/03610926.2018.1500604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Affiliation(s)
- Shengbin Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
| | - Jingke Zhou
- Department of Statistics, Harbin Normal University, Harbin, China
| | - Bo Zhang
- Department of Statistics, Harbin Normal University, Harbin, China
| |
|
41
|
Newcombe PJ, Nelson CP, Samani NJ, Dudbridge F. A flexible and parallelizable approach to genome-wide polygenic risk scores. Genet Epidemiol 2019; 43:730-741. [PMID: 31328830 PMCID: PMC6764842 DOI: 10.1002/gepi.22245] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/25/2018] [Revised: 05/03/2019] [Accepted: 05/30/2019] [Indexed: 01/06/2023]
Abstract
The heritability of most complex traits is driven by variants throughout the genome. Consequently, polygenic risk scores, which combine information on multiple variants genome-wide, have demonstrated improved accuracy in genetic risk prediction. We present a new two-step approach to constructing genome-wide polygenic risk scores from meta-GWAS summary statistics. Local linkage disequilibrium (LD) is adjusted for in Step 1, followed by, uniquely, long-range LD in Step 2. Our algorithm is highly parallelizable since block-wise analyses in Step 1 can be distributed across a high-performance computing cluster, and flexible, since sparsity and heritability are estimated within each block. Inference is obtained through a formal Bayesian variable selection framework, meaning final risk predictions are averaged over competing models. We compared our method to two alternative approaches, LDPred and lassosum, using all seven traits in the Wellcome Trust Case Control Consortium as well as meta-GWAS summaries for type 1 diabetes (T1D), coronary artery disease, and schizophrenia. Performance was generally similar across methods, although our framework provided more accurate predictions for T1D, for which there are multiple heterogeneous signals in regions of both short- and long-range LD. With sufficient compute resources, our method also allows the fastest runtimes.
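For readers unfamiliar with the basic object being constructed, a polygenic risk score is conventionally a weighted sum of allele dosages across variants. The sketch below shows only that final scoring step; it does not implement the two-step LD adjustment or the Bayesian model averaging described in the abstract, and the dosage matrix and per-variant weights are made-up values for illustration.

```python
import numpy as np

def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages: one score per individual (row)."""
    dosages = np.asarray(dosages, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return dosages @ weights

# Hypothetical data: 2 individuals x 3 variants, dosages coded as 0, 1, or 2 risk alleles.
dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
# Hypothetical per-variant effect estimates (e.g. LD-adjusted log odds ratios).
weights = np.array([0.10, -0.05, 0.20])

scores = polygenic_risk_score(dosages, weights)
print(scores)
```

Methods in this area differ mainly in how the weights are estimated from summary statistics; the scoring step itself stays this simple.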
Affiliation(s)
- Paul J. Newcombe
- MRC Biostatistics Unit, School of Clinical Medicine, Cambridge Institute of Public Health, Cambridge Biomedical Campus, Cambridge, UK
| | - Christopher P. Nelson
- Department of Cardiovascular Sciences, Cardiovascular Research Centre, Glenfield Hospital, University of Leicester, Leicester, UK
- NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK
| | - Nilesh J. Samani
- Department of Cardiovascular Sciences, Cardiovascular Research Centre, Glenfield Hospital, University of Leicester, Leicester, UK
- NIHR Leicester Biomedical Research Centre, Glenfield Hospital, Leicester, UK
| | - Frank Dudbridge
- Department of Health Sciences, Centre for Medicine, University of Leicester, Leicester, UK
| |
|
42
|
Liang X, Young WC, Hung LH, Raftery AE, Yeung KY. Integration of Multiple Data Sources for Gene Network Inference Using Genetic Perturbation Data. J Comput Biol 2019; 26:1113-1129. [PMID: 31009236 PMCID: PMC6786343 DOI: 10.1089/cmb.2019.0036] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022] Open
Abstract
The inference of gene networks from large-scale human genomic data is challenging due to the difficulty in identifying correct regulators for each gene in a high-dimensional search space. We present a Bayesian approach integrating external data sources with knockdown data from human cell lines to infer gene regulatory networks. In particular, we assemble multiple data sources, including gene expression data, genome-wide binding data, gene ontology, and known pathways, and use a supervised learning framework to compute prior probabilities of regulatory relationships. We show that our integrated method improves the accuracy of inferred gene networks as well as extends some previous Bayesian frameworks both in theory and applications. We apply our method to two different human cell lines, namely skin melanoma cell line A375 and lung cancer cell line A549, to illustrate the capabilities of our method. Our results show that the improvement in performance could vary from cell line to cell line and that we might need to choose different external data sources serving as prior knowledge if we hope to obtain better accuracy for different cell lines.
Affiliation(s)
- Xiao Liang
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia
| | - William Chad Young
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
| | - Ling-Hong Hung
- School of Engineering and Technology, University of Washington, Tacoma, Washington
| | - Adrian E. Raftery
- Department of Statistics, University of Washington, Seattle, Washington
| | - Ka Yee Yeung
- School of Engineering and Technology, University of Washington, Tacoma, Washington
| |
|
43
|
Yang J, Peng J. Estimating Time-Varying Graphical Models. J Comput Graph Stat 2019; 29:191-202. [PMID: 33828398 PMCID: PMC8023339 DOI: 10.1080/10618600.2019.1647848] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2018] [Revised: 06/28/2019] [Accepted: 07/17/2019] [Indexed: 10/26/2022]
Abstract
In this paper, we study time-varying graphical models based on data measured over a temporal grid. Such models are motivated by the needs to describe and understand evolving interacting relationships among a set of random variables in many real applications, for instance the study of how stock prices interact with each other and how such interactions change over time. We propose a new model, LOcal Group Graphical Lasso Estimation (loggle), under the assumption that the graph topology changes gradually over time. Specifically, loggle uses a novel local group-lasso type penalty to efficiently incorporate information from neighboring time points and to impose structural smoothness of the graphs. We implement an ADMM based algorithm to fit the loggle model. This algorithm utilizes blockwise fast computation and pseudo-likelihood approximation to improve computational efficiency. An R package loggle has also been developed and is available on https://cran.r-project.org/. We evaluate the performance of loggle by simulation experiments. We also apply loggle to S&P 500 stock price data and demonstrate that loggle is able to reveal the interacting relationships among stock prices and among industrial sectors in a time period that covers the recent global financial crisis. The supplemental materials for this paper are also available online.
Affiliation(s)
- Jilei Yang
- Department of Statistics, University of California, Davis
| | - Jie Peng
- Department of Statistics, University of California, Davis
| |
|
44
|
Petralia F, Wang L, Peng J, Yan A, Zhu J, Wang P. A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity. Bioinformatics 2019; 34:i528-i536. [PMID: 29949994 PMCID: PMC6022554 DOI: 10.1093/bioinformatics/bty280] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 11/14/2022] Open
Abstract
Motivation Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose Tumor Specific Net (TSNet), a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models tumor purity percentage in each tumor sample. Results Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for TPH. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells. Availability and implementation R codes can be found at https://github.com/petraf01/TSNet. Supplementary information Supplementary data are available at Bioinformatics online.
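The motivating claim, that ignoring purity creates edges between genes that merely differ in mean between tumor and non-tumor cells, can be reproduced with a small simulation. This is a toy sketch and not TSNet: the two-component linear mixture, the purity range, and the means and variances are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 2000

# Hypothetical tumor purity per sample, varying between 20% and 90%.
purity = rng.uniform(0.2, 0.9, size=n_samples)

# Two genes that are independent within each cell type,
# but whose mean expression is shifted in tumor cells (5 versus 0).
tumor = rng.normal(loc=5.0, scale=1.0, size=(n_samples, 2))
non_tumor = rng.normal(loc=0.0, scale=1.0, size=(n_samples, 2))

# Observed bulk expression modeled as a purity-weighted mixture of the two cell types.
observed = purity[:, None] * tumor + (1.0 - purity[:, None]) * non_tumor

corr_tumor = np.corrcoef(tumor, rowvar=False)[0, 1]
corr_observed = np.corrcoef(observed, rowvar=False)[0, 1]
print(f"correlation in pure tumor cells: {corr_tumor:.2f}")
print(f"correlation in mixed samples:    {corr_observed:.2f}")
```

Because both genes track the shared purity value, the mixed samples show a strong correlation even though the genes are independent within tumor cells, which is the kind of spurious edge a purity-aware model is meant to avoid.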
Affiliation(s)
- Francesca Petralia
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Li Wang
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4, a Mount Sinai Venture, Stamford, CT, USA
| | - Jie Peng
- Department of Statistics, University of California, Davis, Davis, CA, USA
| | - Arthur Yan
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Jun Zhu
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Sema4, a Mount Sinai Venture, Stamford, CT, USA
| | - Pei Wang
- Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| |
|
45
|
Ma W, Chen LS, Özbek U, Han SW, Lin C, Paulovich AG, Zhong H, Wang P. Integrative Proteo-genomic Analysis to Construct CNA-protein Regulatory Map in Breast and Ovarian Tumors. Mol Cell Proteomics 2019; 18:S66-S81. [PMID: 31281117 PMCID: PMC6692778 DOI: 10.1074/mcp.ra118.001229] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/21/2018] [Revised: 07/01/2019] [Indexed: 12/16/2022] Open
Abstract
Recent developments in high-throughput proteomics and genomics profiling enable one to study regulations of genome alterations on protein activities in a systematic manner. In this article, we propose a new statistical method, ProMAP, to systematically characterize the regulatory relationships between proteins and DNA copy number alterations (CNA) in breast and ovarian tumors based on proteogenomic data from the CPTAC-TCGA studies. Because of the dynamic nature of mass spectrometry instruments, proteomics data from labeled mass spectrometry experiments usually have non-ignorable batch effects. Moreover, mass spectrometry based proteomic data often possess high percentages of missing values and non-ignorable missing-data patterns. Thus, we use a linear mixed effects model to account for the batch structure and explicitly incorporate the abundance-dependent-missing-data mechanism of proteomic data in ProMAP. In addition, we employ a multivariate regression framework to characterize the multiple-to-multiple regulatory relationships between CNA and proteins. Further, we use proper statistical regularization to facilitate the detection of master genetic regulators, which affect the activities of many proteins and often play important roles in genetic regulatory networks. Improved performance of ProMAP over existing methods was illustrated through extensive simulation studies and real data examples. Applying ProMAP to the CPTAC-TCGA breast and ovarian cancer data sets, we identified many genome regions, including a few novel ones, whose CNA were associated with protein and/or phosphoprotein abundances. For example, in breast tumors, a small region in 8p11.21 was recognized as the second biggest hub in the CNA-phosphoprotein regulatory map, and further investigation of the regulatory targets suggests the potential role of 8p11.21 CNA in perturbing oxygen binding and transport activities in tumor cells.
This and other findings from our analyses help to characterize the impacts of CNAs on protein activity landscapes and cast light on the genetic regulation mechanisms underlying these tumors.
Affiliation(s)
- Weiping Ma
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029
| | - Lin S. Chen
- Department of Public Health Sciences, University of Chicago, Chicago, IL 60637
| | - Umut Özbek
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, New York 10029
| | - Sung Won Han
- School of Industrial Management Engineering, Korea University, 145, Anam-ro, Seongbuk-gu, Seoul, 02841, Rep. of Korea
| | - Chenwei Lin
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109–1024
| | - Amanda G. Paulovich
- Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109–1024
| | - Hua Zhong
- Division of Biostatistics, Department of Population Health, New York University, New York, New York 10016
| | - Pei Wang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York 10029
| |
|
46
|
Uematsu Y, Fan Y, Chen K, Lv J, Lin W. SOFAR: Large-Scale Association Network Learning. IEEE Transactions on Information Theory 2019; 65:4924-4939. [PMID: 33746241 PMCID: PMC7970712 DOI: 10.1109/tit.2019.2909889] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks such as biclustering with sparse singular value decomposition, sparse principal component analysis, sparse factor analysis, and sparse vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence property. Both computational and theoretical advantages of our procedure are demonstrated with several simulations and real data examples.
Affiliation(s)
- Yoshimasa Uematsu
- Assistant Professor, Department of Economics and Management, Tohoku University, Sendai 980-8576, Japan
| | - Yingying Fan
- Dean's Associate Professor in Business Administration, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Kun Chen
- Associate Professor, Department of Statistics, University of Connecticut, Storrs, CT 06269
| | - Jinchi Lv
- Kenneth King Stonier Chair in Business Administration and Professor, Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Wei Lin
- Assistant Professor, School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China 100871
| |
Collapse
|
47
|
Luo S, Chen Z. Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1609972] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
Affiliation(s)
- Shan Luo, Department of Statistics, Shanghai Jiao Tong University, Shanghai, China
- Zehua Chen, Department of Statistics & Applied Probability, National University of Singapore, Singapore
|
48
|
Li G, Liu X, Chen K. Integrative multi-view regression: Bridging group-sparse and low-rank models. Biometrics 2019; 75:593-602. [PMID: 30456759 PMCID: PMC6849205 DOI: 10.1111/biom.13006] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 03/13/2018] [Accepted: 10/24/2018] [Indexed: 11/30/2022]
Abstract
Multi-view data have been routinely collected in various fields of science and engineering. A general problem is to study the predictive association between multivariate responses and multi-view predictor sets, all of which can be of high dimensionality. It is likely that only a few views are relevant to prediction, and the predictors within each relevant view contribute to the prediction collectively rather than sparsely. We cast this new problem under the familiar multivariate regression framework and propose an integrative reduced-rank regression (iRRR), where each view has its own low-rank coefficient matrix. As such, latent features are extracted from each view in a supervised fashion. For model estimation, we develop a convex composite nuclear norm penalization approach, which admits an efficient algorithm via the alternating direction method of multipliers. Extensions to non-Gaussian and incomplete data are discussed. Theoretically, we derive non-asymptotic oracle bounds for iRRR under a restricted eigenvalue condition. Our results recover oracle bounds of several special cases of iRRR including Lasso, group Lasso, and nuclear norm penalized regression. Therefore, iRRR seamlessly bridges group-sparse and low-rank methods and can achieve a substantially faster convergence rate under realistic settings of multi-view learning. Simulation studies and an application in the Longitudinal Studies of Aging further showcase the efficacy of the proposed methods.
Affiliation(s)
- Gen Li, Department of Biostatistics, Columbia University, New York
- Xiaokang Liu, Department of Statistics, University of Connecticut, Storrs, Connecticut
- Kun Chen, Department of Statistics, University of Connecticut, Storrs, Connecticut
|
49
|
Deshpande SK, Ročková V, George EI. Simultaneous Variable and Covariance Selection With the Multivariate Spike-and-Slab LASSO. J Comput Graph Stat 2019. [DOI: 10.1080/10618600.2019.1593179] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 10/27/2022]
Affiliation(s)
- Sameer K. Deshpande, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Veronika Ročková, Department of Econometrics and Statistics at Booth School of Business, University of Chicago, Chicago, IL
- Edward I. George, Department of Statistics, University of Pennsylvania, Philadelphia, PA
|
50
|
Ren J, Du Y, Li S, Ma S, Jiang Y, Wu C. Robust network-based regularization and variable selection for high-dimensional genomic data in cancer prognosis. Genet Epidemiol 2019; 43:276-291. [PMID: 30746793 PMCID: PMC6446588 DOI: 10.1002/gepi.22194] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 08/05/2018] [Revised: 11/19/2018] [Accepted: 11/29/2018] [Indexed: 12/21/2022]
Abstract
In cancer genomic studies, an important objective is to identify prognostic markers associated with patients' survival. Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data, because of its ability to incorporate the correlations among genomic features. However, as survival time data usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identifications of network structure and biased estimation of patients' survival. In this study, we develop a novel robust network-based variable selection method under the accelerated failure time model. Extensive simulation studies show the advantage of the proposed method over the alternative methods. Two case studies of lung cancer datasets with high-dimensional gene expression measurements demonstrate that the proposed approach has identified markers with important implications.
Affiliation(s)
- Jie Ren, Department of Statistics, Kansas State University, Manhattan, KS
- Yinhao Du, Department of Statistics, Kansas State University, Manhattan, KS
- Shaoyu Li, Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC
- Shuangge Ma, Department of Biostatistics, Yale University, New Haven, CT
- Yu Jiang, Division of Epidemiology, Biostatistics and Environmental Health, School of Public Health, University of Memphis, Memphis, TN
- Cen Wu, Department of Statistics, Kansas State University, Manhattan, KS
|