1
Djordjilović V, Ponzi E, Nøst TH, Thoresen M. penalizedclr: an R package for penalized conditional logistic regression for integration of multiple omics layers. BMC Bioinformatics 2024; 25:226. [PMID: 38937668 PMCID: PMC11212437 DOI: 10.1186/s12859-024-05850-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 06/20/2024] [Indexed: 06/29/2024] Open
Abstract
BACKGROUND The matched case-control design, up until recently mostly pertinent to epidemiological studies, is becoming customary in biomedical applications as well. For instance, in omics studies, it is quite common to compare cancer and healthy tissue from the same patient. Furthermore, researchers today routinely collect data from various and variable sources that they wish to relate to the case-control status. This highlights the need to develop and implement statistical methods that can take these tendencies into account. RESULTS We present an R package, penalizedclr, that provides an implementation of the penalized conditional logistic regression model for analyzing matched case-control studies. It allows for different penalties for different blocks of covariates, and it is therefore particularly useful in the presence of multi-source omics data. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection for variable selection in the considered regression model. CONCLUSIONS The proposed method fills a gap in the available software for fitting high-dimensional conditional logistic regression models accounting for the matched design and block structure of predictors/features. The output consists of a set of selected variables that are significantly associated with case-control status. These variables can then be investigated in terms of functional interpretation or validation in further, more targeted studies.
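To make the idea of block-specific penalties in a conditional logistic model concrete, the following is a minimal Python sketch of the penalized objective for the special case of 1:1 matched pairs. It is not the penalizedclr API (the package is written in R); the function name `penalized_clr_objective`, the argument layout, and the choice of one L1 and one L2 weight per block are assumptions made for illustration only.

```python
import numpy as np


def penalized_clr_objective(beta, x_case, x_control, blocks, lam_l1, lam_l2):
    """Penalized negative conditional log-likelihood for 1:1 matched pairs.

    beta      : coefficient vector, shape (p,)
    x_case    : covariates of the case in each pair, shape (n_pairs, p)
    x_control : covariates of the matched control, shape (n_pairs, p)
    blocks    : list of index arrays, one per covariate block (hypothetical layout)
    lam_l1    : one L1 penalty weight per block (assumption)
    lam_l2    : one L2 penalty weight per block (assumption)
    """
    # In 1:1 matching the conditional likelihood of a pair depends only on
    # the within-pair covariate difference.
    eta = (x_case - x_control) @ beta
    # -log(sigmoid(eta)) = log(1 + exp(-eta)), computed in a stable way.
    nll = np.sum(np.logaddexp(0.0, -eta))

    penalty = 0.0
    for idx, l1, l2 in zip(blocks, lam_l1, lam_l2):
        penalty += l1 * np.sum(np.abs(beta[idx]))  # lasso part for this block
        penalty += l2 * np.sum(beta[idx] ** 2)     # ridge part for this block
    return nll + penalty
```

An objective of this form could be passed to a generic optimizer; giving a block a larger L1 weight shrinks more of its coefficients to zero, which is one way different omics layers can be treated differently.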
Affiliation(s)
- Vera Djordjilović
  - Department of Economics, Ca' Foscari University of Venice, Venice, Italy
  - Department of Biostatistics, University of Oslo, Oslo, Norway
- Erica Ponzi
  - Department of Biostatistics, University of Oslo, Oslo, Norway
- Therese Haugdahl Nøst
  - Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway
  - Department of Community Medicine, Faculty of Health Sciences, The Arctic University of Norway, Tromsø, Norway
- Magne Thoresen
  - Department of Biostatistics, University of Oslo, Oslo, Norway
2
Buch G, Schulz A, Schmidtmann I, Strauch K, Wild PS. Sparse Group Penalties for bi-level variable selection. Biom J 2024; 66:e2200334. [PMID: 38747086 DOI: 10.1002/bimj.202200334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 02/05/2024] [Accepted: 02/07/2024] [Indexed: 06/29/2024]
Abstract
Many data sets exhibit a natural group structure due to contextual similarities or high correlations of variables, such as lipid markers that are interrelated based on biochemical principles. Knowledge of such groupings can be used through bi-level selection methods to identify relevant feature groups and highlight their predictive members. One of the best known approaches of this kind combines the classical Least Absolute Shrinkage and Selection Operator (LASSO) with the Group LASSO, resulting in the Sparse Group LASSO. We propose the Sparse Group Penalty (SGP) framework, which allows for a flexible combination of different SGL-style shrinkage conditions. Analogous to SGL, we investigated the combination of the Smoothly Clipped Absolute Deviation (SCAD), the Minimax Concave Penalty (MCP) and the Exponential Penalty (EP) with their group versions, resulting in the Sparse Group SCAD, the Sparse Group MCP, and the novel Sparse Group EP (SGE). Those shrinkage operators provide refined control of the effect of group formation on the selection process through a tuning parameter. In simulation studies, SGPs were compared with other bi-level selection methods (Group Bridge, composite MCP, and Group Exponential LASSO) for variable and group selection evaluated with the Matthews correlation coefficient. We demonstrated the advantages of the new SGE in identifying parsimonious models, but also identified scenarios that highlight the limitations of the approach. The performance of the techniques was further investigated in a real-world use case for the selection of regulated lipids in a randomized clinical trial.
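As an illustration of how the Sparse Group LASSO combines the two penalties, here is a minimal Python sketch of the penalty value under one common parametrization, in which a mixing weight `alpha` trades off the element-wise L1 term against the group-wise L2 term. The function name, the `alpha` convention and the square-root group-size weights are assumptions for this sketch and are not taken from the paper, whose SGP framework covers further penalties (SCAD, MCP, EP) that are not shown here.

```python
import numpy as np


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Value of a Sparse Group LASSO penalty (illustrative parametrization).

    beta   : coefficient vector
    groups : list of index lists, one per group of variables
    lam    : overall penalty strength
    alpha  : mixing weight in [0, 1]; 1 gives the plain LASSO,
             0 gives the plain Group LASSO (assumed convention)
    """
    beta = np.asarray(beta, dtype=float)
    # Element-wise sparsity: sum of absolute coefficients.
    l1_term = np.sum(np.abs(beta))
    # Group-wise sparsity: Euclidean norm per group, weighted by sqrt(group size).
    group_term = sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)
```

Because the group term is not differentiable when a whole group is zero, it can remove entire groups, while the L1 term can additionally zero out individual members of the groups that remain.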
Affiliation(s)
- Gregor Buch
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), Mainz, Germany
- Andreas Schulz
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Irene Schmidtmann
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Konstantin Strauch
  - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
- Philipp S Wild
  - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - German Center for Cardiovascular Research (DZHK), Mainz, Germany
  - Clinical Epidemiology and Systems Medicine, Center for Thrombosis and Hemostasis, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  - Institute of Molecular Biology (IMB), Mainz, Germany
3
Chai H, Lin S, Lin J, He M, Yang Y, OuYang Y, Zhao H. An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome. BMC Bioinformatics 2024; 25:88. [PMID: 38418940 PMCID: PMC10902951 DOI: 10.1186/s12859-024-05716-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 02/21/2024] [Indexed: 03/02/2024] Open
Abstract
BACKGROUND Predicting the outcome of breast cancer is important for selecting appropriate treatments and prolonging the survival periods of patients. Recently, different deep learning-based methods have been carefully designed for cancer outcome prediction. However, the application of these methods is still challenged by interpretability. In this study, we proposed a novel multitask deep neural network called UISNet to predict the outcome of breast cancer. The UISNet is able to interpret the importance of features for the prediction model via an uncertainty-based integrated gradients algorithm. UISNet improved the prediction by introducing prior biological pathway knowledge and utilizing patient heterogeneity information. RESULTS The model was tested in seven public datasets of breast cancer, and showed better performance (average C-index = 0.691) than the state-of-the-art methods (average C-index = 0.650, range 0.619 to 0.677). Importantly, the UISNet identified 20 genes as associated with breast cancer, among which 11 have been proven to be associated with breast cancer by previous studies, and the others are novel findings of this study. CONCLUSIONS Our proposed method is accurate and robust in predicting breast cancer outcomes, and it is an effective way to identify breast cancer-associated genes. The method codes are available at: https://github.com/chh171/UISNet.
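The C-index reported above measures how often a model ranks patients in the correct order of survival. As a rough illustration, the following Python sketch computes Harrell's concordance index for right-censored data with a simple double loop. It is not the evaluation code of UISNet; the function name and the convention that a higher risk score means shorter expected survival are assumptions, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times       : observed follow-up times
    events      : 1/True if the event was observed, 0/False if censored
    risk_scores : model output; higher is assumed to mean higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier event has the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the averages of 0.691 and 0.650 quoted in the abstract.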
Affiliation(s)
- Hua Chai
  - School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
- Siyin Lin
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Junqi Lin
  - School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
- Minfan He
  - School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, 510000, China
- Yongzhong OuYang
  - School of Mathematics and Big Data, Foshan University, Foshan, 528000, China
- Huiying Zhao
  - Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 510000, China
4
Downing T, Angelopoulos N. A primer on correlation-based dimension reduction methods for multi-omics analysis. J R Soc Interface 2023; 20:20230344. [PMID: 37817584 PMCID: PMC10565429 DOI: 10.1098/rsif.2023.0344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 09/19/2023] [Indexed: 10/12/2023] Open
Abstract
The continuing advances of omic technologies mean that it is now more tangible to measure the numerous features collectively reflecting the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omics datasets, before detailing further techniques for three or more omic datasets. We also briefly detail network methods when three or more omic datasets are available and which complement correlation-oriented tools. To aid readers new to this area, these are all linked to relevant R packages that can implement these procedures. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate emerging methods for multi-omics and integrating diverse omic datasets appropriately. This raises the opportunity of implementing population multi-omics with large sample sizes as omics technologies and our understanding improve.
Affiliation(s)
- Tim Downing
  - Pirbright Institute, Pirbright, Surrey, UK
  - Department of Biotechnology, Dublin City University, Dublin, Ireland
5
Wang Q, He M, Guo L, Chai H. AFEI: adaptive optimized vertical federated learning for heterogeneous multi-omics data integration. Brief Bioinform 2023; 24:bbad269. [PMID: 37497720 DOI: 10.1093/bib/bbad269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 06/26/2023] [Accepted: 07/04/2023] [Indexed: 07/28/2023] Open
Abstract
Vertical federated learning has gained popularity as a means of enabling collaboration and information sharing between different entities while maintaining data privacy and security. This approach has potential applications in healthcare, cancer prognosis prediction, and other fields where data privacy is a major concern. Although using multi-omics data for cancer prognosis prediction provides more information for treatment selection, collecting different types of omics data can be challenging because they are produced in various medical institutions. Data owners must comply with strict data protection regulations such as the European Union (EU) General Data Protection Regulation. To share patient data across multiple institutions, privacy and security issues must be addressed. Therefore, we propose AFEI (adaptive optimized vertical federated learning for heterogeneous multi-omics data integration), a vertical federated-learning-based framework that integrates multi-omics data collected from multiple institutions for cancer prognosis prediction. AFEI enables participating parties to build an accurate joint evaluation model for learning more information related to cancer patients from different perspectives, based on the distributed and encrypted multi-omics features shared by multiple institutions. The experimental results demonstrate that AFEI achieves higher prediction accuracy (6.5% on average) than using single omics data by utilizing the encrypted multi-omics data from different institutions, and it performs almost as well as prognosis prediction by directly integrating multi-omics data. Overall, AFEI can be seen as an efficient solution for breaking down barriers to multi-institutional collaboration and promoting the development of cancer prognosis prediction.
Affiliation(s)
- Qingyong Wang
  - School of Information and Computer, Anhui Agricultural University, Hefei 230000, China
- Minfan He
  - School of Mathematics and Big Data, Foshan University, Foshan 528000, China
- Longyi Guo
  - Guangdong Provincial Hospital of Traditional Chinese Medical, Guangzhou 510000, China
- Hua Chai
  - School of Mathematics and Big Data, Foshan University, Foshan 528000, China
6
van Nee MM, Wessels LFA, van de Wiel MA. ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinformatics 2023; 24:172. [PMID: 37101151 PMCID: PMC10134536 DOI: 10.1186/s12859-023-05289-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2022] [Accepted: 04/12/2023] [Indexed: 04/28/2023] Open
Abstract
BACKGROUND High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed. RESULTS Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper. 
CONCLUSIONS The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on CRAN (https://cran.r-project.org/web/packages/ecpc/).
Affiliation(s)
- Mirrelijn M van Nee
  - Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Lodewyk F A Wessels
  - Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands
  - Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands
  - Intelligent Systems, Delft University Medical Centers, Delft, The Netherlands
- Mark A van de Wiel
  - Epidemiology & Data Science, Amsterdam Public Health research institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
7
Zhang R, Datta S. Adaptive Sparse Multi-Block PLS Discriminant Analysis: An Integrative Method for Identifying Key Biomarkers from Multi-Omics Data. Genes (Basel) 2023; 14:genes14050961. [PMID: 37239321 DOI: 10.3390/genes14050961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 04/06/2023] [Accepted: 04/21/2023] [Indexed: 05/28/2023] Open
Abstract
With the growing use of high-throughput technologies, multi-omics data containing various types of high-dimensional omics data is increasingly being generated to explore the association between the molecular mechanism of the host and diseases. In this study, we present an adaptive sparse multi-block partial least square discriminant analysis (asmbPLS-DA), an extension of our previous work, asmbPLS. This integrative approach identifies the most relevant features across different types of omics data while discriminating multiple disease outcome groups. We used simulation data with various scenarios and a real dataset from the TCGA project to demonstrate that asmbPLS-DA can identify key biomarkers from each type of omics data with better biological relevance than existing competitive methods. Moreover, asmbPLS-DA showed comparable performance in the classification of subjects in terms of disease status or phenotypes using integrated multi-omics molecular profiles, especially when combined with other classification algorithms, such as linear discriminant analysis and random forest. We have made the R package called asmbPLS that implements this method publicly available on GitHub. Overall, asmbPLS-DA achieved competitive performance in terms of feature selection and classification. We believe that asmbPLS-DA can be a valuable tool for multi-omics research.
Affiliation(s)
- Runzhi Zhang
  - Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
- Susmita Datta
  - Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
8
Zhang R, Datta S. asmbPLS: Adaptive Sparse Multi-block Partial Least Square for Survival Prediction using Multi-Omics Data. bioRxiv 2023:2023.04.03.535442. [PMID: 37066143 PMCID: PMC10103991 DOI: 10.1101/2023.04.03.535442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
Background As high-throughput studies advance, more and more high-dimensional multi-omics data are available and collected from the same patient cohort. Using multi-omics data as predictors to predict survival outcomes is challenging due to the complex structure of such data. Results In this article, we introduce an adaptive sparse multi-block partial least square (asmbPLS) regression method by assigning different penalty factors to different blocks in different PLS components for feature selection and prediction. We compared the proposed method with several competitive algorithms in many aspects including prediction performance, feature selection and computation efficiency. The performance and the efficiency of our method were demonstrated using both the simulated and the real data. Conclusions In summary, asmbPLS achieved a competitive performance in prediction, feature selection, and computation efficiency. We anticipate asmbPLS to be a valuable tool for multi-omics research. An R package called asmbPLS implementing this method is made publicly available on GitHub.
9
Zhong T, Zhang Q, Huang J, Wu M, Ma S. Heterogeneity analysis via integrating multi-sources high-dimensional data with applications to cancer studies. Stat Sin 2023; 33:729-758. [PMID: 38037567 PMCID: PMC10686523 DOI: 10.5705/ss.202021.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract
This study has been motivated by cancer research, in which heterogeneity analysis plays an important role and can be roughly classified as unsupervised or supervised. In supervised heterogeneity analysis, the finite mixture of regression (FMR) technique is used extensively, under which the covariates affect the response differently in subgroups. High-dimensional molecular and, very recently, histopathological imaging features have been analyzed separately and shown to be effective for heterogeneity analysis. For simpler analysis, they have been shown to contain overlapping, but also independent information. In this article, our goal is to conduct the first and more effective FMR-based cancer heterogeneity analysis by integrating high-dimensional molecular and histopathological imaging features. A penalization approach is developed to regularize estimation, select relevant variables, and, equally importantly, promote the identification of independent information. Consistency properties are rigorously established. An effective computational algorithm is developed. A simulation and an analysis of The Cancer Genome Atlas (TCGA) lung cancer data demonstrate the practical effectiveness of the proposed approach. Overall, this study provides a practical and useful new way of conducting supervised cancer heterogeneity analysis.
Affiliation(s)
- Tingyan Zhong
  - SJTU-Yale Joint Center for Biostatistics, Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Fujian, China
- Jian Huang
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Kowloon, Hong Kong
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT 06520-0834, USA
10
Tay JK, Aghaeepour N, Hastie T, Tibshirani R. Feature-weighted elastic net: using "features of features" for better prediction. Stat Sin 2023; 33:259-279. [PMID: 37102071 PMCID: PMC10129060 DOI: 10.5705/ss.202020.0226] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
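To show how "features of features" can adapt the penalty on each coefficient, the sketch below computes per-feature penalty weights from a matrix `Z` of feature-level information and plugs them into an elastic net penalty. The abstract does not give the weight formula; the softmax-style form used here, in which each weight is the average exponentiated score divided by the feature's own exponentiated score, is an assumption for illustration, as are all function and argument names.

```python
import numpy as np


def feature_weights(Z, theta):
    """Per-feature penalty weights from feature-level information.

    Z     : array of shape (p, k), one row of "features of features" per feature
    theta : array of shape (k,), hyperparameters linking Z to the penalties
    Assumed form: w_j = sum_l exp(z_l . theta) / (p * exp(z_j . theta)).
    """
    scores = Z @ theta
    scores = scores - scores.max()  # shift for numerical stability; ratios unchanged
    expo = np.exp(scores)
    return expo.sum() / (len(expo) * expo)


def weighted_elastic_net_penalty(beta, weights, lam, alpha):
    """Elastic net penalty in which each coefficient has its own weight."""
    beta = np.asarray(beta, dtype=float)
    per_feature = alpha * np.abs(beta) + 0.5 * (1.0 - alpha) * beta ** 2
    return lam * np.sum(weights * per_feature)
```

With `theta` set to zero every weight equals 1 and the penalty reduces to the ordinary elastic net; a feature whose score `z_j . theta` is large receives a weight below 1 and is therefore penalized less.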
Affiliation(s)
- Nima Aghaeepour
  - Department of Anesthesiology, Pain, and Perioperative Medicine, Stanford University
  - Department of Pediatrics, Stanford University
  - Department of Biomedical Data Sciences, Stanford University
- Trevor Hastie
  - Department of Statistics, Stanford University
  - Department of Biomedical Data Sciences, Stanford University
- Robert Tibshirani
  - Department of Statistics, Stanford University
  - Department of Biomedical Data Sciences, Stanford University
11
Ng HM, Jiang B, Wong KY. Penalized estimation of a class of single-index varying-coefficient models for integrative genomic analysis. Biom J 2023; 65:e2100139. [PMID: 35837982 DOI: 10.1002/bimj.202100139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 01/17/2023]
Abstract
Recent technological advances have made it possible to collect high-dimensional genomic data along with clinical data on a large number of subjects. In the studies of chronic diseases such as cancer, it is of great interest to integrate clinical and genomic data to build a comprehensive understanding of the disease mechanisms. Despite extensive studies on integrative analysis, it remains an ongoing challenge to model the interaction effects between clinical and genomic variables, due to high dimensionality of the data and heterogeneity across data types. In this paper, we propose an integrative approach that models interaction effects using a single-index varying-coefficient model, where the effects of genomic features can be modified by clinical variables. We propose a penalized approach for separate selection of main and interaction effects. Notably, the proposed methods can be applied to right-censored survival outcomes based on a Cox proportional hazards model. We demonstrate the advantages of the proposed methods through extensive simulation studies and provide applications to a motivating cancer genomic study.
Affiliation(s)
- Hoi Min Ng
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
- Binyan Jiang
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
- Kin Yau Wong
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
12
van Nee MM, van de Brug T, van de Wiel MA. Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022; 32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]
Abstract
Elastic net penalization is widely used in high-dimensional prediction and variable selection settings. Auxiliary information on the variables, for example, groups of variables, is often available. Group-adaptive elastic net penalization exploits this information to potentially improve performance by estimating group penalties, thereby penalizing important groups of variables less than other groups. Estimating these group penalties is, however, hard due to the high dimension of the data. Existing methods are computationally expensive or not generic in the type of response. Here we present a fast method for estimation of group-adaptive elastic net penalties for generalized linear models. We first derive a low-dimensional representation of the Taylor approximation of the marginal likelihood for group-adaptive ridge penalties, to efficiently estimate these penalties. Then we show by using asymptotic normality of the linear predictors that this marginal likelihood approximates that of elastic net models. The ridge group penalties are then transformed to elastic net group penalties by matching the ridge prior variance to the elastic net prior variance as function of the group penalties. The method allows for overlapping groups and unpenalized variables, and is easily extended to other penalties. For a model-based simulation study and two cancer genomics applications we demonstrate a substantially decreased computation time and improved or matching performance compared to other methods. Supplementary materials for this article are available online.
Affiliation(s)
- Mirrelijn M. van Nee
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Tim van de Brug
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Mark A. van de Wiel
  - Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
13
Prognostic Gene Expression-Based Signature in Clear-Cell Renal Cell Carcinoma. Cancers (Basel) 2022; 14:cancers14153754. [PMID: 35954418 PMCID: PMC9367562 DOI: 10.3390/cancers14153754] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/20/2022] [Revised: 07/21/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023] Open
Abstract
The inaccuracy of the current prognostic algorithms and the potential changes in the therapeutic management of localized ccRCC demands the development of an improved prognostic model for these patients. To this end, we analyzed whole-transcriptome profiling of 26 tissue samples from progressive and non-progressive ccRCCs using Illumina Hi-seq 4000. Differentially expressed genes (DEG) were intersected with the RNA-sequencing data from the TCGA. The overlapping genes were used for further analysis. A total of 132 genes were found to be prognosis-related genes. LASSO regression enabled the development of the best prognostic six-gene panel. Cox regression analyses were performed to identify independent clinical prognostic parameters to construct a combined nomogram which includes the expression of CERCAM, MIA2, HS6ST2, ONECUT2, SOX12, TMEM132A, pT stage, tumor size and ISUP grade. A risk score generated using this model effectively stratified patients at higher risk of disease progression (HR 10.79; p < 0.001) and cancer-specific death (HR 19.27; p < 0.001). It correlated with the clinicopathological variables, enabling us to discriminate a subset of patients at higher risk of progression within the Stage, Size, Grade and Necrosis score (SSIGN) risk groups, pT and ISUP grade. In summary, a gene expression-based prognostic signature was successfully developed providing a more precise assessment of the individual risk of progression.
14
Zhao Z, Wang S, Zucknick M, Aittokallio T. Tissue-specific identification of multi-omics features for pan-cancer drug response prediction. iScience 2022; 25:104767. [PMID: 35992090 PMCID: PMC9385562 DOI: 10.1016/j.isci.2022.104767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 06/28/2022] [Accepted: 07/11/2022] [Indexed: 11/29/2022] Open
Abstract
Current statistical models for drug response prediction and biomarker identification fall short in leveraging the shared and unique information from various cancer tissues and multi-omics profiles. We developed the mix-lasso model, which introduces an additional sample group penalty term to capture tissue-specific effects of features on pan-cancer response prediction. The mix-lasso model takes into account both the similarity between drug responses (i.e., multi-task learning) and the heterogeneity between multi-omics data (multi-modal learning). When applied to a large-scale pharmacogenomics dataset from the Cancer Therapeutics Response Portal, mix-lasso enabled accurate drug response predictions and identification of tissue-specific predictive features in the presence of various degrees of missing data, drug-drug correlations, and high-dimensional and correlated genomic and molecular features that often hinder the use of statistical approaches in drug response modeling. Compared to the tree lasso model, mix-lasso identified a smaller number of tissue-specific features, hence making the model more interpretable and stable for drug discovery applications. Highlights: pan-cancer cell lines provide a test bench for exploring gene-drug relationships; multi-omics data were integrated with pharmacological profiles for joint modeling; mix-lasso identifies tissue-specific biomarkers predictive of multi-drug responses; mix-lasso provides a small number of stable features for drug discovery applications.
Affiliation(s)
- Zhi Zhao
- Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Norway
- Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Norway
| | - Shixiong Wang
- Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Norway
| | - Manuela Zucknick
- Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Norway
- Corresponding author
| | - Tero Aittokallio
- Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Norway
- Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Norway
- Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Finland
- Corresponding author
|
15
|
He H, Guo X, Yu J, Ai C, Shi S. Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 2022; 38:1542-1549. [PMID: 34908103 DOI: 10.1093/bioinformatics/btab848] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/05/2021] [Revised: 12/08/2021] [Accepted: 12/13/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Efficiently identifying genes based on gene expression level has been studied to help classify different cancer types and improve prediction performance. A logistic regression model based on regularization is often an effective approach for simultaneously achieving prediction and feature (gene) selection in high-dimensional genomic data. However, standard methods ignore biological group structure and generally result in poorer predictive models. RESULTS In this article, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection, based on the sparse group lasso penalty combined with stacking. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a sparse subset of feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing one ratio, which could help to overcome the inadaptability of sparse group lasso for some data. Considering that stacking weakens feature selection, we perform a post hoc feature selection, which might slightly reduce predictive performance but is superior in feature selection. Experimental results on simulations demonstrate that our approach achieves competitive and stable classification performance and a lower false discovery rate in feature selection for varying sets of data compared with other regularization methods. In addition, our method shows better accuracy on three public cancer datasets and identifies more powerful discriminatory and potential mutation genes for thyroid carcinoma. AVAILABILITY AND IMPLEMENTATION The real data underlying this article are available from https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
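The mixing parameter described in this abstract can be illustrated with a minimal sketch of the sparse group lasso penalty. This is not the authors' Stacked SGL code; the function name, the argument names, and the square-root group-size weighting (a common convention for this penalty) are assumptions made here for illustration only.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Illustrative sparse group lasso penalty (names and weighting are assumptions).

    Computes lam * (alpha * ||beta||_1
                    + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2),
    where alpha in [0, 1] mixes the lasso part and the group lasso part.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Lasso part: encourages sparsity within groups.
    lasso_part = sum(abs(b) for b in beta)
    # Group lasso part: encourages whole groups to drop out.
    group_part = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_part += math.sqrt(len(idx)) * group_norm
    return lam * (alpha * lasso_part + (1.0 - alpha) * group_part)


# Toy coefficients: one active group and one empty group.
beta = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]
penalties = {
    a: sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=a)
    for a in (0.0, 0.5, 1.0)
}
```

Setting alpha to 1 recovers the plain lasso penalty and alpha to 0 the group lasso penalty. Stacking, as described in the abstract, would combine models fitted at several such ratios instead of committing to a single one.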
Affiliation(s)
- Huan He
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Xinyun Guo
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Jialin Yu
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Chen Ai
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
| | - Shaoping Shi
- Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China
|
16
|
Das S, Mukhopadhyay I. TiMEG: an integrative statistical method for partially missing multi-omics data. Sci Rep 2021; 11:24077. [PMID: 34911979 PMCID: PMC8674330 DOI: 10.1038/s41598-021-03034-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/01/2021] [Accepted: 11/24/2021] [Indexed: 11/25/2022] Open
Abstract
Multi-omics data integration is widely used to understand the genetic architecture of disease. In multi-omics association analysis, data collected on multiple omics for the same set of individuals are immensely important for biomarker identification. But when the sample size of such data is limited, the presence of partially missing individual-level observations poses a major challenge in data integration. More often, genotype data are available for all individuals under study but gene expression and/or methylation information are missing for different subsets of those individuals. Here, we develop a statistical model TiMEG, for the identification of disease-associated biomarkers in a case-control paradigm by integrating the above-mentioned data types, especially, in presence of missing omics data. Based on a likelihood approach, TiMEG exploits the inter-relationship among multiple omics data to capture weaker signals, that remain unidentified in single-omic analysis or common imputation-based methods. Its application on a real tuberous sclerosis dataset identified functionally relevant genes in the disease pathway.
Affiliation(s)
- Sarmistha Das
- Human Genetics Unit, Indian Statistical Institute, Kolkata, 700108, India
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, 38105, USA
|
17
|
Madjar K, Rahnenführer J. Weighted Cox regression for the prediction of heterogeneous patient subgroups. BMC Med Inform Decis Mak 2021; 21:342. [PMID: 34876106 PMCID: PMC8650299 DOI: 10.1186/s12911-021-01698-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2021] [Accepted: 11/23/2021] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to their relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results and subgroup-specific effects may remain undetected. METHODS For this situation, we propose a penalized Cox regression model with a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. The weights are estimated from the data such that patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model. RESULTS Our proposed approach is evaluated through simulations and application to real lung cancer cohorts, and compared to existing approaches. Simulation results demonstrate that our proposed model is superior to standard approaches in terms of prediction performance and variable selection accuracy when the sample size is small. CONCLUSIONS The results suggest that sharing information between subgroups by incorporating appropriate weights into the likelihood can increase power to identify the prognostic covariates and improve risk prediction.
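A weighted Cox partial likelihood of the kind mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes no tied event times, that each subject's weight enters both its own event term and every risk-set sum, and that the weights are already given (the abstract says they are estimated from the data, which is not shown here). All names are hypothetical.

```python
import math


def weighted_cox_log_partial_likelihood(times, events, eta, weights):
    """Illustrative weighted Cox log partial likelihood (no ties assumed).

    times   : observed follow-up times
    events  : 1 if the event was observed, 0 if censored
    eta     : linear predictors x_i^T beta
    weights : non-negative subject weights (assumed known here)
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        # Censored subjects and zero-weight subjects add no event term.
        if not d_i or weights[i] == 0:
            continue
        # Weighted sum over the risk set: everyone still under observation at t_i.
        risk_sum = sum(
            w * math.exp(e)
            for t, e, w in zip(times, eta, weights)
            if t >= t_i
        )
        total += weights[i] * (eta[i] - math.log(risk_sum))
    return total


# Toy data: three subjects, the last one censored, all linear predictors zero.
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]
eta = [0.0, 0.0, 0.0]
ll_equal = weighted_cox_log_partial_likelihood(times, events, eta, [1.0, 1.0, 1.0])
ll_down = weighted_cox_log_partial_likelihood(times, events, eta, [1.0, 0.0, 1.0])
```

With all weights equal to one this reduces to the usual Cox log partial likelihood. Setting a subject's weight to zero removes that subject from the fit, which is how subgroup-specific weights would shift emphasis toward the subgroup of interest.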
Affiliation(s)
- Katrin Madjar
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany.
| | - Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, 44221, Dortmund, Germany
|
18
|
Learning social networks from text data using covariate information. STAT METHOD APPL-GER 2021. [DOI: 10.1007/s10260-021-00586-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Accurately describing the lives of historical figures can be challenging, but unraveling their social structures is perhaps even more so. Historical social network analysis methods can help in this regard and may even illuminate individuals who have been overlooked by historians but turn out to be influential social connection points. Text data, such as biographies, are a useful source of information for learning historical social networks, but the identification of links based on text data can be challenging. The Local Poisson Graphical Lasso model represents social networks by conditional independence structures and leverages the number of name co-mentions in the text to infer relationships. However, this method does not take into account the abundance of covariate information that is often available in text data. Conditional independence structures such as the Poisson Graphical Model, which make use of name mention counts in the text, can be useful tools for avoiding false positive links due to co-mentions; however, given the historical tendency toward frequently used or common names, we may introduce incorrect connections without additional distinguishing information. In this work, we therefore extend the Local Poisson Graphical Lasso model with a (multiple) penalty structure that incorporates covariates, opening up the opportunity for similar individuals to have a higher probability of being connected. We propose both greedy and Bayesian approaches to estimate the penalty parameters. We present results on data simulated with characteristics of historical networks and show that this type of penalty structure can improve network recovery as measured by precision and recall. We also illustrate the approach on biographical data of individuals who lived in early modern Britain between 1500 and 1575. We show how these covariates affect the statistical model’s performance using simulations, and discuss how they help to better identify links for people with common names and for those who are traditionally underrepresented in biographical text data.
|
19
|
Synergistic Effects of Different Levels of Genomic Data for the Staging of Lung Adenocarcinoma: An Illustrative Study. Genes (Basel) 2021; 12:genes12121872. [PMID: 34946821 PMCID: PMC8700916 DOI: 10.3390/genes12121872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Revised: 11/18/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022] Open
Abstract
Lung adenocarcinoma (LUAD) is a common and very lethal cancer. Accurate staging is a prerequisite for its effective diagnosis and treatment. Therefore, improving the accuracy of the stage prediction of LUAD patients is of great clinical relevance. Previous works have mainly focused on single genomic data information or a small number of different omics data types concurrently for generating predictive models. A few of them have considered multi-omics data from genome to proteome. We used a publicly available dataset to illustrate the potential of multi-omics data for stage prediction in LUAD. In particular, we investigated the roles of the specific omics data types in the prediction process. We used a self-developed method, Omics-MKL, for stage prediction that combines an existing feature ranking technique Minimum Redundancy and Maximum Relevance (mRMR), which avoids redundancy among the selected features, and multiple kernel learning (MKL), applying different kernels for different omics data types. Each of the considered omics data types individually provided useful prediction results. Moreover, using multi-omics data delivered notably better results than using single-omics data. Gene expression and methylation information seem to play vital roles in the staging of LUAD. The Omics-MKL method retained 70 features after the selection process. Of these, 21 (30%) were methylation features and 34 (48.57%) were gene expression features. Moreover, 18 (25.71%) of the selected features are known to be related to LUAD, and 29 (41.43%) to lung cancer in general. Using multi-omics data from genome to proteome for predicting the stage of LUAD seems promising because each omics data type may improve the accuracy of the predictions. Here, methylation and gene expression data may play particularly important roles.
|
20
|
van Nee MM, Wessels LFA, van de Wiel MA. Flexible co-data learning for high-dimensional prediction. Stat Med 2021; 40:5910-5925. [PMID: 34438466 PMCID: PMC9292202 DOI: 10.1002/sim.9162] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/09/2020] [Revised: 05/18/2021] [Accepted: 07/29/2021] [Indexed: 02/06/2023]
Abstract
Clinical research often focuses on complex traits in which many variables play a role in mechanisms driving, or curing, diseases. Clinical prediction is hard when data is high-dimensional, but additional information, like domain knowledge and previously published studies, may be helpful to improve predictions. Such complementary data, or co-data, provide information on the covariates, such as genomic location or P-values from external studies. We use multiple and various co-data to define possibly overlapping or hierarchically structured groups of covariates. These are then used to estimate adaptive multi-group ridge penalties for generalized linear and Cox models. Available group adaptive methods primarily target settings with few groups, and therefore are likely to overfit for non-informative, correlated or many groups, and do not account for known structure on the group level. To handle these issues, our method combines empirical Bayes estimation of the hyperparameters with an extra level of flexible shrinkage. This renders a uniquely flexible framework, as any type of shrinkage can be used on the group level. We describe various types of co-data and propose suitable forms of hypershrinkage. The method is very versatile, as it allows for integration and weighting of multiple co-data sets, inclusion of unpenalized covariates and posterior variable selection. For three cancer genomics applications we demonstrate improvements compared to other models in terms of performance, variable selection stability and validation.
Affiliation(s)
- Mirrelijn M van Nee
- Epidemiology & Data Science
- Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| | - Lodewyk F A Wessels
- Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands.,Computational Cancer Biology, Oncode Institute, Amsterdam, The Netherlands.,Intelligent Systems, Delft University of Technology, Delft, The Netherlands
| | - Mark A van de Wiel
- Epidemiology & Data Science
- Amsterdam Public Health Research Institute, Amsterdam University Medical Centers, Amsterdam, The Netherlands.,MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
|
21
|
Han H, Dawson KJ. Applying elastic-net regression to identify the best models predicting changes in civic purpose during the emerging adulthood. J Adolesc 2021; 93:20-27. [PMID: 34634726 DOI: 10.1016/j.adolescence.2021.09.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/25/2020] [Revised: 08/04/2021] [Accepted: 09/29/2021] [Indexed: 10/20/2022]
Abstract
INTRODUCTION Changes in civic purpose during emerging adulthood have been a significant research topic, since they are closely associated with active civic engagement later in life. While standard regression methods have been used in previous studies to predict civic purpose development, they have limitations that may not always lead to the best prediction models. We aimed to address these limitations by utilizing elastic-net multinomial logistic regression, which favors models with the fewest necessary predictors, to explore predictors of civic purpose development in a data-driven manner. METHODS We analyzed data from the longitudinal Civic Purpose Project, focusing on the model that best predicted civic purpose from Wave 1 (12th grade before high school graduation) to Wave 2 (two years after Wave 1). The reanalyzed data included responses from 476 participants (60.29% females, 39.08% males) who were recruited from Californian high schools in the United States and completed the survey at both Waves. The elastic-net regression was performed 5000 times for predicting three dependent variables, Wave 2 political purpose, community service purpose, and expressive activity purpose, with Wave 1 predictors. We identified which predictors were selected as the constituents of the best regression models during the elastic-net regression process. RESULTS Results showed that civic purpose, moral and political identity, and external supports (e.g., parental and peer involvement, school civic opportunities, etc.) in Wave 1 significantly predicted civic purpose in Wave 2. Several predictors were excluded from the regression models during the elastic-net regression process. CONCLUSION We found that the elastic-net regression was able to produce a more regularized model for prediction. Implications for promoting civic purpose are discussed, as well as the use of the elastic-net regression method.
Affiliation(s)
- Hyemin Han
- Educational Psychology Program, University of Alabama, USA.
|
22
|
Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021; 17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022] Open
Abstract
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy (measured by combining clustering accuracy and clinical significance), robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most of the cancers studied, which may be of particular interest to researchers in omics data analysis.
Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is believed that the more omics data we use, the more accurate the identification of cancer subtypes will be. To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.
Affiliation(s)
- Ran Duan
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Yong Gao
- Department of Computer Science, The University of British Columbia Okanagan, Kelowna, British Columbia, Canada
| | - Yuxuan Hu
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Han Xu
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Mingfeng Huang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Kuo Song
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Hongda Wang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Yongqiang Dong
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Chaoqun Jiang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Chenxing Zhang
- School of Computer Science and Technology, Xidian University, Xi’an, China
| | - Songwei Jia
- School of Computer Science and Technology, Xidian University, Xi’an, China
|
23
|
Zeng C, Thomas DC, Lewinger JP. Incorporating prior knowledge into regularized regression. Bioinformatics 2021; 37:514-521. [PMID: 32915960 PMCID: PMC8599719 DOI: 10.1093/bioinformatics/btaa776] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/12/2020] [Revised: 08/13/2020] [Accepted: 09/01/2020] [Indexed: 01/15/2023] Open
Abstract
MOTIVATION Associated with genomic features like gene expression, methylation and genotypes, used in statistical modeling of health outcomes, there is a rich set of meta-features like functional annotations, pathway information and knowledge from previous studies, that can be used post hoc to facilitate the interpretation of a model. However, using this meta-feature information a priori rather than post hoc can yield improved prediction performance as well as enhanced model interpretation. RESULTS We propose a new penalized regression approach that allows a priori integration of external meta-features. The method extends LASSO regression by incorporating individualized penalty parameters for each regression coefficient. The penalty parameters are, in turn, modeled as a log-linear function of the meta-features and are estimated from the data using an approximate empirical Bayes approach. Optimization of the marginal likelihood, on which the empirical Bayes estimation is based, is performed using a fast and stable majorization-minimization procedure. Through simulations, we show that the proposed regression with individualized penalties can outperform the standard LASSO in terms of both parameter estimation and prediction performance when the external data is informative. We further demonstrate our approach with applications to gene expression studies of bone density and breast cancer. AVAILABILITY AND IMPLEMENTATION The methods have been implemented in the R package xtune, freely available for download from https://cran.r-project.org/web/packages/xtune/index.html.
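The idea of individualized penalties that are log-linear in the meta-features can be sketched in a few lines. This is a toy illustration, not the xtune API: the function names, the meta-feature matrix, and the coefficient vector are hypothetical, and the coefficients are fixed by hand here, whereas the abstract describes estimating them by approximate empirical Bayes.

```python
import math


def individualized_penalties(meta_features, alpha):
    """Penalty for coefficient j is exp(z_j . alpha), i.e. log-linear in z_j.

    meta_features : one row z_j of meta-feature values per regression coefficient
    alpha         : meta-feature coefficients (hand-picked here for illustration)
    """
    return [
        math.exp(sum(z * a for z, a in zip(row, alpha)))
        for row in meta_features
    ]


def weighted_l1_penalty(beta, lambdas):
    """LASSO-type penalty with a separate penalty parameter per coefficient."""
    return sum(lam * abs(b) for lam, b in zip(lambdas, beta))


# Two coefficients; meta-features are an intercept plus one binary annotation.
meta_features = [[1.0, 0.0], [1.0, 1.0]]
alpha = [0.0, math.log(2.0)]  # annotated features get twice the penalty
lambdas = individualized_penalties(meta_features, alpha)
penalty = weighted_l1_penalty([0.5, -0.5], lambdas)
```

When every row of the meta-feature matrix contains only the intercept, all penalties coincide and the expression reduces to the standard LASSO penalty with a single tuning parameter.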
Affiliation(s)
- Chubing Zeng
- Division of Biostatistics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Duncan Campbell Thomas
- Division of Biostatistics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Juan Pablo Lewinger
- Division of Biostatistics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
|
24
|
Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. NATURE COMPUTATIONAL SCIENCE 2021; 1:395-402. [PMID: 38217236 DOI: 10.1038/s43588-021-00086-z] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 03/18/2021] [Accepted: 05/17/2021] [Indexed: 01/15/2024]
Abstract
Multi-omics approaches have become a reality in both large genomics projects and small laboratories. However, the multi-omics research community still faces a number of issues that have either not been sufficiently discussed or for which current solutions are still limited. In this Perspective, we elaborate on these limitations and suggest points of attention for future research. We finally discuss new opportunities and challenges brought to the field by the rapid development of single-cell high-throughput molecular technologies.
Affiliation(s)
- Sonia Tarazona
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
| | - Angeles Arzalluz-Luque
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
| | - Ana Conesa
- Microbiology and Cell Science Department, Institute for Food and Agricultural Research, University of Florida, Gainesville, FL, USA.
- Genetics Institute, University of Florida, Gainesville, FL, USA.
- Institute for Integrative Systems Biology, Spanish National Research Council, Valencia, Spain.
|
25
|
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics
| | - Huangdi Yi
- Department of Biostatistics at Yale University
| | - Shuangge Ma
- Department of Biostatistics at Yale University
|
26
|
van de Wiel MA, van Nee MM, Rauschenberger A. Fast Cross-validation for Multi-penalty High-dimensional Ridge Regression. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1904962] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Mark A. van de Wiel
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| | - Mirrelijn M. van Nee
- Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands
| | - Armin Rauschenberger
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Esch-sur-Alzette, Luxembourg
|
27
|
Magazzù G, Zampieri G, Angione C. Multimodal regularised linear models with flux balance analysis for mechanistic integration of omics data. Bioinformatics 2021; 37:3546-3552. [PMID: 33974036 DOI: 10.1093/bioinformatics/btab324] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/16/2020] [Revised: 01/06/2021] [Accepted: 04/27/2021] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION High-throughput biological data, thanks to technological advances, have become cheaper to collect, leading to the availability of vast amounts of omic data of different types. In parallel, the in silico reconstruction and modelling of metabolic systems is now acknowledged as a key tool to complement experimental data on a large scale. The integration of this model- and data-driven information is therefore emerging as a new challenge in systems biology, with no clear guidance on how best to take advantage of the inherent multi-source and multi-omic nature of these data types while preserving mechanistic interpretation. RESULTS Here we investigate different regularisation techniques for high-dimensional data derived from the integration of gene expression profiles with metabolic flux data, extracted from strain-specific metabolic models, to improve cellular growth rate predictions. To this end, we propose ad-hoc extensions of previous regularisation frameworks including group, view-specific and principal component regularisation, and experimentally compare them using data from 1,143 Saccharomyces cerevisiae strains. We observe a divergence between methods in terms of regression accuracy and integration effectiveness based on the type of regularisation employed. In multi-omic regression tasks, when learning from experimental and model-generated omic data, our results demonstrate the competitiveness and ease of interpretation of multimodal regularised linear models compared to data-hungry methods based on neural networks. AVAILABILITY All data, models, and code produced in this work are available on GitHub at https://github.com/Angione-Lab/HybridGroupIPFLasso_pc2Lasso. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Giuseppe Magazzù
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK
- Guido Zampieri
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK; Department of Biology, University of Padova, Padova, Italy
- Claudio Angione
- School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK; Healthcare Innovation Centre, Teesside University, Middlesbrough, UK; Centre for Digital Innovation, Teesside University, Middlesbrough, UK
|
28
|
Zhao L, Dong Q, Luo C, Wu Y, Bu D, Qi X, Luo Y, Zhao Y. DeepOmix: A scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis. Comput Struct Biotechnol J 2021; 19:2719-2725. [PMID: 34093987 PMCID: PMC8131983 DOI: 10.1016/j.csbj.2021.04.067] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 03/01/2021] [Revised: 04/26/2021] [Accepted: 04/27/2021] [Indexed: 01/23/2023] Open
Abstract
Integrative analysis of multi-omics data can yield valuable insights into complex molecular mechanisms for various diseases. However, because of their different modalities and high dimension, utilizing and integrating different types of omics data remains highly challenging. There is an urgent need to develop a powerful method to improve survival prediction and detect functional gene modules from multi-omics data. To deal with these problems, we present DeepOmix (a scalable and interpretable multi-Omics Deep learning framework and application in cancer survival analysis), a flexible, scalable, and interpretable method for extracting relationships between the clinical survival time and multi-omics data based on a deep learning framework. DeepOmix enables the non-linear combination of variables from different omics datasets and incorporates prior biological information defined by users (such as signaling pathways and tissue networks). Benchmark experiments demonstrate that DeepOmix outperforms five other cutting-edge prediction methods. In addition, Lower Grade Glioma (LGG) is taken as a case study to perform prognosis prediction and to illustrate the functional module nodes that are associated with the prognostic result in the prediction model.
Affiliation(s)
- Lianhe Zhao
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Qiongye Dong
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Chunlong Luo
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yang Wu
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Dechao Bu
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Xiaoning Qi
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yufan Luo
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Yi Zhao
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Hwa Mei Hospital, University of Chinese Academy of Sciences, Ningbo 315000, China
|
29
|
Cao W, Luo C, Lei M, Shen M, Ding W, Wang M, Song M, Ge J, Zhang Q. Development and Validation of a Dynamic Nomogram to Predict the Risk of Neonatal White Matter Damage. Front Hum Neurosci 2021; 14:584236. [PMID: 33708079 PMCID: PMC7940363 DOI: 10.3389/fnhum.2020.584236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/16/2020] [Accepted: 12/31/2020] [Indexed: 12/23/2022] Open
Abstract
Purpose White matter damage (WMD) was defined as the appearance of rough and uneven echo enhancement in the white matter around the ventricle. The aim of this study was to develop and validate a risk prediction model for neonatal WMD. Materials and Methods We collected data for 1,733 infants hospitalized at the Department of Neonatology at The First Affiliated Hospital of Zhengzhou University from 2017 to 2020. Infants were randomly assigned to training (n = 1,216) or validation (n = 517) cohorts at a ratio of 7:3. Multivariate logistic regression and least absolute shrinkage and selection operator (LASSO) regression analyses were used to establish a risk prediction model and web-based risk calculator based on the training cohort data. The predictive accuracy of the model was verified in the validation cohort. Results We identified four variables as independent risk factors for brain WMD in neonates by multivariate logistic regression and LASSO analysis, including gestational age, fetal distress, prelabor rupture of membranes, and use of corticosteroids. These were used to establish a risk prediction nomogram and web-based calculator (https://caowenjun.shinyapps.io/dynnomapp/). The C-index of the training and validation sets was 0.898 (95% confidence interval: 0.8745-0.9215) and 0.887 (95% confidence interval: 0.8478-0.9262), respectively. Decision tree analysis showed that the model was highly effective in the threshold range of 1-61%. The sensitivity and specificity of the model were 82.5 and 81.7%, respectively, and the cutoff value was 0.099. Conclusion This is the first study describing the use of a nomogram and web-based calculator to predict the risk of WMD in neonates. 
The web-based calculator increases the applicability of the predictive model and is a convenient tool for doctors at primary hospitals and outpatient clinics, family doctors, and even parents to identify high-risk births early on and implement appropriate interventions while avoiding excessive treatment of low-risk patients.
Affiliation(s)
- Wenjun Cao
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Chenghan Luo
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Mengyuan Lei
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Min Shen
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Wenqian Ding
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Mengmeng Wang
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Min Song
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Jian Ge
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Qian Zhang
- Neonatal Intensive Care Unit, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
|
30
|
Krautenbacher N, Kabesch M, Horak E, Braun-Fahrländer C, Genuneit J, Boznanski A, von Mutius E, Theis F, Fuchs C, Ege MJ. Asthma in farm children is more determined by genetic polymorphisms and in non-farm children by environmental factors. Pediatr Allergy Immunol 2021; 32:295-304. [PMID: 32997854 DOI: 10.1111/pai.13385] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/27/2020] [Revised: 09/22/2020] [Accepted: 09/23/2020] [Indexed: 01/06/2023]
Abstract
BACKGROUND The asthma syndrome is influenced by hereditary and environmental factors. With the example of farm exposure, we study whether genetic and environmental factors interact for asthma. METHODS Statistical learning approaches based on penalized regression and decision trees were used to predict asthma in the GABRIELA study with 850 cases (9% farm children) and 857 controls (14% farm children). Single-nucleotide polymorphisms (SNPs) were selected from a genome-wide dataset based on a literature search or by statistical selection techniques. Prediction was assessed by receiver operating characteristics (ROC) curves and validated in the PASTURE cohort. RESULTS Prediction by family history of asthma and atopy yielded an area under the ROC curve (AUC) of 0.62 [0.57-0.66] in the random forest machine learning approach. By adding information on demographics (sex and age) and 26 environmental exposure variables, the quality of prediction significantly improved (AUC = 0.65 [0.61-0.70]). In farm children, however, environmental variables did not improve prediction quality. Rather SNPs related to IL33 and RAD50 contributed significantly to the prediction of asthma (AUC = 0.70 [0.62-0.78]). CONCLUSIONS Asthma in farm children is more likely predicted by other factors as compared to non-farm children though in both forms, family history may integrate environmental exposure, genotype and degree of penetrance.
Affiliation(s)
- Norbert Krautenbacher
- Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany
- Michael Kabesch
- University Children's Hospital Regensburg (KUNO), Regensburg, Germany; Clinic for Pediatric Pneumology and Neonatology, Hannover Medical School, Hannover, Germany; The German Center for Lung Research (DZL), Germany
- Elisabeth Horak
- Department of Pediatrics and Adolescents, Innsbruck Medical University, Innsbruck, Austria
- Charlotte Braun-Fahrländer
- Swiss Tropical and Public Health Institute Basel, Basel, Switzerland; University of Basel, Basel, Switzerland
- Jon Genuneit
- Institute of Epidemiology and Medical Biometry, Ulm University, Ulm, Germany; Pediatric Epidemiology, Department of Pediatrics, Medical Faculty, Leipzig University, Leipzig, Germany
- Erika von Mutius
- The German Center for Lung Research (DZL), Germany; Dr von Hauner Children's Hospital, LMU Munich, Munich, Germany; Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Asthma and Allergy Prevention, Neuherberg, Germany
- Fabian Theis
- Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany
- Christiane Fuchs
- Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany; Department of Business Administration and Economics, Bielefeld University, Bielefeld, Germany
- Markus J Ege
- The German Center for Lung Research (DZL), Germany; Dr von Hauner Children's Hospital, LMU Munich, Munich, Germany
|
31
|
Abstract
In recent biomedical studies, multidimensional profiling, which collects proteomics as well as other types of omics data on the same subjects, is getting increasingly popular. Proteomics, transcriptomics, genomics, epigenomics, and other types of data contain overlapping as well as independent information, which suggests the possibility of integrating multiple types of data to generate more reliable findings/models with better classification/prediction performance. In this chapter, a selective review is conducted on recent data integration techniques for both unsupervised and supervised analysis. The main objective is to provide the "big picture" of data integration that involves proteomics data and discuss the "intuition" beneath the recently developed approaches without invoking too many mathematical details. Potential pitfalls and possible directions for future developments are also discussed.
Affiliation(s)
- Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Yu Jiang
- School of Public Health, University of Memphis, Memphis, TN, USA
- Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, Yale University, New Haven, CT, USA
|
32
|
Mackay IJ, Cockram J, Howell P, Powell W. Understanding the classics: the unifying concepts of transgressive segregation, inbreeding depression and heterosis and their central relevance for crop breeding. Plant Biotechnology Journal 2021; 19:26-34. [PMID: 32996672 PMCID: PMC7769232 DOI: 10.1111/pbi.13481] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 01/23/2020] [Revised: 09/07/2020] [Accepted: 09/12/2020] [Indexed: 05/12/2023]
Abstract
Transgressive segregation and heterosis are the reasons that plant breeding works. Molecular explanations for both phenomena have been suggested and play a contributing role. However, it is often overlooked by molecular genetic researchers that transgressive segregation and heterosis are most simply explained by dispersion of favorable alleles. Therefore, advances in molecular biology will deliver the most impact on plant breeding when integrated with sources of heritable trait variation - and this will be best achieved within a quantitative genetics framework. An example of the power of quantitative approaches is the implementation of genomic selection, which has recently revolutionized animal breeding. Genomic selection is now being applied to both hybrid and inbred crops and is likely to be the major source of improvement in plant breeding practice over the next decade. Breeders' ability to efficiently apply genomic selection methodologies is due to recent technology advances in genotyping and sequencing. Furthermore, targeted integration of additional molecular data (such as gene expression, gene copy number and methylation status) into genomic prediction models may increase their performance. In this review, we discuss and contextualize a suite of established quantitative genetics themes relating to hybrid vigour, transgressive segregation and their central relevance to plant breeding, with the aim of informing crop researchers outside of the quantitative genetics discipline of their relevance and importance to crop improvement. Better understanding between molecular and quantitative disciplines will increase the potential for further improvements in plant breeding methodologies and so help underpin future food security.
Affiliation(s)
- Ian J. Mackay
- SRUC (Scotland’s Rural College), Edinburgh, UK
- IMplant Consultancy, Chelmsford, UK
|
33
|
Klosa J, Simon N, Westermark PO, Liebscher V, Wittenburg D. Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent. BMC Bioinformatics 2020; 21:407. [PMID: 32933477 PMCID: PMC7493359 DOI: 10.1186/s12859-020-03725-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/23/2020] [Accepted: 08/31/2020] [Indexed: 11/15/2022] Open
Abstract
Background Statistical analyses of biological problems in life sciences often lead to high-dimensional linear models. To solve the corresponding system of equations, penalization approaches are often the methods of choice. They are especially useful in case of multicollinearity, which appears if the number of explanatory variables exceeds the number of observations or for some biological reason. Then, the model goodness of fit is penalized by some suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined with backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths. Results Publicly available high-dimensional methylation data are used to compare seagull to the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. But even though the results of seagull and SGL were very similar (R² > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature. Conclusions The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.
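The three ingredients named in this abstract (a proximal gradient step, backtracking line search for the step size, and warm starts along a grid of penalty parameters) can be illustrated for the plain lasso with a minimal sketch. The Python code below is not the seagull implementation, which is an R package; all function names, default values and the toy data are assumptions made for illustration, and the objective is assumed to be (1/2n)·||y − Xβ||² + λ·||β||₁.

```python
# Illustrative sketch only (not the seagull code): lasso via proximal gradient
# descent with backtracking line search and warm starts. All names are hypothetical.

def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def smooth_loss(X, y, beta):
    """Smooth part of the objective: (1 / 2n) * ||y - X beta||^2."""
    total = 0.0
    for row, yi in zip(X, y):
        r = yi - sum(x * b for x, b in zip(row, beta))
        total += r * r
    return total / (2 * len(y))

def gradient(X, y, beta):
    """Gradient of smooth_loss with respect to beta."""
    n, p = len(y), len(beta)
    grad = [0.0] * p
    for row, yi in zip(X, y):
        r = sum(x * b for x, b in zip(row, beta)) - yi
        for j in range(p):
            grad[j] += row[j] * r / n
    return grad

def lasso_prox_grad(X, y, lam, beta0, step=1.0, shrink=0.5,
                    max_iter=1000, tol=1e-10):
    """Minimise smooth_loss + lam * ||beta||_1, starting from beta0."""
    beta = list(beta0)
    for _ in range(max_iter):
        grad = gradient(X, y, beta)
        f_old = smooth_loss(X, y, beta)
        t = step
        while True:
            # Gradient step followed by the proximal (soft-thresholding) step.
            new = [soft_threshold(b - t * g, t * lam) for b, g in zip(beta, grad)]
            diff = [nb - b for nb, b in zip(new, beta)]
            # Backtracking: accept t once the quadratic upper bound holds.
            bound = (f_old + sum(g * d for g, d in zip(grad, diff))
                     + sum(d * d for d in diff) / (2 * t))
            if smooth_loss(X, y, new) <= bound + 1e-12:
                break
            t *= shrink
        beta = new
        if max(abs(d) for d in diff) < tol:
            break
    return beta

def lasso_path(X, y, lambdas):
    """Regularization path from large to small lambda, using warm starts."""
    beta = [0.0] * len(X[0])
    path = []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_prox_grad(X, y, lam, beta)  # previous solution as start
        path.append((lam, beta))
    return path

# Toy data with orthogonal columns, so the solution is a soft-thresholded X'y/n.
X = [[1.0, 1.0], [1.0, -1.0]]
y = [3.0, 1.0]
path = lasso_path(X, y, [0.5, 1.5])
```

On this toy design the columns of X are orthogonal with X'X/n equal to the identity, so each lasso solution is simply the soft-thresholded vector X'y/n = (2, 1), which makes the sketch easy to check by hand. Group and sparse-group penalties would replace the scalar soft-thresholding by the corresponding group-wise proximal operator.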
Affiliation(s)
- Jan Klosa
- Institute of Genetics and Biometry, Leibniz Institute for Farm Animal Biology, 18196, Dummerstorf, Germany
- Noah Simon
- Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
- Pål Olof Westermark
- Institute of Genetics and Biometry, Leibniz Institute for Farm Animal Biology, 18196, Dummerstorf, Germany
- Volkmar Liebscher
- Institute of Mathematics and Computer Science, University of Greifswald, 17489, Greifswald, Germany
- Dörte Wittenburg
- Institute of Genetics and Biometry, Leibniz Institute for Farm Animal Biology, 18196, Dummerstorf, Germany
|
34
|
Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020; 22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 03/07/2020] [Revised: 06/25/2020] [Accepted: 07/03/2020] [Indexed: 12/18/2022] Open
Abstract
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100,000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups, especially clinical variables, from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
Affiliation(s)
- Moritz Herrmann
- Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany
- Philipp Probst
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Roman Hornung
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Vindi Jurinovic
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
|
35
|
Belhechmi S, Bin RD, Rotolo F, Michiels S. Accounting for grouped predictor variables or pathways in high-dimensional penalized Cox regression models. BMC Bioinformatics 2020; 21:277. [PMID: 32615919 PMCID: PMC7331150 DOI: 10.1186/s12859-020-03618-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/03/2020] [Accepted: 06/19/2020] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND The standard lasso penalty and its extensions are commonly used to develop a regularized regression model while selecting candidate predictor variables on a time-to-event outcome in high-dimensional data. However, these selection methods focus on a homogeneous set of variables and do not take into account the case of predictors belonging to functional groups; typically, genomic data can be grouped according to biological pathways or to different types of collected data. Another challenge is that the standard lasso penalisation is known to have a high false discovery rate. RESULTS We evaluated different penalizations in a Cox model to select grouped variables in order to further penalize variables that, in addition to having a low effect, belong to a group with a low overall effect; and to favor the selection of variables that, in addition to having a large effect, belong to a group with a large overall effect. We considered the case of prespecified and disjoint groups and proposed diverse weights for the adaptive lasso method. In particular we proposed the product Max Single Wald by Single Wald weighting (MSW*SW) which takes into account the information of the group to which it belongs and of this biomarker. Through simulations, we compared the selection and prediction ability of our approach with the standard lasso, the composite Minimax Concave Penalty (cMCP), the group exponential lasso (gel), the Integrative L1-Penalized Regression with Penalty Factors (IPF-Lasso), and the Sparse Group Lasso (SGL) methods. In addition, we illustrated the methods using gene expression data of 614 breast cancer patients. CONCLUSIONS The adaptive lasso with the MSW*SW weighting method incorporates both the information in the grouping structure and the individual variable. It outperformed the competitors by reducing the false discovery rate without severely increasing the false negative rate.
Affiliation(s)
- Shaima Belhechmi
- Université Paris-Saclay, Univ. Paris-Sud, UVSQ, CESP, INSERM U1018 Oncostat, Villejuif, F-94805, France; Service de biostatistique et d'épidémiologie, Gustave Roussy, Villejuif, F-94805, France
- Federico Rotolo
- Biostatistics and Data Management Unit, Innate Pharma, Marseille, France
- Stefan Michiels
- Université Paris-Saclay, Univ. Paris-Sud, UVSQ, CESP, INSERM U1018 Oncostat, Villejuif, F-94805, France; Service de biostatistique et d'épidémiologie, Gustave Roussy, Villejuif, F-94805, France
|
36
|
Shi WJ, Zhuang Y, Russell PH, Hobbs BD, Parker MM, Castaldi PJ, Rudra P, Vestal B, Hersh CP, Saba LM, Kechris K. Unsupervised discovery of phenotype-specific multi-omics networks. Bioinformatics 2020; 35:4336-4343. [PMID: 30957844 DOI: 10.1093/bioinformatics/btz226] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/12/2018] [Revised: 02/01/2019] [Accepted: 04/05/2019] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION Complex diseases often involve a wide spectrum of phenotypic traits. Better understanding of the biological mechanisms relevant to each trait promotes understanding of the etiology of the disease and the potential for targeted and effective treatment plans. There have been many efforts towards omics data integration and network reconstruction, but limited work has examined the incorporation of relevant (quantitative) phenotypic traits. RESULTS We propose a novel technique, sparse multiple canonical correlation network analysis (SmCCNet), for integrating multiple omics data types along with a quantitative phenotype of interest, and for constructing multi-omics networks that are specific to the phenotype. As a case study, we focus on miRNA-mRNA networks. Through simulations, we demonstrate that SmCCNet has better overall prediction performance compared to popular gene expression network construction and integration approaches under realistic settings. Applying SmCCNet to studies on chronic obstructive pulmonary disease (COPD) and breast cancer, we found enrichment of known relevant pathways (e.g. the Cadherin pathway for COPD and the interferon-gamma signaling pathway for breast cancer) as well as less well-known omics features that may be important to the diseases. Although those applications focus on miRNA-mRNA co-expression networks, SmCCNet is applicable to a variety of omics and other data types. It can also be easily generalized to incorporate multiple quantitative phenotypes simultaneously. The versatility of SmCCNet suggests great potential of the approach in many areas. AVAILABILITY AND IMPLEMENTATION The SmCCNet algorithm is written in R, and is freely available on the web at https://cran.r-project.org/web/packages/SmCCNet/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- W Jenny Shi
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Yonghua Zhuang
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Pamela H Russell
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Brian D Hobbs
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA; Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Margaret M Parker
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Peter J Castaldi
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Pratyaydipta Rudra
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA; Department of Statistics, Oklahoma State University, Stillwater, OK
- Brian Vestal
- Center for Genes, Environment & Health, National Jewish Health, Denver, CO, USA
- Craig P Hersh
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA; Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Laura M Saba
- Department of Pharmaceutical Sciences, University of Colorado, Aurora, CO, USA
- Katerina Kechris
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
|
37
|
Zhang X, de Leon J, Crespo-Facorro B, Diaz FJ. Measuring individual benefits of psychiatric treatment using longitudinal binary outcomes: Application to antipsychotic benefits in non-cannabis and cannabis users. J Biopharm Stat 2020; 30:916-940. [DOI: 10.1080/10543406.2020.1765371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]
Affiliation(s)
- Xuan Zhang
- Department of Biostatistics, The University of Kansas Medical Center, Kansas City, KS, United States
- Boston Strategic Partners, Inc, Boston, MA, United States
- Jose de Leon
- Mental Health Research Center at Eastern State Hospital, Lexington, KY, United States
- Benedicto Crespo-Facorro
- University Hospital Virgen Del Rocío, Seville, Spain
- CIBERSAM G26-IBiS, University of Seville, Seville, Spain
- Department of Psychiatry, Marqués De Valdecilla University Hospital, IDIVAL, Santander, Spain
- School of Medicine, University of Cantabria, Santander, Spain
- Francisco J. Diaz
- Department of Biostatistics, The University of Kansas Medical Center, Kansas City, KS, United States
|
38
|
Zhao Z, Zucknick M. Structured penalized regression for drug sensitivity prediction. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12400] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]
|
39
|
Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform 2020; 22:66-76. [PMID: 32227074 DOI: 10.1093/bib/bbaa032] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/21/2019] [Revised: 02/05/2020] [Accepted: 02/25/2020] [Indexed: 02/06/2023] Open
Abstract
Gene expressions are subtly regulated by quantifiable measures of genetic molecules such as interaction with other genes, methylation, mutations, transcription factor and histone modifications. Integrative analysis of multi-omics data can help scientists understand the condition or patient-specific gene regulation mechanisms. However, analysis of multi-omics data is challenging since it requires not only the analysis of multiple omics data sets but also mining complex relations among different genetic molecules by using state-of-the-art machine learning methods. In addition, analysis of multi-omics data needs quite large computing infrastructure. Moreover, interpretation of the analysis results requires collaboration among many scientists, often requiring reperforming analysis from different perspectives. Many of the aforementioned technical issues can be nicely handled when machine learning tools are deployed on the cloud. In this survey article, we first survey machine learning methods that can be used for gene regulation study, and we categorize them according to five different goals: gene regulatory subnetwork discovery, disease subtype analysis, survival analysis, clinical prediction and visualization. We also summarize the methods in terms of multi-omics input types. Then, we explain why the cloud is potentially a good solution for the analysis of multi-omics data, followed by a survey of two state-of-the-art cloud systems, Galaxy and BioVLAB. Finally, we discuss important issues when the cloud is used for the analysis of multi-omics data for the gene regulation study.
Affiliation(s)
- Minsik Oh
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
- Sungjoon Park
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul, 08826, Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, Korea; Bioinformatics Institute, Seoul National University, Seoul, 08826, Korea
- Heejoon Chae
  - Division of Computer Science, Sookmyung Women's University, Seoul, 04310, Korea
|
40
|
Jagdhuber R, Lang M, Stenzl A, Neuhaus J, Rahnenführer J. Cost-Constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms. BMC Bioinformatics 2020; 21:26. [PMID: 31992203 PMCID: PMC6986087 DOI: 10.1186/s12859-020-3361-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/16/2019] [Accepted: 01/10/2020] [Indexed: 01/22/2023] Open
Abstract
BACKGROUND With modern methods in biotechnology, the search for biomarkers has advanced to a challenging statistical task exploring high dimensional data sets. Feature selection is a widely researched preprocessing step to handle huge numbers of biomarker candidates and has special importance for the analysis of biomedical data. Such data sets often include many input features not related to the diagnostic or therapeutic target variable. A less researched, but also relevant aspect for medical applications are costs of different biomarker candidates. These costs are often financial costs, but can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods to control the total amount of such costs: greedy forward selection and genetic algorithms. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the new proposed methods and five baseline alternatives to handle budget constraints. RESULTS In simulations with a predefined budget constraint, our proposed methods outperform the baseline alternatives, with just minor differences between them. Only in the scenario without an actual budget constraint, our adapted greedy forward selection approach showed a clear drop in performance compared to the other methods. However, introducing a hyperparameter to adapt the benefit-cost trade-off in this method could overcome this weakness. CONCLUSIONS In feature cost scenarios, where a total budget has to be met, common feature selection algorithms are often not suitable to identify well performing subsets for a modelling task. Adaptations of these algorithms such as the ones proposed in this paper can help to tackle this problem.
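The idea of greedy forward selection under a total cost budget can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `score` callback, the benefit-cost exponent `xi`, and the toy marker values are all assumptions made for illustration. At each step the sketch adds the affordable feature with the largest score gain per unit of cost and stops when nothing affordable improves the score.

```python
def cost_constrained_forward_selection(features, costs, budget, score, xi=1.0):
    """Greedy forward selection that never exceeds a total cost budget.

    features: list of feature names
    costs:    dict mapping feature name -> positive cost
    budget:   maximum total cost allowed
    score:    callable taking a list of features and returning a number
              (hypothetical stand-in for a model performance measure)
    xi:       assumed trade-off exponent; xi=0 ignores costs when ranking
    """
    selected = []
    spent = 0.0
    current = score(selected)
    while True:
        best, best_ratio, best_score = None, 0.0, current
        for f in features:
            # Skip features already chosen or no longer affordable.
            if f in selected or spent + costs[f] > budget:
                continue
            candidate_score = score(selected + [f])
            gain = candidate_score - current
            if gain <= 0:
                continue
            ratio = gain / (costs[f] ** xi)  # benefit per (scaled) cost
            if ratio > best_ratio:
                best, best_ratio, best_score = f, ratio, candidate_score
        if best is None:
            break  # nothing affordable improves the score
        selected.append(best)
        spent += costs[best]
        current = best_score
    return selected, spent, current


# Toy example with made-up values: an expensive biopsy marker and cheap tests.
values = {"biopsy": 0.30, "urine": 0.20, "blood": 0.15, "noise": 0.0}
costs = {"biopsy": 10, "urine": 1, "blood": 2, "noise": 1}


def toy_score(subset):
    # Additive toy score; a real application would evaluate a classifier.
    return sum(values[f] for f in subset)


chosen, total_cost, final_score = cost_constrained_forward_selection(
    list(values), costs, budget=4, score=toy_score
)
print(chosen, total_cost)  # ['urine', 'blood'] 3.0
```

With a budget of 4 the biopsy marker is never affordable, so the sketch settles on the two cheap informative markers. Setting `xi=0` ranks candidates by raw gain while still enforcing the budget.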
Affiliation(s)
- Rudolf Jagdhuber
  - Department of Statistics, TU Dortmund, Vogelpothsweg 87, Dortmund, 44227 Germany
  - numares AG, Am BioPark 9, Regensburg, 93053 Germany
- Michel Lang
  - Department of Statistics, TU Dortmund, Vogelpothsweg 87, Dortmund, 44227 Germany
- Arnulf Stenzl
  - Klinik für Urologie, Universitätsklinikum Tübingen, Hoppe-Seyler-Str. 3, Tübingen, 72076 Germany
- Jochen Neuhaus
  - Universitätsklinikum Leipzig AöR, Department für Operative Medizin, Klinik und Poliklinik für Urologie, Liebigstr. 20, Leipzig, 04103 Germany
- Jörg Rahnenführer
  - Department of Statistics, TU Dortmund, Vogelpothsweg 87, Dortmund, 44227 Germany
|
41
|
Abstract
This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package is available from CRAN.
|
42
|
Chase EC, Boonstra PS. Accounting for established predictors with the multistep elastic net. Stat Med 2019; 38:4534-4544. [PMID: 31313344 DOI: 10.1002/sim.8313] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/21/2018] [Revised: 04/27/2019] [Accepted: 06/17/2019] [Indexed: 12/17/2022]
Abstract
Multivariable models for prediction or estimating associations with an outcome are rarely built in isolation. Instead, they are based upon a mixture of covariates that have been evaluated in earlier studies (eg, age, sex, or common biomarkers) and covariates that were collected specifically for the current study (eg, a panel of novel biomarkers or other hypothesized risk factors). For that context, we present the multistep elastic net (MSN), which considers penalized regression with variables that can be qualitatively grouped based upon their degree of prior research support: established predictors vs unestablished predictors. The MSN chooses between uniform penalization of all predictors (the standard elastic net) and weaker penalization of the established predictors in a cross-validated framework and includes the option to impose zero penalty on the established predictors. In simulation studies that reflect the motivating context, we show the comparability or superiority of the MSN over the standard elastic net, the Integrative LASSO with Penalty Factors, the sparse group lasso, and the group lasso, and we investigate the importance of not penalizing the established predictors at all. We demonstrate the MSN to update a prediction model for pediatric ECMO patient mortality.
Affiliation(s)
- Elizabeth C Chase
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Philip S Boonstra
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
|
43
|
Velten B, Huber W. Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes. Biostatistics 2019; 22:348-364. [PMID: 31596468 PMCID: PMC8036004 DOI: 10.1093/biostatistics/kxz034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/08/2018] [Revised: 06/27/2019] [Accepted: 08/14/2019] [Indexed: 12/18/2022] Open
Abstract
Penalization schemes like Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive, the question of the relative strength of penalization is often glossed over and only implicitly determined by the scale of individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to an improved prediction performance in situations where the groups have strong differences in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability and can improve prediction performance.
Affiliation(s)
- Britta Velten
  - Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
- Wolfgang Huber
  - Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
|
44
|
Richter J, Madjar K, Rahnenführer J. Model-based optimization of subgroup weights for survival analysis. Bioinformatics 2019; 35:i484-i491. [PMID: 31510644 PMCID: PMC6612842 DOI: 10.1093/bioinformatics/btz361] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023] Open
Abstract
Motivation To obtain a reliable prediction model for a specific cancer subgroup or cohort is often difficult due to limited sample size and, in survival analysis, due to potentially high censoring rates. Sometimes similar data from other patient subgroups are available, e.g. from other clinical centers. Simple pooling of all subgroups can decrease the variance of the predicted parameters of the prediction models, but also increase the bias due to heterogeneity between the cohorts. A promising compromise is to identify those subgroups with a similar relationship between covariates and target variable and then include only these for model building. Results We propose a subgroup-based weighted likelihood approach for survival prediction with high-dimensional genetic covariates. When predicting survival for a specific subgroup, for every other subgroup an individual weight determines the strength with which its observations enter into model building. MBO (model-based optimization) can be used to quickly find a good prediction model in the presence of a large number of hyperparameters. We use MBO to identify the best model for survival prediction of a specific subgroup by optimizing the weights for additional subgroups for a Cox model. The approach is evaluated on a set of lung cancer cohorts with gene expression measurements. The resulting models have competitive prediction quality, and they reflect the similarity of the corresponding cancer subgroups, with both weights close to 0 and close to 1 and medium weights. Availability and implementation mlrMBO is implemented as an R-package and is freely available at http://github.com/mlr-org/mlrMBO.
Affiliation(s)
- Jakob Richter
  - Department of Statistics, TU Dortmund University, Dortmund, Germany
- Katrin Madjar
  - Department of Statistics, TU Dortmund University, Dortmund, Germany
|
45
|
Krautenbacher N, Flach N, Böck A, Laubhahn K, Laimighofer M, Theis FJ, Ankerst DP, Fuchs C, Schaub B. A strategy for high-dimensional multivariable analysis classifies childhood asthma phenotypes from genetic, immunological, and environmental factors. Allergy 2019; 74:1364-1373. [PMID: 30737985 PMCID: PMC6767756 DOI: 10.1111/all.13745] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/12/2018] [Revised: 12/22/2018] [Accepted: 01/06/2019] [Indexed: 12/14/2022]
Abstract
Background Associations between childhood asthma phenotypes and genetic, immunological, and environmental factors have been previously established. Yet, strategies to integrate high‐dimensional risk factors from multiple distinct data sets, and thereby increase the statistical power of analyses, have been hampered by a preponderance of missing data and lack of methods to accommodate them. Methods We assembled questionnaire, diagnostic, genotype, microarray, RT‐qPCR, flow cytometry, and cytokine data (referred to as data modalities) to use as input factors for a classifier that could distinguish healthy children, mild‐to‐moderate allergic asthmatics, and nonallergic asthmatics. Based on data from 260 German children aged 4‐14 from our university outpatient clinic, we built a novel multilevel prediction approach for asthma outcome which could deal with a present complex missing data structure. Results The optimal learning method was boosting based on all data sets, achieving an area underneath the receiver operating characteristic curve (AUC) for three classes of phenotypes of 0.81 (95%‐confidence interval (CI): 0.65‐0.94) using leave‐one‐out cross‐validation. Besides improving the AUC, our integrative multilevel learning approach led to tighter CIs than using smaller complete predictor data sets (AUC = 0.82 [0.66‐0.94] for boosting). The most important variables for classifying childhood asthma phenotypes comprised novel identified genes, namely PKN2 (protein kinase N2), PTK2 (protein tyrosine kinase 2), and ALPP (alkaline phosphatase, placental). Conclusion Our combination of several data modalities using a novel strategy improved classification of childhood asthma phenotypes but requires validation in external populations. The generic approach is applicable to other multilevel data‐based risk prediction settings, which typically suffer from incomplete data.
Affiliation(s)
- Norbert Krautenbacher
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
- Nicolai Flach
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
- Andreas Böck
  - Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany
- Kristina Laubhahn
  - Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany
  - Member of German Lung Centre (DZL), CPC, Munich, Germany
- Michael Laimighofer
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
- Fabian J. Theis
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
- Donna P. Ankerst
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
  - University of Texas Health Science Center at San Antonio, San Antonio, Texas
- Christiane Fuchs
  - Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany
  - Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
  - Faculty of Business Administration and Economics, Bielefeld University, Bielefeld, Germany
- Bianca Schaub
  - Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany
  - Member of German Lung Centre (DZL), CPC, Munich, Germany
|
46
|
Hornung R, Wright MN. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics 2019; 20:358. [PMID: 31248362 PMCID: PMC6598279 DOI: 10.1186/s12859-019-2942-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/15/2019] [Accepted: 06/07/2019] [Indexed: 12/25/2022] Open
Abstract
Background In the last years more and more multi-omics data are becoming available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. Results We identify one variant termed “block forest” that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvements over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. 
This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. Conclusions The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type. Electronic supplementary material The online version of this article (10.1186/s12859-019-2942-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Roman Hornung
  - Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
- Marvin N Wright
  - Leibniz Institute for Prevention Research and Epidemiology - BIPS, Achterstr. 30, Bremen, 28359, Germany; Section of Biostatistics, Department of Public Health, University of Copenhagen, Øster Farimagsgade 5, Copenhagen, 1014, Denmark
|
47
|
López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019; 10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022] Open
Abstract
Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented into clinics or public health domains. Clinical/epidemiological data tend to explain most of the variation of health-related traits, and its joint modeling with omics data is crucial to increase the algorithm’s predictive ability. Only a small number of published studies performed a “real” integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration regard the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution we discuss different attempts of OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: Independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
Affiliation(s)
- Evangelina López de Maturana
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Lola Alonso
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Pablo Alarcón
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Isabel Adoración Martín-Antoniano
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Silvia Pineda
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Lucas Piorno
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
- M Luz Calle
  - Biosciences Department, University of Vic-Central University of Catalonia, Carrer de la Laura 13, 08570 Vic, Spain
- Núria Malats
  - Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), and CIBERONC, Melchor Fernández Almagro 3, 28029 Madrid, Spain
|
48
|
van de Wiel MA, Te Beest DE, Münch MM. Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scand Stat Theory Appl 2019; 46:2-25. [PMID: 31007342 PMCID: PMC6472625 DOI: 10.1111/sjos.12335] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/09/2017] [Revised: 01/24/2018] [Accepted: 03/22/2018] [Indexed: 12/21/2022]
Abstract
Empirical Bayes is a versatile approach to "learn from a lot" in two ways: first, from a large number of variables and, second, from a potentially large amount of prior information, for example, stored in public repositories. We review applications of a variety of empirical Bayes methods to several well-known model-based prediction methods, including penalized regression, linear discriminant analysis, and Bayesian models with sparse or dense priors. We discuss "formal" empirical Bayes methods that maximize the marginal likelihood but also more informal approaches based on other data summaries. We contrast empirical Bayes to cross-validation and full Bayes and discuss hybrid approaches. To study the relation between the quality of an empirical Bayes estimator and p, the number of variables, we consider a simple empirical Bayes estimator in a linear model setting. We argue that empirical Bayes is particularly useful when the prior contains multiple parameters, which model a priori information on variables termed "co-data". In particular, we present two novel examples that allow for co-data: first, a Bayesian spike-and-slab setting that facilitates inclusion of multiple co-data sources and types and, second, a hybrid empirical Bayes-full Bayes ridge regression approach for estimation of the posterior predictive interval.
Affiliation(s)
- Mark A. van de Wiel
  - Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
  - Department of Mathematics, VU University, Amsterdam, The Netherlands
- Dennis E. Te Beest
  - Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
- Magnus M. Münch
  - Department of Epidemiology and Biostatistics, Amsterdam Public Health Research Institute, VU University Medical Center, Amsterdam, The Netherlands
  - Mathematical Institute, Faculty of Science, Leiden University, Leiden, The Netherlands
|
49
|
Chauvel C, Novoloaca A, Veyre P, Reynier F, Becker J. Evaluation of integrative clustering methods for the analysis of multi-omics data. Brief Bioinform 2019; 21:541-552. [DOI: 10.1093/bib/bbz015] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 09/26/2018] [Revised: 01/12/2019] [Accepted: 01/16/2019] [Indexed: 12/20/2022] Open
Abstract
Recent advances in sequencing, mass spectrometry and cytometry technologies have enabled researchers to collect large-scale omics data from the same set of biological samples. The joint analysis of multiple omics offers the opportunity to uncover coordinated cellular processes acting across different omic layers. In this work, we present a thorough comparison of a selection of recent integrative clustering approaches, including Bayesian (BCC and MDI) and matrix factorization approaches (iCluster, moCluster, JIVE and iNMF). Based on simulations, the methods were evaluated on their sensitivity and their ability to recover both the correct number of clusters and the simulated clustering at the common and data-specific levels. Standard non-integrative approaches were also included to quantify the added value of integrative methods. For most matrix factorization methods and one Bayesian approach (BCC), the shared and specific structures were successfully recovered with high and moderate accuracy, respectively. An opposite behavior was observed on non-integrative approaches, i.e. high performances on specific structures only. Finally, we applied the methods on the Cancer Genome Atlas breast cancer data set to check whether results based on experimental data were consistent with those obtained in the simulations.
Affiliation(s)
- Cécile Chauvel
  - BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
- Pierre Veyre
  - BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
- Jérémie Becker
  - BIOASTER Research Institute, avenue Tony Garnier, Lyon, France
|
50
|
Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 2018; 19:322. [PMID: 30208855 PMCID: PMC6134797 DOI: 10.1186/s12859-018-2344-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 02/19/2018] [Accepted: 08/29/2018] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND The inclusion of high-dimensional omics data in prediction models has become a well-studied topic in the last decades. Although most of these methods do not account for possibly different types of variables in the set of covariates available in the same dataset, there are many such scenarios where the variables can be structured in blocks of different types, e.g., clinical, transcriptomic, and methylation data. To date, there exist a few computationally intensive approaches that make use of block structures of this kind. RESULTS In this paper we present priority-Lasso, an intuitive and practical analysis strategy for building prediction models based on Lasso that takes such block structures into account. It requires the definition of a priority order of blocks of data. Lasso models are calculated successively for every block and the fitted values of every step are included as an offset in the fit of the next step. We apply priority-Lasso in different settings on an acute myeloid leukemia (AML) dataset consisting of clinical variables, cytogenetics, gene mutations and expression variables, and compare its performance on an independent validation dataset to the performance of standard Lasso models. CONCLUSION The results show that priority-Lasso is able to keep pace with Lasso in terms of prediction accuracy. Variables of blocks with higher priorities are favored over variables of blocks with lower priority, which results in easily usable and transportable models for clinical practice.
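The successive fitting described in this abstract, where each block's Lasso fit enters the next block's fit as an offset, can be sketched for a continuous outcome as follows. This is a simplified illustration and not the published package: it assumes centred data with no intercept, a squared-error loss (so an offset amounts to fitting the residual), a basic coordinate-descent Lasso, and penalties passed in by hand rather than chosen by cross-validation. All function names are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent Lasso for (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(c * c for c in col) / n
            if norm == 0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
    return beta


def priority_lasso_sketch(blocks, y, lams):
    """Fit blocks in priority order; earlier fits act as an offset."""
    offset = [0.0] * len(y)
    coefs = []
    for X, lam in zip(blocks, lams):
        # With squared-error loss, an offset is handled by fitting y - offset.
        target = [yi - oi for yi, oi in zip(y, offset)]
        beta = lasso_cd(X, target, lam)
        fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]
        offset = [o + f for o, f in zip(offset, fitted)]
        coefs.append(beta)
    return coefs, offset


# Toy data: the high-priority block fully explains y, so the second
# block has nothing left to fit.
clinical = [[1.0], [-1.0], [2.0], [-2.0]]
omics = [[1.0], [1.0], [-1.0], [-1.0]]
y = [2.0, -2.0, 4.0, -4.0]
coefs, fitted = priority_lasso_sketch([clinical, omics], y, lams=[0.0, 0.1])
print(coefs)  # [[2.0], [0.0]]
```

Because the first block is fitted before the second sees the data, variables in higher-priority blocks get the first chance to explain the outcome, which is the behaviour the abstract describes.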
|