Reference Citation Analysis

For: Boulesteix AL, De Bin R, Jiang X, Fuchs M. IPF-LASSO: Integrative L₁-Penalized Regression with Penalty Factors for Prediction Based on Multi-Omics Data. Comput Math Methods Med 2017;2017:7691937. [PMID: 28546826 DOI: 10.1155/2017/7691937] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 01/20/2017] [Accepted: 03/14/2017] [Indexed: 11/29/2022]

Cited by Other Article(s)

Djordjilović V, Ponzi E, Nøst TH, Thoresen M. penalizedclr: an R package for penalized conditional logistic regression for integration of multiple omics layers. BMC Bioinformatics 2024;25:226. [PMID: 38937668 PMCID: PMC11212437 DOI: 10.1186/s12859-024-05850-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 06/20/2024] [Indexed: 06/29/2024] Open

Buch G, Schulz A, Schmidtmann I, Strauch K, Wild PS. Sparse Group Penalties for bi-level variable selection. Biom J 2024;66:e2200334. [PMID: 38747086 DOI: 10.1002/bimj.202200334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 02/05/2024] [Accepted: 02/07/2024] [Indexed: 06/29/2024]

Chai H, Lin S, Lin J, He M, Yang Y, OuYang Y, Zhao H. An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome. BMC Bioinformatics 2024;25:88. [PMID: 38418940 PMCID: PMC10902951 DOI: 10.1186/s12859-024-05716-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 02/21/2024] [Indexed: 03/02/2024] Open

Downing T, Angelopoulos N. A primer on correlation-based dimension reduction methods for multi-omics analysis. J R Soc Interface 2023;20:20230344. [PMID: 37817584 PMCID: PMC10565429 DOI: 10.1098/rsif.2023.0344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 09/19/2023] [Indexed: 10/12/2023] Open

Wang Q, He M, Guo L, Chai H. AFEI: adaptive optimized vertical federated learning for heterogeneous multi-omics data integration. Brief Bioinform 2023;24:bbad269. [PMID: 37497720 DOI: 10.1093/bib/bbad269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 06/26/2023] [Accepted: 07/04/2023] [Indexed: 07/28/2023] Open

van Nee MM, Wessels LFA, van de Wiel MA. ecpc: an R-package for generic co-data models for high-dimensional prediction. BMC Bioinformatics 2023;24:172. [PMID: 37101151 PMCID: PMC10134536 DOI: 10.1186/s12859-023-05289-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2022] [Accepted: 04/12/2023] [Indexed: 04/28/2023] Open

Abstract

BACKGROUND

High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed.

RESULTS

Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper.

CONCLUSIONS

The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on CRAN (https://cran.r-project.org/web/packages/ecpc/).
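
As a rough illustration of the "variable-specific ridge penalties" mentioned in the abstract above, the following is a minimal sketch for the linear model only. The symbols $X$, $y$, $\beta$ and $\lambda_k$ are notation assumed here, not taken from the abstract, and the way ecpc derives each $\lambda_k$ from the co-data is not shown.

```latex
\begin{align}
\hat{\beta}
  &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2
     + \sum_{k=1}^{p} \lambda_k \beta_k^2 \\
  &= \left( X^\top X + \Lambda \right)^{-1} X^\top y,
  \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)
\end{align}
```

A smaller $\lambda_k$ shrinks $\beta_k$ less, which is how a variable can be given more weight a priori; ordinary ridge regression is the special case $\lambda_1 = \dots = \lambda_p$.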

Zhang R, Datta S. Adaptive Sparse Multi-Block PLS Discriminant Analysis: An Integrative Method for Identifying Key Biomarkers from Multi-Omics Data. Genes (Basel) 2023;14:genes14050961. [PMID: 37239321 DOI: 10.3390/genes14050961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 04/06/2023] [Accepted: 04/21/2023] [Indexed: 05/28/2023] Open

Zhang R, Datta S. asmbPLS: Adaptive Sparse Multi-block Partial Least Square for Survival Prediction using Multi-Omics Data. bioRxiv 2023:2023.04.03.535442. [PMID: 37066143 PMCID: PMC10103991 DOI: 10.1101/2023.04.03.535442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]

Zhong T, Zhang Q, Huang J, Wu M, Ma S. Heterogeneity analysis via integrating multi-sources high-dimensional data with applications to cancer studies. Stat Sin 2023;33:729-758. [PMID: 38037567 PMCID: PMC10686523 DOI: 10.5705/ss.202021.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]

Tay JK, Aghaeepour N, Hastie T, Tibshirani R. Feature-weighted elastic net: using "features of features" for better prediction. Stat Sin 2023;33:259-279. [PMID: 37102071 PMCID: PMC10129060 DOI: 10.5705/ss.202020.0226] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]

Ng HM, Jiang B, Wong KY. Penalized estimation of a class of single-index varying-coefficient models for integrative genomic analysis. Biom J 2023;65:e2100139. [PMID: 35837982 DOI: 10.1002/bimj.202100139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 04/15/2022] [Accepted: 05/27/2022] [Indexed: 01/17/2023]

van Nee MM, van de Brug T, van de Wiel MA. Fast Marginal Likelihood Estimation of Penalties for Group-Adaptive Elastic Net. J Comput Graph Stat 2022;32:950-960. [PMID: 38013849 PMCID: PMC10511031 DOI: 10.1080/10618600.2022.2128809] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/29/2021] [Accepted: 09/12/2022] [Indexed: 10/10/2022]

Prognostic Gene Expression-Based Signature in Clear-Cell Renal Cell Carcinoma. Cancers (Basel) 2022;14:cancers14153754. [PMID: 35954418 PMCID: PMC9367562 DOI: 10.3390/cancers14153754] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/20/2022] [Revised: 07/21/2022] [Accepted: 07/22/2022] [Indexed: 02/01/2023] Open

Zhao Z, Wang S, Zucknick M, Aittokallio T. Tissue-specific identification of multi-omics features for pan-cancer drug response prediction. iScience 2022;25:104767. [PMID: 35992090 PMCID: PMC9385562 DOI: 10.1016/j.isci.2022.104767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 06/28/2022] [Accepted: 07/11/2022] [Indexed: 11/29/2022] Open

He H, Guo X, Yu J, Ai C, Shi S. Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking. Bioinformatics 2022;38:1542-1549. [PMID: 34908103 DOI: 10.1093/bioinformatics/btab848] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/05/2021] [Revised: 12/08/2021] [Accepted: 12/13/2021] [Indexed: 02/03/2023] Open

Abstract

MOTIVATION

Efficient identification of genes based on gene expression level has been studied to help classify different cancer types and improve prediction performance. A logistic regression model based on regularization is often an effective approach for simultaneously achieving prediction and feature (gene) selection in high-dimensional genomic data. However, standard methods ignore biological group structure and generally result in poorer predictive models.

RESULTS

In this article, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection based on the sparse group lasso penalty by stacking. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a subset of sparse feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing one ratio, which could help to overcome the inadaptability of sparse group lasso for some data. Considering that stacking weakens feature selection, we perform a post hoc feature selection, which might slightly reduce predictive performance but is superior in feature selection. Experimental results on simulated data demonstrate that our approach enjoys competitive and stable classification performance and a lower false discovery rate in feature selection for varying sets of data compared with other regularization methods. In addition, our method presents better accuracy in three public cancer datasets and identifies more powerful discriminatory and potential mutation genes for thyroid carcinoma.

AVAILABILITY AND IMPLEMENTATION

The real data underlying this article are available from https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
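
The mixing parameter described in the abstract above can be made concrete with a small sketch of the sparse group lasso penalty in its commonly used form. The function name, the argument names and the square-root group-size weighting are assumptions for illustration; they are not taken from the Stacked SGL paper or its code.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Value of a sparse group lasso penalty (illustrative sketch).

    beta   : list of coefficients
    groups : list of index lists, one per feature group
    lam    : overall penalty strength
    alpha  : mixing parameter in [0, 1]; 1 gives the lasso, 0 the group lasso
    """
    # Lasso part: sum of absolute coefficient values.
    l1_term = sum(abs(b) for b in beta)

    # Group lasso part: Euclidean norm of each group, weighted by the
    # square root of the group size (a common convention, assumed here).
    group_term = 0.0
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * norm

    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


if __name__ == "__main__":
    beta = [3.0, 4.0, 0.0]
    groups = [[0, 1], [2]]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, sparse_group_lasso_penalty(beta, groups, 1.0, alpha))
```

Stacking, as the abstract describes it, would combine models fitted at several values of `alpha` rather than committing to one; that step is not shown here.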

Das S, Mukhopadhyay I. TiMEG: an integrative statistical method for partially missing multi-omics data. Sci Rep 2021;11:24077. [PMID: 34911979 PMCID: PMC8674330 DOI: 10.1038/s41598-021-03034-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/01/2021] [Accepted: 11/24/2021] [Indexed: 11/25/2022] Open

Madjar K, Rahnenführer J. Weighted Cox regression for the prediction of heterogeneous patient subgroups. BMC Med Inform Decis Mak 2021;21:342. [PMID: 34876106 PMCID: PMC8650299 DOI: 10.1186/s12911-021-01698-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/27/2021] [Accepted: 11/23/2021] [Indexed: 01/07/2023] Open

Abstract

BACKGROUND

An important task in clinical medicine is the construction of risk prediction models for specific subgroups of patients based on high-dimensional molecular measurements such as gene expression data. Major objectives in modeling high-dimensional data are good prediction performance and feature selection to find a subset of predictors that are truly associated with a clinical outcome such as a time-to-event endpoint. In clinical practice, this task is challenging since patient cohorts are typically small and can be heterogeneous with regard to their relationship between predictors and outcome. When data of several subgroups of patients with the same or similar disease are available, it is tempting to combine them to increase sample size, such as in multicenter studies. However, heterogeneity between subgroups can lead to biased results and subgroup-specific effects may remain undetected.

METHODS

For this situation, we propose a penalized Cox regression model with a weighted version of the Cox partial likelihood that includes patients of all subgroups but assigns them individual weights based on their subgroup affiliation. The weights are estimated from the data such that patients who are likely to belong to the subgroup of interest obtain higher weights in the subgroup-specific model.

RESULTS

Our proposed approach is evaluated through simulations and application to real lung cancer cohorts, and compared to existing approaches. Simulation results demonstrate that our proposed model is superior to standard approaches in terms of prediction performance and variable selection accuracy when the sample size is small.

CONCLUSIONS

The results suggest that sharing information between subgroups by incorporating appropriate weights into the likelihood can increase power to identify the prognostic covariates and improve risk prediction.
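
For orientation, a weighted Cox partial log-likelihood of the kind described in the METHODS paragraph above is commonly written as follows. This is a sketch in assumed notation ($w_i$ for the patient weight, $\delta_i$ for the event indicator, $R(t_i)$ for the risk set at time $t_i$, and a generic penalty $\operatorname{pen}_\lambda$); the exact form and the weight estimation used by the authors may differ.

```latex
\begin{align}
\ell_w(\beta) &= \sum_{i=1}^{n} w_i\, \delta_i
  \left[ x_i^\top \beta
  - \log \sum_{j \in R(t_i)} w_j \exp\!\left( x_j^\top \beta \right) \right], \\
\hat{\beta} &= \arg\max_{\beta}\; \ell_w(\beta) - \operatorname{pen}_\lambda(\beta)
\end{align}
```

Setting all $w_i = 1$ recovers the ordinary partial log-likelihood of the combined cohort, while setting $w_i = 0$ for patients outside the subgroup of interest recovers a model fitted to that subgroup alone.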

Learning social networks from text data using covariate information. Stat Methods Appl 2021. [DOI: 10.1007/s10260-021-00586-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]

Abstract

Accurately describing the lives of historical figures can be challenging, but unraveling their social structures is perhaps even more so. Historical social network analysis methods can help in this regard and may even illuminate individuals who have been overlooked by historians but turn out to be influential social connection points. Text data, such as biographies, are a useful source of information for learning historical social networks, but the identification of links based on text data can be challenging. The Local Poisson Graphical Lasso model represents social networks by conditional independence structures and leverages the number of name co-mentions in the text to infer relationships. However, this method does not take into account the abundance of covariate information that is often available in text data. Conditional independence structures such as the Poisson Graphical Model, which make use of name mention counts in the text, can be useful tools to avoid false positive links due to co-mentions, but given the historical tendency toward frequently used or common names, without additional distinguishing information, we may introduce incorrect connections. In this work, we therefore extend the Local Poisson Graphical Lasso model with a (multiple) penalty structure that incorporates covariates, opening up the opportunity for similar individuals to have a higher probability of being connected. We propose both greedy and Bayesian approaches to estimate the penalty parameters. We present results on data simulated with characteristics of historical networks and show that this type of penalty structure can improve network recovery as measured by precision and recall. We also illustrate the approach on biographical data of individuals who lived in early modern Britain between 1500 and 1575.
We show how these covariates affect the statistical model’s performance using simulations and discuss how the approach helps to better identify links for people with common names and those who are traditionally underrepresented in biographical text data.

Synergistic Effects of Different Levels of Genomic Data for the Staging of Lung Adenocarcinoma: An Illustrative Study. Genes (Basel) 2021;12:genes12121872. [PMID: 34946821 PMCID: PMC8700916 DOI: 10.3390/genes12121872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2021] [Revised: 11/18/2021] [Accepted: 11/24/2021] [Indexed: 11/17/2022] Open

Abstract

Lung adenocarcinoma (LUAD) is a common and very lethal cancer. Accurate staging is a prerequisite for its effective diagnosis and treatment. Therefore, improving the accuracy of the stage prediction of LUAD patients is of great clinical relevance. Previous works have mainly focused on single genomic data information or a small number of different omics data types concurrently for generating predictive models. A few of them have considered multi-omics data from genome to proteome. We used a publicly available dataset to illustrate the potential of multi-omics data for stage prediction in LUAD. In particular, we investigated the roles of the specific omics data types in the prediction process. We used a self-developed method, Omics-MKL, for stage prediction that combines an existing feature ranking technique Minimum Redundancy and Maximum Relevance (mRMR), which avoids redundancy among the selected features, and multiple kernel learning (MKL), applying different kernels for different omics data types. Each of the considered omics data types individually provided useful prediction results. Moreover, using multi-omics data delivered notably better results than using single-omics data. Gene expression and methylation information seem to play vital roles in the staging of LUAD. The Omics-MKL method retained 70 features after the selection process. Of these, 21 (30%) were methylation features and 34 (48.57%) were gene expression features. Moreover, 18 (25.71%) of the selected features are known to be related to LUAD, and 29 (41.43%) to lung cancer in general. Using multi-omics data from genome to proteome for predicting the stage of LUAD seems promising because each omics data type may improve the accuracy of the predictions. Here, methylation and gene expression data may play particularly important roles.

van Nee MM, Wessels LFA, van de Wiel MA. Flexible co-data learning for high-dimensional prediction. Stat Med 2021;40:5910-5925. [PMID: 34438466 PMCID: PMC9292202 DOI: 10.1002/sim.9162] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/09/2020] [Revised: 05/18/2021] [Accepted: 07/29/2021] [Indexed: 02/06/2023]

Han H, Dawson KJ. Applying elastic-net regression to identify the best models predicting changes in civic purpose during the emerging adulthood. J Adolesc 2021;93:20-27. [PMID: 34634726 DOI: 10.1016/j.adolescence.2021.09.011] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/25/2020] [Revised: 08/04/2021] [Accepted: 09/29/2021] [Indexed: 10/20/2022]

Abstract

INTRODUCTION

Changes in civic purpose during emerging adulthood have been a significant research topic since they are closely associated with active civic engagement later in life. While standard regression methods have been used in previous studies to predict civic purpose development, they have limitations that may not always lead to the best prediction models. We aimed to address these limitations by utilizing elastic-net multinomial logistic regression, which favors models with the fewest necessary predictors, to explore predictors of civic purpose development in a data-driven manner.

METHODS

We analyzed data from the longitudinal Civic Purpose Project while focusing on the model that best predicted civic purpose from Wave 1 (12th grade before high school graduation) to Wave 2 (two years after Wave 1). The reanalyzed data included responses from 476 participants (60.29% females, 39.08% males) who were recruited from Californian high schools in the United States and completed the survey at both Waves. The elastic-net regression was performed 5000 times for predicting three dependent variables, Wave 2 political purpose, community service purpose, and expressive activity purpose, with Wave 1 predictors. We identified which predictors were selected as the constituents of the best regression models during the elastic-net regression process.

RESULTS

Results showed that civic purpose, moral and political identity, and external supports (e.g., parental and peer involvement, school civic opportunities, etc.) in Wave 1 significantly predicted civic purpose in Wave 2. Several predictors were excluded from the regression models during the elastic-net regression process.

CONCLUSION

We found that the elastic-net regression was able to present the more regularized model for prediction. Implications for promoting civic purpose are discussed as well as utilizing the elastic-net regression method.

Duan R, Gao L, Gao Y, Hu Y, Xu H, Huang M, Song K, Wang H, Dong Y, Jiang C, Zhang C, Jia S. Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Comput Biol 2021;17:e1009224. [PMID: 34383739 PMCID: PMC8384175 DOI: 10.1371/journal.pcbi.1009224] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 02/27/2021] [Revised: 08/24/2021] [Accepted: 06/28/2021] [Indexed: 11/18/2022] Open

Abstract

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.

Cancer is one of the most heterogeneous diseases, characterized by diverse morphological, phenotypic, and genomic profiles between tumors and their subtypes. Identifying cancer subtypes can help patients receive precise treatments. With the development of high-throughput technologies, genomics, epigenomics, and transcriptomics data have been generated for large cancer patient cohorts. It is believed that the more omics data we use, the more accurate the identification of cancer subtypes becomes. To examine this assumption, we first constructed three classes of benchmarking datasets to conduct a comprehensive evaluation and comparison of ten representative multi-omics data integration methods for cancer subtyping by considering their accuracy, robustness, and computational efficiency. Then, we investigated the influence of different omics data and their various combinations on the effectiveness of cancer subtyping. Our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. We hope that our work may help researchers choose a proper method and an effective data combination when identifying cancer subtypes using data integration methods.

Zeng C, Thomas DC, Lewinger JP. Incorporating prior knowledge into regularized regression. Bioinformatics 2021;37:514-521. [PMID: 32915960 PMCID: PMC8599719 DOI: 10.1093/bioinformatics/btaa776] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/12/2020] [Revised: 08/13/2020] [Accepted: 09/01/2020] [Indexed: 01/15/2023] Open

Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat Comput Sci 2021;1:395-402. [PMID: 38217236 DOI: 10.1038/s43588-021-00086-z] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 03/18/2021] [Accepted: 05/17/2021] [Indexed: 01/15/2024]

Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021;22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open

van de Wiel MA, van Nee MM, Rauschenberger A. Fast Cross-validation for Multi-penalty High-dimensional Ridge Regression. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1904962] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]

Magazzù G, Zampieri G, Angione C. Multimodal regularised linear models with flux balance analysis for mechanistic integration of omics data. Bioinformatics 2021;37:3546-3552. [PMID: 33974036 DOI: 10.1093/bioinformatics/btab324] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/16/2020] [Revised: 01/06/2021] [Accepted: 04/27/2021] [Indexed: 12/13/2022] Open

Zhao L, Dong Q, Luo C, Wu Y, Bu D, Qi X, Luo Y, Zhao Y. DeepOmix: A scalable and interpretable multi-omics deep learning framework and application in cancer survival analysis. Comput Struct Biotechnol J 2021;19:2719-2725. [PMID: 34093987 PMCID: PMC8131983 DOI: 10.1016/j.csbj.2021.04.067] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 03/01/2021] [Revised: 04/26/2021] [Accepted: 04/27/2021] [Indexed: 01/23/2023] Open

Cao W, Luo C, Lei M, Shen M, Ding W, Wang M, Song M, Ge J, Zhang Q. Development and Validation of a Dynamic Nomogram to Predict the Risk of Neonatal White Matter Damage. Front Hum Neurosci 2021;14:584236. [PMID: 33708079 PMCID: PMC7940363 DOI: 10.3389/fnhum.2020.584236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/16/2020] [Accepted: 12/31/2020] [Indexed: 12/23/2022] Open

Abstract

Purpose

White matter damage (WMD) was defined as the appearance of rough and uneven echo enhancement in the white matter around the ventricle. The aim of this study was to develop and validate a risk prediction model for neonatal WMD.

Materials and Methods

We collected data for 1,733 infants hospitalized at the Department of Neonatology at The First Affiliated Hospital of Zhengzhou University from 2017 to 2020. Infants were randomly assigned to training (n = 1,216) or validation (n = 517) cohorts at a ratio of 7:3. Multivariate logistic regression and least absolute shrinkage and selection operator (LASSO) regression analyses were used to establish a risk prediction model and web-based risk calculator based on the training cohort data. The predictive accuracy of the model was verified in the validation cohort.

Results

We identified four variables as independent risk factors for brain WMD in neonates by multivariate logistic regression and LASSO analysis, including gestational age, fetal distress, prelabor rupture of membranes, and use of corticosteroids. These were used to establish a risk prediction nomogram and web-based calculator (https://caowenjun.shinyapps.io/dynnomapp/). The C-index of the training and validation sets was 0.898 (95% confidence interval: 0.8745-0.9215) and 0.887 (95% confidence interval: 0.8478-0.9262), respectively. Decision tree analysis showed that the model was highly effective in the threshold range of 1-61%. The sensitivity and specificity of the model were 82.5 and 81.7%, respectively, and the cutoff value was 0.099.

Conclusion

This is the first study describing the use of a nomogram and web-based calculator to predict the risk of WMD in neonates. The web-based calculator increases the applicability of the predictive model and is a convenient tool for doctors at primary hospitals and outpatient clinics, family doctors, and even parents to identify high-risk births early on and implement appropriate interventions while avoiding excessive treatment of low-risk patients.

Krautenbacher N, Kabesch M, Horak E, Braun-Fahrländer C, Genuneit J, Boznanski A, von Mutius E, Theis F, Fuchs C, Ege MJ. Asthma in farm children is more determined by genetic polymorphisms and in non-farm children by environmental factors. Pediatr Allergy Immunol 2021;32:295-304. [PMID: 32997854 DOI: 10.1111/pai.13385] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/27/2020] [Revised: 09/22/2020] [Accepted: 09/23/2020] [Indexed: 01/06/2023]

Affiliation(s)

Norbert Krautenbacher: Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany
Michael Kabesch: University Children's Hospital Regensburg (KUNO), Regensburg, Germany; Clinic for Pediatric Pneumology and Neonatology, Hannover Medical School, Hannover, Germany; The German Center for Lung Research (DZL), Germany
Elisabeth Horak: Department of Pediatrics and Adolescents, Innsbruck Medical University, Innsbruck, Austria
Charlotte Braun-Fahrländer: Swiss Tropical and Public Health Institute Basel, Basel, Switzerland; University of Basel, Basel, Switzerland
Jon Genuneit: Institute of Epidemiology and Medical Biometry, Ulm University, Ulm, Germany; Pediatric Epidemiology, Department of Pediatrics, Medical Faculty, Leipzig University, Leipzig, Germany
Andrzej Boznanski: Wroclaw Medical University, Wroclaw, Poland
Erika von Mutius: The German Center for Lung Research (DZL), Germany; Dr von Hauner Children's Hospital, LMU Munich, Munich, Germany; Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Asthma and Allergy Prevention, Neuherberg, Germany
Fabian Theis: Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany
Christiane Fuchs: Helmholtz Zentrum München, German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany; Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Technische Universität München, Garching, Germany; Department of Business Administration and Economics, Bielefeld University, Bielefeld, Germany
Markus J Ege: The German Center for Lung Research (DZL), Germany; Dr von Hauner Children's Hospital, LMU Munich, Munich, Germany

Wu M, Jiang Y, Ma S. Integration of Proteomics and Other Omics Data. Methods Mol Biol 2021;2361:307-324. [PMID: 34236669 DOI: 10.1007/978-1-0716-1641-3_18] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]

Mackay IJ, Cockram J, Howell P, Powell W. Understanding the classics: the unifying concepts of transgressive segregation, inbreeding depression and heterosis and their central relevance for crop breeding. Plant Biotechnol J 2021;19:26-34. [PMID: 32996672 PMCID: PMC7769232 DOI: 10.1111/pbi.13481] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 01/23/2020] [Revised: 09/07/2020] [Accepted: 09/12/2020] [Indexed: 05/12/2023]

Klosa J, Simon N, Westermark PO, Liebscher V, Wittenburg D. Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent. BMC Bioinformatics 2020;21:407. [PMID: 32933477 PMCID: PMC7493359 DOI: 10.1186/s12859-020-03725-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/23/2020] [Accepted: 08/31/2020] [Indexed: 11/15/2022] Open

Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020;22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 03/07/2020] [Revised: 06/25/2020] [Accepted: 07/03/2020] [Indexed: 12/18/2022] Open

Abstract

Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied to 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups, especially clinical variables, from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited.

Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198

Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

All analyses are reproducible using R code freely available on GitHub.

Belhechmi S, Bin RD, Rotolo F, Michiels S. Accounting for grouped predictor variables or pathways in high-dimensional penalized Cox regression models. BMC Bioinformatics 2020;21:277. [PMID: 32615919 PMCID: PMC7331150 DOI: 10.1186/s12859-020-03618-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/03/2020] [Accepted: 06/19/2020] [Indexed: 12/28/2022] Open

Abstract

BACKGROUND

The standard lasso penalty and its extensions are commonly used to develop a regularized regression model while selecting candidate predictor variables on a time-to-event outcome in high-dimensional data. However, these selection methods focus on a homogeneous set of variables and do not take into account the case of predictors belonging to functional groups; typically, genomic data can be grouped according to biological pathways or to different types of collected data. Another challenge is that the standard lasso penalisation is known to have a high false discovery rate.

RESULTS

We evaluated different penalizations in a Cox model to select grouped variables, with two aims: to further penalize variables that, in addition to having a low effect, belong to a group with a low overall effect, and to favor the selection of variables that, in addition to having a large effect, belong to a group with a large overall effect. We considered the case of prespecified and disjoint groups and proposed diverse weights for the adaptive lasso method. In particular, we proposed the product Max Single Wald by Single Wald weighting (MSW*SW), which takes into account information on both the individual biomarker and the group to which it belongs. Through simulations, we compared the selection and prediction ability of our approach with the standard lasso, the composite Minimax Concave Penalty (cMCP), the group exponential lasso (gel), the Integrative L1-Penalized Regression with Penalty Factors (IPF-Lasso), and the Sparse Group Lasso (SGL) methods. In addition, we illustrated the methods using gene expression data of 614 breast cancer patients.

CONCLUSIONS

The adaptive lasso with the MSW*SW weighting method incorporates both the information in the grouping structure and the individual variable. It outperformed the competitors by reducing the false discovery rate without severely increasing the false negative rate.

Shi WJ, Zhuang Y, Russell PH, Hobbs BD, Parker MM, Castaldi PJ, Rudra P, Vestal B, Hersh CP, Saba LM, Kechris K. Unsupervised discovery of phenotype-specific multi-omics networks. Bioinformatics 2020;35:4336-4343. [PMID: 30957844 DOI: 10.1093/bioinformatics/btz226] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/12/2018] [Revised: 02/01/2019] [Accepted: 04/05/2019] [Indexed: 12/15/2022] Open

Abstract

MOTIVATION

Complex diseases often involve a wide spectrum of phenotypic traits. Better understanding of the biological mechanisms relevant to each trait promotes understanding of the etiology of the disease and the potential for targeted and effective treatment plans. There have been many efforts towards omics data integration and network reconstruction, but limited work has examined the incorporation of relevant (quantitative) phenotypic traits.

RESULTS

We propose a novel technique, sparse multiple canonical correlation network analysis (SmCCNet), for integrating multiple omics data types along with a quantitative phenotype of interest, and for constructing multi-omics networks that are specific to the phenotype. As a case study, we focus on miRNA-mRNA networks. Through simulations, we demonstrate that SmCCNet has better overall prediction performance compared to popular gene expression network construction and integration approaches under realistic settings. Applying SmCCNet to studies on chronic obstructive pulmonary disease (COPD) and breast cancer, we found enrichment of known relevant pathways (e.g. the Cadherin pathway for COPD and the interferon-gamma signaling pathway for breast cancer) as well as lesser-known omics features that may be important to the diseases. Although those applications focus on miRNA-mRNA co-expression networks, SmCCNet is applicable to a variety of omics and other data types. It can also be easily generalized to incorporate multiple quantitative phenotypes simultaneously. The versatility of SmCCNet suggests great potential of the approach in many areas.

AVAILABILITY AND IMPLEMENTATION

The SmCCNet algorithm is written in R, and is freely available on the web at https://cran.r-project.org/web/packages/SmCCNet/index.html.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Zhang X, de Leon J, Crespo-Facorro B, Diaz FJ. Measuring individual benefits of psychiatric treatment using longitudinal binary outcomes: Application to antipsychotic benefits in non-cannabis and cannabis users. J Biopharm Stat 2020;30:916-940. [DOI: 10.1080/10543406.2020.1765371] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022]

Zhao Z, Zucknick M. Structured penalized regression for drug sensitivity prediction. J R Stat Soc Ser C Appl Stat 2020. [DOI: 10.1111/rssc.12400] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]

Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform 2020;22:66-76. [PMID: 32227074 DOI: 10.1093/bib/bbaa032] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/21/2019] [Revised: 02/05/2020] [Accepted: 02/25/2020] [Indexed: 02/06/2023] Open

Jagdhuber R, Lang M, Stenzl A, Neuhaus J, Rahnenführer J. Cost-Constrained feature selection in binary classification: adaptations for greedy forward selection and genetic algorithms. BMC Bioinformatics 2020;21:26. [PMID: 31992203 PMCID: PMC6986087 DOI: 10.1186/s12859-020-3361-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/16/2019] [Accepted: 01/10/2020] [Indexed: 01/22/2023] Open

Abstract

BACKGROUND

With modern methods in biotechnology, the search for biomarkers has advanced to a challenging statistical task exploring high dimensional data sets. Feature selection is a widely researched preprocessing step to handle huge numbers of biomarker candidates and has special importance for the analysis of biomedical data. Such data sets often include many input features not related to the diagnostic or therapeutic target variable. A less researched but also relevant aspect for medical applications is the cost of different biomarker candidates. These costs are often financial, but can also refer to other aspects, for example the decision between a painful biopsy marker and a simple urine test. In this paper, we propose extensions to two feature selection methods to control the total amount of such costs: greedy forward selection and genetic algorithms. In comprehensive simulation studies of binary classification tasks, we compare the predictive performance, the run-time and the detection rate of relevant features for the newly proposed methods and five baseline alternatives to handle budget constraints.

RESULTS

In simulations with a predefined budget constraint, our proposed methods outperform the baseline alternatives, with just minor differences between them. Only in the scenario without an actual budget constraint did our adapted greedy forward selection approach show a clear drop in performance compared to the other methods. However, introducing a hyperparameter to adapt the benefit-cost trade-off in this method could overcome this weakness.

CONCLUSIONS

In feature cost scenarios where a total budget has to be met, common feature selection algorithms are often not suitable for identifying well-performing subsets for a modelling task. Adaptations of these algorithms such as the ones proposed in this paper can help to tackle this problem.

Sparse classification with paired covariates. Adv Data Anal Classif 2019. [DOI: 10.1007/s11634-019-00375-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/24/2022]

Chase EC, Boonstra PS. Accounting for established predictors with the multistep elastic net. Stat Med 2019;38:4534-4544. [PMID: 31313344 DOI: 10.1002/sim.8313] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/21/2018] [Revised: 04/27/2019] [Accepted: 06/17/2019] [Indexed: 12/17/2022]

Velten B, Huber W. Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes. Biostatistics 2019;22:348-364. [PMID: 31596468 PMCID: PMC8036004 DOI: 10.1093/biostatistics/kxz034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 11/08/2018] [Revised: 06/27/2019] [Accepted: 08/14/2019] [Indexed: 12/18/2022] Open

Richter J, Madjar K, Rahnenführer J. Model-based optimization of subgroup weights for survival analysis. Bioinformatics 2019;35:i484-i491. [PMID: 31510644 PMCID: PMC6612842 DOI: 10.1093/bioinformatics/btz361] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023] Open

Krautenbacher N, Flach N, Böck A, Laubhahn K, Laimighofer M, Theis FJ, Ankerst DP, Fuchs C, Schaub B. A strategy for high-dimensional multivariable analysis classifies childhood asthma phenotypes from genetic, immunological, and environmental factors. Allergy 2019;74:1364-1373. [PMID: 30737985 PMCID: PMC6767756 DOI: 10.1111/all.13745] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/12/2018] [Revised: 12/22/2018] [Accepted: 01/06/2019] [Indexed: 12/14/2022]

Abstract

Background

Associations between childhood asthma phenotypes and genetic, immunological, and environmental factors have been previously established. Yet, strategies to integrate high‐dimensional risk factors from multiple distinct data sets, and thereby increase the statistical power of analyses, have been hampered by a preponderance of missing data and lack of methods to accommodate them.

Methods

We assembled questionnaire, diagnostic, genotype, microarray, RT‐qPCR, flow cytometry, and cytokine data (referred to as data modalities) to use as input factors for a classifier that could distinguish healthy children, mild‐to‐moderate allergic asthmatics, and nonallergic asthmatics. Based on data from 260 German children aged 4‐14 from our university outpatient clinic, we built a novel multilevel prediction approach for asthma outcome that could handle the complex missing-data structure present in the data.

Results

The optimal learning method was boosting based on all data sets, achieving an area under the receiver operating characteristic curve (AUC) for three classes of phenotypes of 0.81 (95%‐confidence interval (CI): 0.65‐0.94) using leave‐one‐out cross‐validation. Besides improving the AUC, our integrative multilevel learning approach led to tighter CIs than using smaller complete predictor data sets (AUC = 0.82 [0.66‐0.94] for boosting). The most important variables for classifying childhood asthma phenotypes comprised newly identified genes, namely PKN2 (protein kinase N2), PTK2 (protein tyrosine kinase 2), and ALPP (alkaline phosphatase, placental).

Conclusion

Our combination of several data modalities using a novel strategy improved classification of childhood asthma phenotypes but requires validation in external populations. The generic approach is applicable to other multilevel data‐based risk prediction settings, which typically suffer from incomplete data.

Affiliation(s)

Norbert Krautenbacher: Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany; Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
Nicolai Flach: Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany; Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
Andreas Böck: Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany
Kristina Laubhahn: Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany; Member of German Lung Centre (DZL), CPC, Munich, Germany
Michael Laimighofer: Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany; Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
Fabian J. Theis: Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany; Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany
Donna P. Ankerst: Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany; University of Texas Health Science Center at San Antonio, San Antonio, Texas
Christiane Fuchs: Institute of Computational Biology, Helmholtz Zentrum München, German Research Center for Environmental Health GmbH, Neuherberg, Germany; Technische Universität München, Center for Mathematics, Chair of Mathematical Modeling of Biological Systems, Garching, Germany; Faculty of Business Administration and Economics, Bielefeld University, Bielefeld, Germany
Bianca Schaub: Department of Pulmonary and Allergy, Dr. von Hauner Children's Hospital, LMU Munich, Germany; Member of German Lung Centre (DZL), CPC, Munich, Germany

Hornung R, Wright MN. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics 2019;20:358. [PMID: 31248362 PMCID: PMC6598279 DOI: 10.1186/s12859-019-2942-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/15/2019] [Accepted: 06/07/2019] [Indexed: 12/25/2022] Open

Abstract

Background

In recent years, more and more multi-omics data have become available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available.

Results

We identify one variant termed “block forest” that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvements over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application.

Conclusions

The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.

Electronic supplementary material

The online version of this article (10.1186/s12859-019-2942-y) contains supplementary material, which is available to authorized users.

López de Maturana E, Alonso L, Alarcón P, Martín-Antoniano IA, Pineda S, Piorno L, Calle ML, Malats N. Challenges in the Integration of Omics and Non-Omics Data. Genes (Basel) 2019;10:genes10030238. [PMID: 30897838 PMCID: PMC6471713 DOI: 10.3390/genes10030238] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 01/02/2019] [Revised: 03/05/2019] [Accepted: 03/14/2019] [Indexed: 11/16/2022] Open

van de Wiel MA, Te Beest DE, Münch MM. Learning from a lot: Empirical Bayes for high-dimensional model-based prediction. Scand Stat Theory Appl 2019;46:2-25. [PMID: 31007342 PMCID: PMC6472625 DOI: 10.1111/sjos.12335] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 06/09/2017] [Revised: 01/24/2018] [Accepted: 03/22/2018] [Indexed: 12/21/2022]

Chauvel C, Novoloaca A, Veyre P, Reynier F, Becker J. Evaluation of integrative clustering methods for the analysis of multi-omics data. Brief Bioinform 2019;21:541-552. [DOI: 10.1093/bib/bbz015] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 09/26/2018] [Revised: 01/12/2019] [Accepted: 01/16/2019] [Indexed: 12/20/2022] Open

Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 2018;19:322. [PMID: 30208855 PMCID: PMC6134797 DOI: 10.1186/s12859-018-2344-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 02/19/2018] [Accepted: 08/29/2018] [Indexed: 12/18/2022] Open