1
Li W, Ballard J, Zhao Y, Long Q. Knowledge-guided learning methods for integrative analysis of multi-omics data. Comput Struct Biotechnol J 2024; 23:1945-1950. [PMID: 38736693 PMCID: PMC11087912 DOI: 10.1016/j.csbj.2024.04.053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/14/2024] [Open access]
Abstract
Integrative analysis of multi-omics data has the potential to yield valuable and comprehensive insights into the molecular mechanisms underlying complex diseases such as cancer and Alzheimer's disease. However, a number of analytical challenges complicate multi-omics data integration. For instance, -omics data are usually high-dimensional, and sample sizes in multi-omics studies tend to be modest. Furthermore, when genes in an important pathway have relatively weak signals, it can be difficult to detect them individually. There is a growing body of literature on knowledge-guided learning methods that can address these challenges by incorporating biological knowledge such as functional genomics and functional proteomics into multi-omics data analysis. These methods have been shown to outperform their counterparts that do not utilize biological knowledge in tasks including prediction, feature selection, clustering, and dimension reduction. In this review, we survey recently developed knowledge-guided methods for multi-omics data integration and their applications, and discuss future research directions.
Affiliation(s)
- Wenrui Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
- Jenna Ballard
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Yize Zhao
  - Department of Biostatistics, School of Public Health, Yale University, 60 College Street, New Haven, CT 06510, USA
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
2
Li W, Chang C, Kundu S, Long Q. Accounting for network noise in graph-guided Bayesian modeling of structured high-dimensional data. Biometrics 2024; 80:ujae012. [PMID: 38483282 PMCID: PMC10938547 DOI: 10.1093/biomtc/ujae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Revised: 12/31/2023] [Accepted: 02/14/2024] [Indexed: 03/17/2024]
Abstract
There is a growing body of literature on knowledge-guided statistical learning methods for analysis of structured high-dimensional data (such as genomic and transcriptomic data) that can incorporate knowledge of underlying networks derived from functional genomics and functional proteomics. These methods have been shown to improve variable selection and prediction accuracy and yield more interpretable results. However, these methods typically use graphs extracted from existing databases or rely on subject matter expertise, which are known to be incomplete and may contain false edges. To address this gap, we propose a graph-guided Bayesian modeling framework to account for network noise in regression models involving structured high-dimensional predictors. Specifically, we use 2 sources of network information, including the noisy graph extracted from existing databases and the estimated graph from observed predictors in the dataset at hand, to inform the model for the true underlying network via a latent scale modeling framework. This model is coupled with the Bayesian regression model with structured high-dimensional predictors involving an adaptive structured shrinkage prior. We develop an efficient Markov chain Monte Carlo algorithm for posterior sampling. We demonstrate the advantages of our method over existing methods in simulations, and through analyses of a genomics dataset and another proteomics dataset for Alzheimer's disease.
Affiliation(s)
- Wenrui Li
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, PA 19104, United States
- Changgee Chang
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 46202, United States
- Suprateek Kundu
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Qi Long
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, PA 19104, United States
3
Zhang Q, Chang C, Shen L, Long Q. Incorporating graph information in Bayesian factor analysis with robust and adaptive shrinkage priors. Biometrics 2024; 80:ujad014. [PMID: 38281768 PMCID: PMC10826885 DOI: 10.1093/biomtc/ujad014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Revised: 10/20/2023] [Accepted: 11/16/2023] [Indexed: 01/30/2024]
Abstract
There has been increasing interest in decomposing high-dimensional multi-omics data into a product of low-rank and sparse matrices for the purpose of dimension reduction and feature engineering. Bayesian factor models achieve such a low-dimensional representation of the original data through different sparsity-inducing priors. However, few of these models can efficiently incorporate the information encoded by biological graphs, which has already been proven useful in many analysis tasks. In this work, we propose a Bayesian factor model with novel hierarchical priors, which incorporate biological graph knowledge as a tool for identifying groups of genes that function collaboratively. The proposed model therefore enables sparsity within networks by allowing each factor loading to be shrunk adaptively and by adding layers that relate individual shrinkage parameters to the underlying graph information, both of which yield more accurate structure recovery of the factor loadings. Further, in contrast to existing graph-incorporated approaches, these new priors overcome the phase transition phenomenon, so that they are robust to noisy edges that are inconsistent with the actual sparsity structure of the factor loadings. Finally, our model can handle both continuous and discrete data types. The proposed method is shown to outperform several existing factor analysis methods through simulation experiments and real data analyses.
Affiliation(s)
- Qiyiwen Zhang
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Changgee Chang
  - Department of Biostatistics and Health Data Science, Indiana University School of Medicine, Indianapolis, IN 47405, United States
- Li Shen
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Qi Long
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
4
Chang C, Dai Z, Oh J, Long Q. Integrative Learning of Structured High-Dimensional Data from Multiple Datasets. Stat Anal Data Min 2023; 16:120-134. [PMID: 37213790 PMCID: PMC10195070 DOI: 10.1002/sam.11601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2021] [Accepted: 10/14/2022] [Indexed: 11/11/2022]
Abstract
Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.
Affiliation(s)
- Changgee Chang
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Zongyu Dai
  - School of Arts and Science, University of Pennsylvania, Pennsylvania, U.S.A
- Jihwan Oh
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Qi Long
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
5
Sun W, Chang C, Long Q. Graph-guided Bayesian SVM with Adaptive Structured Shrinkage Prior for High-dimensional Data. Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data 2021; 2021:4472-4479. [PMID: 35187547 PMCID: PMC8855458 DOI: 10.1109/bigdata52589.2021.9671712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The support vector machine (SVM) is a popular classification method for the analysis of a wide range of data, including big biomedical data. Many SVM methods with feature selection have been developed under the frequentist regularization or Bayesian shrinkage frameworks. In parallel, the value of incorporating a priori biological knowledge, such as that from functional genomics and functional proteomics, into statistical analysis of -omic data has been recognized in recent years. Such biological information is often represented by graphs. We propose a novel method that assigns Laplace priors to the regression coefficients and incorporates the underlying graph information via a hyper-prior for the shrinkage parameters in the Laplace priors. This enables smoothing of shrinkage parameters for connected variables in the graph and conditional independence between shrinkage parameters for disconnected variables. Extensive simulations demonstrate that our proposed method achieves the best prediction accuracy among the existing SVM methods compared. The proposed method is also illustrated in an analysis of genomic data from cancer studies, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features.
Affiliation(s)
- Wenli Sun
  - Dept of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, USA
- Changgee Chang
  - Dept of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, USA
- Qi Long
  - Dept of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, USA
6
Peng Q, Weng K, Li S, Xu R, Wang Y, Wu Y. A Perspective of Epigenetic Regulation in Radiotherapy. Front Cell Dev Biol 2021; 9:624312. [PMID: 33681204 PMCID: PMC7930394 DOI: 10.3389/fcell.2021.624312] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 10/31/2020] [Accepted: 01/28/2021] [Indexed: 12/17/2022] [Open access]
Abstract
Radiation therapy (RT) has been employed as a tumoricidal modality for more than 100 years and is used on 470,000 patients each year in the United States. Ionizing radiation causes genetic changes and results in cell death. However, since the biological mechanism of radiation remains unclear, there is a pressing need to understand this mechanism to improve the killing effect on tumors and reduce the side effects on normal cells. DNA breaks and epigenetic remodeling can be induced by radiotherapy. Hence, the modulation of histone modification enzymes may tune the radiosensitivity of cancer cells. For instance, histone deacetylase (HDAC) inhibitors sensitize irradiated cancer cells by amplifying DNA damage signaling and inhibiting double-strand DNA break repair, thereby influencing the irradiated cells’ survival. However, the combination of epigenetic drugs and radiotherapy has only been evaluated in several ongoing clinical trials for limited cancer types, partly due to a lack of knowledge of the mechanisms by which radiation induces epigenetic regulation and chromatin remodeling. Here, we review recent advances in radiotherapy and radiotherapy-induced epigenetic remodeling and introduce related technologies for epigenetic monitoring. In particular, we explore the application of fluorescence resonance energy transfer (FRET) biosensors to visualize dynamic epigenetic regulation in single living cells and tissue upon radiotherapy and drug treatment. We aim to bridge FRET biosensors, epigenetics, and radiotherapy, offering a perspective on using FRET to assess epigenetics and to guide radiotherapy toward improved cancer treatment. Finally, we discuss the feasibility of combining epigenetic drugs and radiotherapy as a new approach to cancer therapeutics.
Affiliation(s)
- Qin Peng
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, United States
- Kegui Weng
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, United States
  - Chongqing Cancer Hospital, Chongqing Cancer Institute, Chongqing University Cancer Hospital, Chongqing, China
- Shitian Li
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, United States
- Richard Xu
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, United States
- Yingxiao Wang
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, United States
  - Institute of Engineering in Medicine, University of California, San Diego, La Jolla, CA, United States
- Yongzhong Wu
  - Chongqing Cancer Hospital, Chongqing Cancer Institute, Chongqing University Cancer Hospital, Chongqing, China
7
Application of Systems Engineering Principles and Techniques in Biological Big Data Analytics: A Review. Processes (Basel) 2020. [DOI: 10.3390/pr8080951] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/18/2022] [Open access]
Abstract
In the past few decades, we have witnessed tremendous advancements in biology, life sciences and healthcare. These advancements are due in no small part to the big data made available by various high-throughput technologies, ever-advancing computing power, and algorithmic advancements in machine learning. Specifically, big data analytics, such as statistical and machine learning, has become an essential tool in these rapidly developing fields. As a result, the subject has drawn increased attention, and many review papers on it have been published in just the past few years. Unlike existing reviews, this work focuses on the application of systems engineering principles and techniques in addressing some of the common challenges in big data analytics for biological, biomedical and healthcare applications. Specifically, this review focuses on three key areas in biological big data analytics where systems engineering principles and techniques have been playing important roles: the principle of parsimony in addressing overfitting, the dynamic analysis of biological data, and the role of domain knowledge in biological data analytics.