1
Chen F, Zou G, Wu Y, Ou-Yang L. Clustering single-cell multi-omics data via graph regularized multi-view ensemble learning. Bioinformatics 2024; 40:btae169. [PMID: 38547401 PMCID: PMC11015955 DOI: 10.1093/bioinformatics/btae169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 02/21/2024] [Accepted: 03/26/2024] [Indexed: 04/15/2024]
Abstract
MOTIVATION Single-cell clustering plays a crucial role in distinguishing between cell types, facilitating the analysis of cell heterogeneity mechanisms. While many existing clustering methods rely solely on gene expression data obtained from single-cell RNA sequencing techniques to identify cell clusters, the information contained in mono-omic data is often limited, leading to suboptimal clustering performance. The emergence of single-cell multi-omics sequencing technologies enables the integration of multiple omics data for identifying cell clusters, but how to integrate different omics data effectively remains challenging. In addition, designing a clustering method that performs well across various types of multi-omics data poses a persistent challenge due to the data's inherent characteristics. RESULTS In this paper, we propose a graph-regularized multi-view ensemble clustering (GRMEC-SC) model for single-cell clustering. Our proposed approach can adaptively integrate multiple omics data and leverage insights from multiple base clustering results. We extensively evaluate our method on five multi-omics datasets through a series of rigorous experiments. The results of these experiments demonstrate that our GRMEC-SC model achieves competitive performance across diverse multi-omics datasets with varying characteristics. AVAILABILITY AND IMPLEMENTATION Implementation of GRMEC-SC, along with examples, can be found on the GitHub repository: https://github.com/polarisChen/GRMEC-SC.
Affiliation(s)
- Fuqun Chen
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
- Guanhua Zou
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
- Yongxian Wu
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
- Le Ou-Yang
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
2
Liu J, Yan C, Yu Y, Lu C, Huang J, Ou-Yang L, Zhao P. MARS: a motif-based autoregressive model for retrosynthesis prediction. Bioinformatics 2024; 40:btae115. [PMID: 38426338 PMCID: PMC10948277 DOI: 10.1093/bioinformatics/btae115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 01/30/2024] [Accepted: 02/27/2024] [Indexed: 03/02/2024]
Abstract
MOTIVATION Retrosynthesis is a critical task in drug discovery, aimed at finding a viable pathway for synthesizing a given target molecule. Many existing approaches frame this task as a graph generation problem. Specifically, these methods first identify the reaction center and break the target molecule accordingly to generate synthons. Reactants are then generated either by adding atoms sequentially to the synthon graphs or by directly adding appropriate leaving groups. However, both strategies have limitations. Adding atoms results in a long prediction sequence that increases the complexity of generation, while adding leaving groups only considers those in the training set, which leads to poor generalization. RESULTS In this paper, we propose a novel end-to-end graph generation model for retrosynthesis prediction, which sequentially identifies the reaction center, generates the synthons, and adds motifs to the synthons to generate reactants. Because chemically meaningful motifs fall between atoms and leaving groups in size, our model achieves lower prediction complexity than adding atoms and better performance than adding leaving groups. We evaluate our proposed model on a benchmark dataset and show that it significantly outperforms previous state-of-the-art models. Furthermore, we conduct ablation studies to investigate the contribution of each component of our proposed model to the overall performance on benchmark datasets. Experimental results demonstrate the effectiveness of our model in predicting retrosynthesis pathways and suggest its potential as a valuable tool in drug discovery. AVAILABILITY AND IMPLEMENTATION All code and data are available at https://github.com/szu-ljh2020/MARS.
Affiliation(s)
- Jiahan Liu
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
- Chaochao Yan
  - Computer Science and Engineering Department, University of Texas at Arlington, Arlington 76019, TX, United States
- Yang Yu
  - Tencent AI Lab, Shenzhen 518057, Guangdong, China
- Chan Lu
  - Tencent AI Lab, Shenzhen 518057, Guangdong, China
- Junzhou Huang
  - Computer Science and Engineering Department, University of Texas at Arlington, Arlington 76019, TX, United States
- Le Ou-Yang
  - College of Electronic and Information Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, Guangdong, China
  - Shenzhen Key Laboratory of Media Security and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, Guangdong, China
- Peilin Zhao
  - Tencent AI Lab, Shenzhen 518057, Guangdong, China
3
He Z, Hu S, Chen Y, An S, Zhou J, Liu R, Shi J, Wang J, Dong G, Shi J, Zhao J, Ou-Yang L, Zhu Y, Bo X, Ying X. Mosaic integration and knowledge transfer of single-cell multimodal data with MIDAS. Nat Biotechnol 2024:10.1038/s41587-023-02040-y. [PMID: 38263515 DOI: 10.1038/s41587-023-02040-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Accepted: 10/23/2023] [Indexed: 01/25/2024]
Abstract
Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its reliability and its superiority to 19 other methods by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. MIDAS is available at https://github.com/labomics/midas.
Affiliation(s)
- Zhen He
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Shuofeng Hu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Yaowen Chen
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Sijing An
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiahao Zhou
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Runyan Liu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Junfeng Shi
  - School of Automation, China University of Geosciences, Wuhan, China
- Jing Wang
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Guohua Dong
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jinhui Shi
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiaxin Zhao
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Yuan Zhu
  - School of Automation, China University of Geosciences, Wuhan, China
- Xiaochen Bo
  - Institute of Health Service and Transfusion Medicine, Beijing, China
- Xiaomin Ying
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
4
Zhan Y, Liu J, Ou-Yang L. scMIC: A Deep Multi-Level Information Fusion Framework for Clustering Single-Cell Multi-Omics Data. IEEE J Biomed Health Inform 2023; 27:6121-6132. [PMID: 37725723 DOI: 10.1109/jbhi.2023.3317272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/21/2023]
Abstract
Cell type identification is a crucial step towards the study of cellular heterogeneity and biological processes. Advances in single-cell sequencing technology have enabled the development of a variety of clustering methods for cell type identification. However, most existing methods are designed for clustering single-omic data such as single-cell RNA-sequencing (scRNA-seq) data. The accumulation of single-cell multi-omics data provides a great opportunity to integrate different omics data for cell clustering, but also raises new computational challenges for existing methods. How to integrate multi-omics data and leverage their consensus and complementary information to improve the accuracy of cell clustering remains a challenge. In this study, we propose a new deep multi-level information fusion framework, named scMIC, for clustering single-cell multi-omics data. Our model can integrate the attribute information of cells and the potential structural relationships among cells at local and global levels, and reduce redundant information between different omics at the cell and feature levels, leading to more discriminative representations. Moreover, the proposed multiple collaborative supervised clustering strategy guides the learning process of the core encoding part by learning a high-confidence target distribution. This facilitates the interaction between the clustering part and the representation learning part, as well as the information exchange between omics, and ultimately yields more robust clustering results. Experiments on seven single-cell multi-omics datasets show the superiority of scMIC over existing state-of-the-art methods.
5
Zhan Y, Liu J, Wu M, Tan CSH, Li X, Ou-Yang L. A partially shared joint clustering framework for detecting protein complexes from multiple state-specific signed interaction networks. Comput Biol Med 2023; 159:106936. [PMID: 37105110 DOI: 10.1016/j.compbiomed.2023.106936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Revised: 03/27/2023] [Accepted: 04/13/2023] [Indexed: 04/29/2023]
Abstract
Detecting protein complexes is critical for studying cellular organizations and functions. The accumulation of protein-protein interaction (PPI) data enables the identification of protein complexes computationally. Although a great number of computational methods have been proposed to identify protein complexes from PPI networks, most of them ignore the signs of PPIs that reflect the ways proteins interact (activation or inhibition). As not all PPIs imply co-complex relationships, taking into account the signs of PPIs can benefit the identification of protein complexes. Moreover, PPI networks are not static, but vary with changes in cell states or environments. However, existing methods are primarily designed for single-network clustering, and rarely consider joint clustering of multiple PPI networks. In this study, we propose a novel partially shared signed network clustering (PS-SNC) model for jointly identifying protein complexes from multiple state-specific signed PPI networks. PS-SNC not only considers the signs of PPIs, but also identifies the common and unique protein complexes in different states. Experimental results on synthetic and real datasets show that our PS-SNC model can achieve better performance than other state-of-the-art protein complex detection methods. Extensive analysis on real datasets demonstrates the effectiveness of PS-SNC in revealing novel insights about the underlying patterns of different cell lines.
Affiliation(s)
- Youlin Zhan
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Jiahan Liu
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Min Wu
  - Institute for Infocomm Research (I2R), Agency of Science, Technology, and Research (A*STAR), 138632, Singapore
- Chris Soon Heng Tan
  - Department of Chemistry, College of Science, Southern University of Science and Technology, Shenzhen, 518055, China
- Xiaoli Li
  - Institute for Infocomm Research (I2R), Agency of Science, Technology, and Research (A*STAR), 138632, Singapore
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 518129, China
6
Li B, Jin K, Ou-Yang L, Yan H, Zhang XF. scTSSR2: Imputing Dropout Events for Single-Cell RNA Sequencing Using Fast Two-Side Self-Representation. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1445-1456. [PMID: 35476574 DOI: 10.1109/tcbb.2022.3170587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) has opened a new era by revealing gene expression patterns at single-cell resolution, enabling studies of the heterogeneity and transcriptome dynamics of complex tissues. However, the large proportion of dropout events may hinder downstream analyses, so imputing dropout events is an important step in analyzing scRNA-seq data. We develop scTSSR2, a new imputation method that combines matrix decomposition with the previously developed two-side sparse self-representation, yielding a fast two-side sparse self-representation for imputing dropout events in scRNA-seq data. Comparisons among different imputation methods show that scTSSR2 has distinct advantages in computational speed and memory usage. Comprehensive downstream experiments show that scTSSR2 outperforms state-of-the-art imputation methods. A user-friendly R package, scTSSR2, is provided to denoise scRNA-seq data and improve data quality.
7
Lin Z, Ou-Yang L. Inferring gene regulatory networks from single-cell gene expression data via deep multi-view contrastive learning. Brief Bioinform 2023; 24:6965907. [PMID: 36585783 DOI: 10.1093/bib/bbac586] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/10/2022] [Revised: 11/28/2022] [Accepted: 11/29/2022] [Indexed: 01/01/2023]
Abstract
The inference of gene regulatory networks (GRNs) is of great importance for understanding the complex regulatory mechanisms within cells. The emergence of single-cell RNA-sequencing (scRNA-seq) technologies enables the measurement of gene expression levels in individual cells, which promotes the reconstruction of GRNs at single-cell resolution. However, existing network inference methods are mainly designed for data collected from a single data source and therefore ignore the information provided by multiple related data sources. In this paper, we propose a deep multi-view contrastive learning (DeepMCL) model to infer GRNs from scRNA-seq data collected from multiple data sources or time points. We first represent each gene pair as a set of histogram images, and then introduce a deep Siamese convolutional neural network with contrastive loss to learn a low-dimensional embedding for each gene pair. Moreover, an attention mechanism is introduced to integrate the embeddings extracted from different data sources and different neighbor gene pairs. Experimental results on synthetic and real-world datasets validate our contrastive learning and attention mechanisms and demonstrate the effectiveness of our model in integrating multiple data sources for GRN inference.
Affiliation(s)
- Zerun Lin
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
8
Chen Y, Zhang XF, Ou-Yang L. Inferring cancer common and specific gene networks via multi-layer joint graphical model. Comput Struct Biotechnol J 2023; 21:974-990. [PMID: 36733706 PMCID: PMC9873583 DOI: 10.1016/j.csbj.2023.01.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Revised: 01/08/2023] [Accepted: 01/14/2023] [Indexed: 01/19/2023]
Abstract
Cancer is a complex disease caused primarily by genetic variants. Reconstructing gene networks within tumors is essential for understanding the functional regulatory mechanisms of carcinogenesis. Advances in high-throughput sequencing technologies have provided tremendous opportunities for inferring gene networks via computational approaches. However, due to the heterogeneity of the same cancer type and the similarities between different cancer types, it remains a challenge to systematically investigate the commonalities and specificities between gene networks of different cancer types, which is a crucial step towards precision cancer diagnosis and treatment. In this study, we propose a new sparse regularized multi-layer decomposition graphical model to jointly estimate the gene networks of multiple cancer types. Our model can handle various types of gene expression data and decomposes each cancer-type-specific network into three components, i.e., globally shared, partially shared and cancer-type-unique components. By identifying the globally and partially shared gene network components, our model can explore the heterogeneous similarities between different cancer types, and our identified cancer-type-unique components can help to reveal the regulatory mechanisms unique to each cancer type. Extensive experiments on synthetic data illustrate the effectiveness of our model in joint estimation of multiple gene networks. We also apply our model to two real data sets to infer the gene networks of multiple cancer subtypes or cell lines. By analyzing our estimated globally shared, partially shared, and cancer-type-unique components, we identified a number of important genes associated with common and specific regulatory mechanisms across different cancer types.
Affiliation(s)
- Yuanxiao Chen
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Le Ou-Yang (corresponding author)
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen, China
9
Wu W, Chen Y, Wang R, Ou-Yang L. Self-representative kernel concept factorization. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
10
Wang MG, Ou-Yang L, Yan H, Zhang XF. Inferring Gene Co-Expression Networks by Incorporating Prior Protein-Protein Interaction Networks. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2894-2906. [PMID: 34383650 DOI: 10.1109/tcbb.2021.3103407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Inferring gene co-expression networks from high-throughput gene expression data is an important task in bioinformatics. Many gene networks exhibit modular structures. Although several Gaussian graphical model-based methods have been developed to estimate gene co-expression networks by incorporating a modular structural prior, none of them takes into account the modular structures captured by prior networks (e.g., protein interaction networks). In this study, we propose a novel prior network-dependent gene network inference (pGNI) method to estimate gene co-expression networks by integrating gene expression data and prior protein interaction network data. The underlying modular structure is learned from both sets of data. Through simulation studies, we demonstrate the feasibility and effectiveness of our method. We also apply our method to two real datasets. The modular structures in the networks estimated by our method are biologically significant.
11
Zou G, Lin Y, Han T, Ou-Yang L. DEMOC: a deep embedded multi-omics learning approach for clustering single-cell CITE-seq data. Brief Bioinform 2022; 23:6679449. [PMID: 36047285 DOI: 10.1093/bib/bbac347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2022] [Revised: 07/04/2022] [Accepted: 07/26/2022] [Indexed: 11/13/2022]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technologies have provided an unprecedented opportunity for cell-type identification. As clustering is an effective strategy for cell-type identification, various computational approaches have been proposed for clustering scRNA-seq data. Recently, with the emergence of cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq), the cell surface expression of specific proteins and the RNA expression of the same cell can be captured, which provides more comprehensive information for cell analysis. However, existing single-cell clustering algorithms are mainly designed for single-omic data, and have difficulty handling multi-omics data with diverse characteristics efficiently. In this study, we propose a novel deep embedded multi-omics clustering with collaborative training (DEMOC) model to perform joint clustering on CITE-seq data. Our model takes into account the characteristics of transcriptomic and proteomic data, and makes effective use of the consistent and complementary information provided by different data sources. Experimental results on two real CITE-seq datasets demonstrate that our DEMOC model not only outperforms state-of-the-art single-omic clustering methods, but also achieves better and more stable performance than existing multi-omics clustering methods. We also apply our model to three scRNA-seq datasets to assess its performance in rare cell-type identification, novel cell-subtype detection and cellular heterogeneity analysis. Experimental results illustrate the effectiveness of our model in discovering the underlying patterns of data.
Affiliation(s)
- Guanhua Zou
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Yilong Lin
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Tianyang Han
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 518129, China
12
Zhu Y, Zhang H, Yang Y, Zhang C, Ou-Yang L, Bai L, Deng M, Yi M, Liu S, Wang C. Discovery of pan-cancer related genes via integrative network analysis. Brief Funct Genomics 2022; 21:325-338. [PMID: 35760070 DOI: 10.1093/bfgp/elac012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/01/2022] [Revised: 05/14/2022] [Accepted: 05/25/2022] [Indexed: 01/02/2023]
Abstract
Identification of cancer-related genes is helpful for understanding the pathogenesis of cancer, developing targeted drugs and creating new diagnostic and therapeutic methods. Given the complexity of biological laboratory methods and the increasing availability of high-throughput data, many network-based methods have been proposed to identify cancer-related genes from a global perspective. Some studies have focused on tissue-specific cancer networks. However, cancers from different tissues may share common features, and those methods may ignore the differences and similarities across cancers when building their models. In this work, in order to make full use of the global information of the network, we first establish a pan-cancer network via a differential network algorithm; the network captures heterogeneous data not only across multiple cancer types but also between tumor samples and normal samples. Second, the node representation vectors are learned by network embedding. In contrast to ranking analysis-based methods, with the help of integrative network analysis, we transform the cancer-related gene identification problem into a binary classification problem. The final results are obtained via ensemble classification. We further applied these methods to the most commonly used gene expression data involving six tissue-specific cancer types. As a result, an integrative pan-cancer network and several biologically meaningful results were obtained. For example, nine genes were ultimately identified as potential pan-cancer-related genes. Most of these genes have been reported in published studies, which shows our method's potential for identifying driver gene candidates for further experimental verification.
Affiliation(s)
- Yuan Zhu
- School of Automation, China University of Geosciences, Lumo Road, 430074, Wuhan, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Lumo Road, 430074, Wuhan, China.,Engineering Research Center of Intelligent Technology for Geo-Exploration, Lumo Road, 430074, Wuhan, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence(Fudan University), Ministry of Education, Handan Road, 200433, Shanghai, China
| | - Houwang Zhang
- Electrical Engineering, City University of Hong Kong, Kowloon, 999077, Hong Kong, China
| | - Yuanhang Yang
- School of Mathematics and Physics, China University of Geosciences, Lumo Road, 430074, Wuhan, China
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, The University of Southern Mississippi, Hattiesburg, USA
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Nanhai Avenue, 518060, Shenzhen, China
| | - Litai Bai
- School of Automation, China University of Geosciences, Lumo Road, 430074, Wuhan, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Lumo Road, 430074, Wuhan, China.,Engineering Research Center of Intelligent Technology for Geo-Exploration, Lumo Road, 430074, Wuhan, China
| | - Minghua Deng
- School of Mathematical Sciences, Peking University, No.5 Yiheyuan Road, 100871, Beijing, China
| | - Ming Yi
- School of Mathematics and Physics, China University of Geosciences, Lumo Road, 430074, Wuhan, China
| | - Song Liu
- School of Automation, China University of Geosciences, Lumo Road, 430074, Wuhan, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Lumo Road, 430074, Wuhan, China.,Engineering Research Center of Intelligent Technology for Geo-Exploration, Lumo Road, 430074, Wuhan, China
| | - Chao Wang
- Hepatic Surgery Center, Institute of Hepato-Pancreato-Biliary Surgery, Department of Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Jiefang Avenue, 430030, Wuhan, China
| |
|
13
|
Ou-Yang L, Zhang XF, Zhang J, Chen J, Wu M. Editorial: Machine Learning and Mathematical Models for Single-Cell Data Analysis. Front Genet 2022; 13:911999. [PMID: 35719405 PMCID: PMC9204245 DOI: 10.3389/fgene.2022.911999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Accepted: 05/19/2022] [Indexed: 11/29/2022] Open
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- *Correspondence: Le Ou-Yang,
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics and Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
| | - Jiajun Zhang
- Guangdong Province Key Laboratory of Computational Science, School of Mathematics, Sun Yat-sen University, Guangzhou, China
| | - Jin Chen
- Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, United States
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, Singapore, Singapore
| |
|
14
|
Tan YT, Ou-Yang L, Jiang X, Yan H, Zhang XF. Identifying Gene Network Rewiring Based on Partial Correlation. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:513-521. [PMID: 32750866 DOI: 10.1109/tcbb.2020.3002906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
It is an important task to learn how gene regulatory networks change under different conditions. Several Gaussian graphical model-based methods have been proposed to deal with this task by inferring differential networks from gene expression data. However, most existing methods define the differential networks as the difference of precision matrices, which may include false differential edges caused by changes in conditional variances. In addition, prior information about the condition-specific networks and the differential networks can be obtained from other domains, and it is useful to incorporate this prior information into differential network analysis. In this study, we propose a new differential network analysis method to address the above challenges. Instead of using the precision matrices, we define the differential networks as the difference of partial correlations, which can exclude the spurious differential edges due to changes in conditional variances. Furthermore, prior information from multiple hypothesis testing is incorporated using a weighted fused penalty. Simulation studies show that our method outperforms the competing methods. We also apply our method to identify the differential network between luminal A and basal-like subtypes of breast cancers and the differential network between acute myeloid leukemia tumors and normal samples. The hub genes in the differential networks identified by our method carry out important biological functions.
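The contrast between differencing precision matrices and differencing partial correlations can be illustrated with a minimal sketch. It relies only on the standard identity rho_ij = -theta_ij / sqrt(theta_ii * theta_jj); the function names and the toy 2x2 precision matrices are assumptions made for illustration and are not taken from the paper, whose estimator also involves a weighted fused penalty that is not shown here.

```python
import math


def partial_correlations(theta):
    """Convert a precision matrix (list of lists) into partial correlations."""
    p = len(theta)
    rho = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                rho[i][j] = -theta[i][j] / math.sqrt(theta[i][i] * theta[j][j])
    return rho


def matrix_difference(a, b):
    """Entry-wise difference a - b of two equally sized matrices."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Toy precision matrices (illustrative values only).
theta_a = [[2.0, -1.0], [-1.0, 2.0]]
# Condition B doubles every precision entry, i.e. each conditional
# variance 1 / theta_ii is halved while the dependence pattern is unchanged.
theta_b = [[4.0, -2.0], [-2.0, 4.0]]

precision_diff = matrix_difference(theta_a, theta_b)
partial_corr_diff = matrix_difference(
    partial_correlations(theta_a), partial_correlations(theta_b)
)

print("precision difference (off-diagonal):", precision_diff[0][1])
print("partial correlation difference (off-diagonal):", partial_corr_diff[0][1])
```

In this toy case the off-diagonal precision entry differs between the two conditions even though the partial correlation is identical, which is the kind of spurious edge the abstract attributes to changes in conditional variances.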
|
15
|
Ou-Yang L, Lu F, Zhang ZC, Wu M. Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey. Brief Bioinform 2021; 23:6447434. [PMID: 34864871 DOI: 10.1093/bib/bbab479] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2021] [Revised: 09/25/2021] [Accepted: 10/18/2021] [Indexed: 02/02/2023] Open
Abstract
Advances in high-throughput experimental technologies promote the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analyses, which can facilitate various downstream studies and yield insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review of such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance of biomedical link prediction and scRNA-seq data imputation.
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China.,Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen,518172, China
| | - Fan Lu
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Zi-Chao Zhang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433, China
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
| |
|
16
|
Sun Y, Ou-Yang L, Dai DQ. WMLRR: A Weighted Multi-View Low Rank Representation to Identify Cancer Subtypes From Multiple Types of Omics Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2891-2897. [PMID: 33656995 DOI: 10.1109/tcbb.2021.3063284] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The identification of cancer subtypes is of great importance for understanding the heterogeneity of tumors and providing patients with more accurate diagnoses and treatments. However, it is still a challenge to effectively integrate multiple omics data to establish cancer subtypes. In this paper, we propose an unsupervised integration method, named weighted multi-view low rank representation (WMLRR), to identify cancer subtypes from multiple types of omics data. Given a group of patients described by multiple omics data matrices, we first learn a unified affinity matrix which encodes the similarities among patients by exploring the sparsity-consistent low-rank representations from the joint decompositions of multiple omics data matrices. Unlike existing subtype identification methods that treat each omics data matrix equally, we assign a weight to each omics data matrix and learn these weights automatically through the optimization process. Finally, we apply spectral clustering on the learned affinity matrix to identify cancer subtypes. Experiment results show that the survival times between our identified cancer subtypes are significantly different, and our predicted survivals are more accurate than other state-of-the-art methods. In addition, some clinical analyses of the diseases also demonstrate the effectiveness of our method in identifying molecular subtypes with biological significance and clinical relevance.
|
17
|
Lu F, Lin Y, Yuan C, Zhang XF, Ou-Yang L. EnTSSR: A Weighted Ensemble Learning Method to Impute Single-Cell RNA Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:2781-2787. [PMID: 34495837 DOI: 10.1109/tcbb.2021.3110850] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technologies have provided us with unprecedented opportunities to characterize cellular states and investigate the mechanisms of complex diseases. Due to technical issues such as dropout events, scRNA-seq data contain an excess of false zero counts, which has a substantial impact on downstream analyses. Although several computational approaches have been proposed to impute dropout events in scRNA-seq data, there is no strong consensus on which is the best approach. In this study, we propose a novel weighted ensemble learning method, named EnTSSR, to impute dropout events in scRNA-seq data. By using a multi-view two-side sparse self-representation framework, our model can exploit the consensus similarities between genes and between cells based on the imputed results of various imputation methods. Moreover, we introduce a weighted ensemble strategy to leverage the information captured by various imputation methods effectively. Down-sampling experiments, clustering analysis, differential expression analysis and cell trajectory inference are carried out to evaluate the performance of our proposed model. Experiment results demonstrate that our EnTSSR can effectively recover the true expression pattern of scRNA-seq data.
|
18
|
Li HS, Ou-Yang L, Zhu Y, Yan H, Zhang XF. scDEA: differential expression analysis in single-cell RNA-sequencing data via ensemble learning. Brief Bioinform 2021; 23:6375516. [PMID: 34571530 DOI: 10.1093/bib/bbab402] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2021] [Revised: 08/22/2021] [Accepted: 09/02/2021] [Indexed: 12/13/2022] Open
Abstract
The identification of differentially expressed genes between different cell groups is a crucial step in analyzing single-cell RNA-sequencing (scRNA-seq) data. Even though various differential expression analysis methods for scRNA-seq data have recently been proposed based on different model assumptions and strategies, the differentially expressed genes they identify are quite different from each other, and their performance depends on the underlying data structures. In this paper, we propose a new ensemble learning-based differential expression analysis method, scDEA, to produce a more stable and accurate result. scDEA integrates the P-values obtained from 12 individual differential expression analysis methods for each gene using a P-value combination method. Comprehensive experiments show that scDEA outperforms the state-of-the-art individual methods with different experimental settings and evaluation metrics. We expect that scDEA will serve a wide range of users, including biologists, bioinformaticians and data scientists, who need to detect differentially expressed genes in scRNA-seq data.
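The idea of combining per-gene P-values from several methods can be sketched as follows. The abstract does not name the combination rule scDEA uses, so this example substitutes Fisher's method purely as an assumption; Fisher's method also presumes independent tests, which P-values computed by different tools on the same data generally do not satisfy. The function name and the example values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine P-values with Fisher's method (assumes independent tests).

    The statistic X = -2 * sum(ln p) is chi-square distributed with 2k
    degrees of freedom under the null. For even degrees of freedom the
    survival function is exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one P-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_stat) * total


# Hypothetical P-values reported for one gene by three different methods.
gene_p_values = [0.04, 0.20, 0.01]
print(fisher_combine(gene_p_values))
```

With a single input the function returns that P-value unchanged, which is a quick sanity check on the closed-form survival function.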
Affiliation(s)
- Hui-Sheng Li
- School of Mathematics and Statistics, Central China Normal University, China
| | - Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, China
| | - Yuan Zhu
- School of Automation, China University of Geoscience (Wuhan), China
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics, Central China Normal University, China
| |
|
19
|
Tu JJ, Ou-Yang L, Zhu Y, Yan H, Qin H, Zhang XF. Differential network analysis by simultaneously considering changes in gene interactions and gene expression. Bioinformatics 2021; 37:4414-4423. [PMID: 34245246 DOI: 10.1093/bioinformatics/btab502] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2021] [Revised: 06/13/2021] [Accepted: 07/05/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Differential network analysis is an important tool to investigate the rewiring of gene interactions under different conditions. Several computational methods have been developed to estimate differential networks from gene expression data, but most of them do not consider that gene network rewiring may be driven by the differential expression of individual genes. New differential network analysis methods that simultaneously take account of the changes in gene interactions and changes in expression levels are needed. RESULTS In this paper, we propose a differential network analysis method that considers the differential expression of individual genes when identifying differential edges. First, two hypothesis test statistics are used to quantify changes in partial correlations between gene pairs and changes in expression levels for individual genes. Then, an optimization framework is proposed to combine the two test statistics so that the resulting differential network has a hierarchical property, where a differential edge can be considered only if at least one of the two involved genes is differentially expressed. Simulation results indicate that our method outperforms current state-of-the-art methods. We apply our method to identify the differential networks between the luminal A and basal-like subtypes of breast cancer and those between acute myeloid leukemia and normal samples. Hub nodes in the differential networks estimated by our method, including both differentially and non-differentially expressed genes, have important biological functions. AVAILABILITY The source code is available at https://github.com/Zhangxf-ccnu/chNet. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
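The hierarchical property described above (a differential edge is admissible only if at least one of its two genes is differentially expressed) can be illustrated with a simple filter. This is a thresholding sketch under stated assumptions, not the optimization framework of the paper: the function name, the cutoffs, the gene labels and the statistic values are all hypothetical.

```python
def hierarchical_differential_edges(edge_stats, gene_stats, edge_cutoff, gene_cutoff):
    """Keep edges whose statistic is large and that touch a differentially expressed gene.

    edge_stats: dict mapping (gene_u, gene_v) to a test statistic for the
        change in partial correlation between the two genes.
    gene_stats: dict mapping gene to a test statistic for the change in
        its expression level.
    """
    de_genes = {g for g, s in gene_stats.items() if abs(s) > gene_cutoff}
    kept = [
        (u, v)
        for (u, v), s in edge_stats.items()
        if abs(s) > edge_cutoff and (u in de_genes or v in de_genes)
    ]
    return sorted(kept)


# Hypothetical statistics for four genes and three candidate edges.
gene_stats = {"G1": 3.1, "G2": 0.4, "G3": 0.2, "G4": -2.8}
edge_stats = {("G1", "G2"): 2.9, ("G2", "G3"): 3.5, ("G3", "G4"): 0.5}

print(hierarchical_differential_edges(edge_stats, gene_stats, 2.0, 2.0))
```

Here the edge between G2 and G3 is dropped despite its large statistic because neither gene is differentially expressed, and the edge between G3 and G4 is dropped because its own statistic is small.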
Affiliation(s)
- Jia-Juan Tu
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
| | - Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, 430074, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, China University of Geosciences, Wuhan, 430074, China
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
| | - Hong Qin
- Department of Statistics, Zhongnan University of Economics and Law, Wuhan, 430073, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
| |
|
20
|
Xu T, Ou-Yang L, Yan H, Zhang XF. Time-Varying Differential Network Analysis for Revealing Network Rewiring over Cancer Progression. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1632-1642. [PMID: 31647444 DOI: 10.1109/tcbb.2019.2949039] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
To reveal how gene regulatory networks change over cancer development, multiple time-varying differential networks between adjacent cancer stages should be estimated simultaneously. Since the network rewiring may be driven by the perturbation of certain individual genes, there may be some hub nodes shared by these differential networks. Although several methods have been developed to estimate differential networks from gene expression data, most of them are designed for estimating a single differential network, which neglects the similarities between different differential networks. In this article, we propose a new Gaussian graphical model-based method to jointly estimate multiple time-varying differential networks for identifying network rewiring over cancer development. A D-trace loss is used to determine the differential networks. A tree-structured group Lasso penalty is designed to identify the common hub nodes shared by different differential networks and the specific hub nodes unique to individual differential networks. Simulation experiment results demonstrate that our method outperforms other state-of-the-art techniques in most cases. We also apply our method to The Cancer Genome Atlas data to explore gene network rewiring over different breast cancer stages. Hub nodes in the estimated differential networks rediscover well-known genes associated with the development and progression of breast cancer.
|
21
|
Ou-Yang L, Cai D, Zhang XF, Yan H. WDNE: an integrative graphical model for inferring differential networks from multi-platform gene expression data with missing values. Brief Bioinform 2021; 22:6272792. [PMID: 33975339 DOI: 10.1093/bib/bbab086] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2021] [Revised: 02/14/2021] [Accepted: 02/23/2021] [Indexed: 11/14/2022] Open
Abstract
The mechanisms controlling biological processes, such as the development of disease or cell differentiation, can be investigated by examining changes in the networks of gene dependencies between states in the process. High-throughput experimental methods, like microarray and RNA sequencing, have been widely used to gather gene expression data, which paves the way to infer gene dependencies based on computational methods. However, most differential network analysis methods are designed to deal with fully observed data, whereas missing values, such as the dropout events in single-cell RNA-sequencing data, are frequent. New methods are needed to take account of these missing values. Moreover, since the changes of gene dependencies may be driven by certain perturbed genes, considering the changes in gene expression levels may promote the identification of gene network rewiring. In this study, a novel weighted differential network estimation (WDNE) model is proposed to handle multi-platform gene expression data with missing values and take account of changes in gene expression levels. Simulation studies demonstrate that WDNE outperforms state-of-the-art differential network estimation methods. When WDNE is applied to infer differential gene networks associated with drug resistance in ovarian tumors, cell differentiation and breast tumor heterogeneity, the hub genes in the estimated differential gene networks can provide important insights into the underlying mechanisms. Furthermore, a Matlab toolbox, differential network analysis toolbox, was developed to implement the WDNE model and visualize the estimated differential networks.
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Dehan Cai
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, 999077, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, 999077, China
| |
|
22
|
Zhang XF, Ou-Yang L, Yan T, Hu XT, Yan H. A Joint Graphical Model for Inferring Gene Networks Across Multiple Subpopulations and Data Types. IEEE Trans Cybern 2021; 51:1043-1055. [PMID: 31794418 DOI: 10.1109/tcyb.2019.2952711] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Reconstructing gene networks from gene expression data is a long-standing challenge. In most applications, the observations can be divided into several distinct but related subpopulations and the gene expression measurements can be collected from multiple data types. Most existing methods are designed to estimate a single gene network from a single dataset. These methods may be suboptimal since they do not exploit the similarities and differences among different subpopulations and data types. In this article, we propose a joint graphical model to estimate multiple gene networks simultaneously. Our model decomposes each subpopulation-specific gene network as a sum of common and unique components and imposes a group lasso penalty on gene networks corresponding to different data types. The gene network variations across subpopulations can be learned automatically by the decompositions of networks, and the similarities and differences among data types can be captured by the group lasso penalty. The simulation studies demonstrate that our method outperforms the state-of-the-art methods. We also apply our method to The Cancer Genome Atlas breast cancer datasets to reconstruct subtype-specific gene networks. Hub nodes in the estimated subnetworks unique to individual cancer subtypes rediscover well-known genes associated with breast cancer subtypes and provide interesting predictions.
|
23
|
Ata SK, Wu M, Fang Y, Ou-Yang L, Kwoh CK, Li XL. Recent advances in network-based methods for disease gene prediction. Brief Bioinform 2020; 22:6023077. [PMID: 33276376 DOI: 10.1093/bib/bbaa303] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2020] [Revised: 09/29/2020] [Accepted: 10/10/2020] [Indexed: 01/28/2023] Open
Abstract
Disease-gene association through genome-wide association study (GWAS) is an arduous task for researchers. Investigating single nucleotide polymorphisms that correlate with specific diseases needs statistical analysis of associations. Considering the huge number of possible mutations, in addition to its high cost, another important drawback of GWAS analysis is the large number of false positives. Thus, researchers search for more evidence to cross-check their results through different sources. To provide the researchers with alternative and complementary low-cost disease-gene association evidence, computational approaches come into play. Since molecular networks are able to capture complex interplay among molecules in diseases, they become one of the most extensively used data for disease-gene association prediction. In this survey, we aim to provide a comprehensive and up-to-date review of network-based methods for disease gene prediction. We also conduct an empirical analysis on 14 state-of-the-art methods. To summarize, we first elucidate the task definition for disease gene prediction. Secondly, we categorize existing network-based efforts into network diffusion methods, traditional machine learning methods with handcrafted graph features and graph representation learning methods. Thirdly, an empirical analysis is conducted to evaluate the performance of the selected methods across seven diseases. We also provide distinguishing findings about the discussed methods based on our empirical analysis. Finally, we highlight potential research directions for future studies on disease gene prediction.
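As an illustration of the network diffusion category mentioned above, the following is a minimal sketch of random walk with restart, a commonly used diffusion scheme for ranking candidate genes by proximity to known disease genes. The abstract does not say which diffusion methods the survey evaluates, so the choice of algorithm, the function name, the restart probability and the toy network are all assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Score nodes of an undirected graph by proximity to the seed nodes.

    adjacency: dict mapping each node to a list of its neighbours; every
        node must have at least one neighbour.
    seeds: collection of known disease genes where the walk restarts.
    """
    nodes = sorted(adjacency)
    if any(len(adjacency[n]) == 0 for n in nodes):
        raise ValueError("every node needs at least one neighbour")
    seed_set = set(seeds)
    if not seed_set or not seed_set.issubset(nodes):
        raise ValueError("seeds must be a non-empty subset of the nodes")
    p0 = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds; the rest spreads to neighbours.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path network A - B - C - D with a single hypothetical seed gene A.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds=["A"])
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 4))
```

On this toy path the scores fall off with distance from the seed, so candidates closer to known disease genes are ranked higher.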
Affiliation(s)
- Sezin Kircali Ata
- School of Computer Science and Engineering, Nanyang Technological University (NTU)
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, Singapore
| | - Yuan Fang
- School of Information Systems, Singapore Management University, Singapore
| | - Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
| | | | - Xiao-Li Li
- Department head and principal scientist at I2R, A*STAR, Singapore
| |
|
24
|
Zhu Y, Zhang DX, Zhang XF, Yi M, Ou-Yang L, Wu M. EC-PGMGR: Ensemble Clustering Based on Probability Graphical Model With Graph Regularization for Single-Cell RNA-seq Data. Front Genet 2020; 11:572242. [PMID: 33329710 PMCID: PMC7673820 DOI: 10.3389/fgene.2020.572242] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2020] [Accepted: 09/30/2020] [Indexed: 11/21/2022] Open
Abstract
Advances in technology have made it convenient to obtain large amounts of single-cell RNA-sequencing (scRNA-seq) data. Since clustering is a very important step in identifying or defining cellular phenotypes, many clustering approaches have recently been developed for these applications. These methods can be roughly divided into normal clustering methods and integrated (ensemble) clustering methods, which combine more than two normal clustering methods with the aim of obtaining more informative results. To distinguish it from integrated clustering, a normal clustering method is often called an individual or base clustering method. Note that many individual clustering methods are designed to capture only one aspect of the data, and their results depend on initial parameter settings such as the cluster number and distance metric. Although integrative clustering may be more accurate than individual clustering, its results depend on the base clustering results, and integrated systems are often not self-regulating. Therefore, designing a robust unsupervised clustering method remains a challenge. To tackle these limitations, we propose a novel Ensemble Clustering algorithm based on Probability Graphical Model with Graph Regularization, called EC-PGMGR for short. On the one hand, we use parameter control in the Probability Graphical Model (PGM) to automatically determine the cluster number without prior knowledge. On the other hand, we add a regularization term to reduce the effect of weak base clustering results. In particular, the results collected from the base clustering methods are combined using self-regulating weights learned through a pre-learning process, which efficiently enhances the effect of active clustering methods while weakening the effect of inactive ones.
Experiments are carried out on 7 data sets generated by different platforms, with the number of single cells ranging from 822 to 5,132. Results show that EC-PGMGR performs better than 4 alternative individual clustering methods and 2 ensemble methods in terms of accuracy, measured by the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), as well as robustness and effectiveness. EC-PGMGR also provides an effective way to integrate different clustering results into more accurate and reliable results for further biological analysis, and it may offer new insights for other applications of clustering.
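For readers unfamiliar with the Adjusted Rand Index used as an accuracy measure above, here is a minimal standard-library sketch of the usual pair-counting formula. It is a generic illustration of the metric, not code from EC-PGMGR; the function name and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells grouped together in both, in the first, and in the second labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put all cells in one cluster
    return (together_both - expected) / (maximum - expected)


# Toy example: cluster names differ but the grouping is identical.
reference = ["T", "T", "B", "B"]
predicted = [1, 1, 0, 0]
print(adjusted_rand_index(reference, predicted))
```

Because the index only counts co-clustered pairs, renaming clusters does not change the score, which is why it suits comparisons between base and ensemble clusterings whose labels are arbitrary.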
Affiliation(s)
- Yuan Zhu
- School of Automation, China University of Geosciences, Wuhan, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
| | - De-Xin Zhang
- School of Automation, China University of Geosciences, Wuhan, China.,Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
| | - Xiao-Fei Zhang
- Department of Statistics, School of Mathematics and Statistics, Central China Normal University, Wuhan, China
| | - Ming Yi
- School of Mathematics and Physics, China University of Geosciences, Wuhan, China
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen, China
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
| |
|
25
|
Ou-Yang L, Zhang XF, Hu X, Yan H. Differential Network Analysis via Weighted Fused Conditional Gaussian Graphical Model. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:2162-2169. [PMID: 31247559 DOI: 10.1109/tcbb.2019.2924418] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The development and prognosis of complex diseases usually involve changes in regulatory relationships among biomolecules. Understanding how the regulatory relationships change with genetic alterations can help to reveal the underlying biological mechanisms of complex diseases. Although several models have been proposed to estimate the differential network between two different states, they are not suitable for situations where the molecules of interest are affected by other covariates. Nor can they make use of prior information that provides insights about the structures of biomolecular networks. In this study, we introduce a novel weighted fused conditional Gaussian graphical model to jointly estimate two state-specific biomolecular regulatory networks and the difference between them. Unlike previous differential network estimation methods, our model can take into account the related covariates and the prior network information when inferring differential networks. The effectiveness of our proposed model is first evaluated based on simulation studies. Experiment results demonstrate that our model outperforms other state-of-the-art differential network estimation models in all cases. We then apply our model to identify the differential gene network between two subtypes of glioblastoma based on gene expression and miRNA expression data. Our model is able to discover known mechanisms of glioblastoma and provide interesting predictions.
|
26
|
Tu JJ, Ou-Yang L, Yan H, Zhang XF, Qin H. Joint reconstruction of multiple gene networks by simultaneously capturing inter-tumor and intra-tumor heterogeneity. Bioinformatics 2020; 36:2755-2762. [PMID: 31971577 DOI: 10.1093/bioinformatics/btaa014] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/08/2019] [Revised: 12/22/2019] [Accepted: 01/18/2020] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION Reconstruction of cancer gene networks from gene expression data is important for understanding the mechanisms underlying human cancer. Due to heterogeneity, the tumor tissue samples for a single cancer type can be divided into multiple distinct subtypes (inter-tumor heterogeneity) and are composed of non-cancerous and cancerous cells (intra-tumor heterogeneity). If tumor heterogeneity is ignored when inferring gene networks, the edges specific to individual cancer subtypes and cell types cannot be characterized. However, most existing network reconstruction methods do not simultaneously take inter-tumor and intra-tumor heterogeneity into account. RESULTS In this article, we propose a new Gaussian graphical model-based method for jointly estimating multiple cancer gene networks by simultaneously capturing inter-tumor and intra-tumor heterogeneity. Given gene expression data of heterogeneous samples for different cancer subtypes, a non-cancerous network shared across different cancer subtypes and multiple subtype-specific cancerous networks are estimated jointly. Tumor heterogeneity can be revealed by the difference in the estimated networks. The performance of our method is first evaluated using simulated data, and the results indicate that our method outperforms other state-of-the-art methods. We also apply our method to The Cancer Genome Atlas breast cancer data to reconstruct non-cancerous and subtype-specific cancerous gene networks. Hub nodes in the networks estimated by our method perform important biological functions associated with breast cancer development and subtype classification. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/Zhangxf-ccnu/NETI2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jia-Juan Tu
- Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
| | - Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong 999077, China
| | - Xiao-Fei Zhang
- Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China
| | - Hong Qin
- Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China.,Department of Statistics, Zhongnan University of Economics and Law, Wuhan 430073, China
| |
|
27
|
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) technologies brings tremendous opportunities for quantitative research and analyses at the cellular level. In particular, as a crucial task of scRNA-seq analysis, single cell clustering shines a light on natural groupings of cells to give new insights into the biological mechanisms and disease studies. However, it remains a challenge to identify cell clusters from lots of cell mixtures effectively and accurately. In this paper, we propose a novel adaptive joint clustering framework, named the low-rank self-representation K-means method (LRSK), to learn the data representation matrix and cluster indicator matrix jointly from scRNA-seq data. Specifically, instead of calculating the similarities among cells from the original data, we seek a low-rank representation of the original data to better reflect the underlying relationships among cells. Moreover, an Augmented Lagrangian Multiplier (ALM) based optimization algorithm is adopted to solve this problem. Experimental results on various scRNA-seq datasets and case studies demonstrate that our method performs better than other state-of-the-art single cell clustering algorithms. The analysis of unlabeled large single-cell liver cancer sequencing data further shows that our prediction results are more reasonable and interpretable.
Affiliation(s)
- Ye-Sen Sun
- Intelligent Data Center, School of Mathematics, Sun Yat-sen University, Guangzhou, China.
|
28
|
Wu N, Yin F, Ou-Yang L, Zhu Z, Xie W. Joint learning of multiple gene networks from single-cell gene expression data. Comput Struct Biotechnol J 2020; 18:2583-2595. [PMID: 33033579 PMCID: PMC7527714 DOI: 10.1016/j.csbj.2020.09.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2020] [Revised: 08/31/2020] [Accepted: 09/01/2020] [Indexed: 11/24/2022] Open
Abstract
Inferring gene networks from gene expression data is important for understanding functional organizations within cells. With the accumulation of single-cell RNA sequencing (scRNA-seq) data, it is possible to infer gene networks at the single-cell level. However, due to the characteristics of scRNA-seq data, such as cellular heterogeneity and high sparsity caused by dropout events, traditional network inference methods may not be suitable for scRNA-seq data. In this study, we introduce a novel joint Gaussian copula graphical model (JGCGM) to jointly estimate multiple gene networks for multiple cell subgroups from scRNA-seq data. Our model can deal with non-Gaussian data with missing values, and identify the common and unique network structures of multiple cell subgroups, which makes it suitable for scRNA-seq data. Extensive experiments on synthetic data demonstrate that our proposed model outperforms other state-of-the-art network inference models. We apply our model to real scRNA-seq data sets to infer gene networks of different cell subgroups. Hub genes in the estimated gene networks are found to be biologically significant.
Affiliation(s)
- Nuosi Wu
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
| | - Fu Yin
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
| | - Le Ou-Yang
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), Shenzhen University, Shenzhen, China
- Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China
| | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
| | - Weixin Xie
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
| |
|
29
|
Zhang XF, Ou-Yang L, Yang S, Zhao XM, Hu X, Yan H. EnImpute: imputing dropout events in single-cell RNA-sequencing data via ensemble learning. Bioinformatics 2020; 35:4827-4829. [PMID: 31125056 DOI: 10.1093/bioinformatics/btz435] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 01/21/2019] [Revised: 04/10/2019] [Accepted: 05/21/2019] [Indexed: 12/22/2022] Open
Abstract
SUMMARY Imputation of dropout events that may mislead downstream analyses is a key step in analyzing single-cell RNA-sequencing (scRNA-seq) data. We develop EnImpute, an R package that introduces an ensemble learning method for imputing dropout events in scRNA-seq data. EnImpute combines the results obtained from multiple imputation methods to generate a more accurate result. A Shiny application is developed to provide easier implementation and visualization. Experiment results show that EnImpute outperforms the individual state-of-the-art methods in almost all situations. EnImpute is useful for correcting the noisy scRNA-seq data before performing downstream analysis. AVAILABILITY AND IMPLEMENTATION The R package and Shiny application are available through Github at https://github.com/Zhangxf-ccnu/EnImpute. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiao-Fei Zhang
- Department of Statistics, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China
| | - Shuo Yang
- Department of Respiratory Medicine, Wuhan Number 1 Hospital, Wuhan 430022, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Xiaohua Hu
- Department of Computer Science, College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
| |
|
30
|
Zhang XF, Ou-Yang L, Yang S, Hu X, Yan H. DiffNetFDR: differential network analysis with false discovery rate control. Bioinformatics 2020; 35:3184-3186. [PMID: 30689728 DOI: 10.1093/bioinformatics/btz051] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/18/2018] [Revised: 01/10/2019] [Accepted: 01/20/2019] [Indexed: 11/13/2022] Open
Abstract
SUMMARY To identify biological network rewiring under different conditions, we develop a user-friendly R package, named DiffNetFDR, to implement two methods developed for testing the difference in different Gaussian graphical models. Compared to existing tools, our methods have the following features: (i) they are based on Gaussian graphical models which can capture the changes of conditional dependencies; (ii) they determine the tuning parameters in a data-driven manner; (iii) they take a multiple testing procedure to control the overall false discovery rate; and (iv) our approach defines the differential network based on partial correlation coefficients so that the spurious differential edges caused by the variants of conditional variances can be excluded. We also develop a Shiny application to provide easier analysis and visualization. Simulation studies are conducted to evaluate the performance of our methods. We also apply our methods to two real gene expression datasets. The effectiveness of our methods is validated by the biological significance of the identified differential networks. AVAILABILITY AND IMPLEMENTATION R package and Shiny app are available at https://github.com/Zhangxf-ccnu/DiffNetFDR. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiao-Fei Zhang
- Department of Statistics, School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
| | - Le Ou-Yang
- Department of Electronic Engineering, Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen, China
| | - Shuo Yang
- Department of Respiratory Medicine, Wuhan Number 1 Hospital, Wuhan, China
| | - Xiaohua Hu
- Department of Information Science, College of Computing and Informatics, Drexel University, Philadelphia, USA
| | - Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
| |
|
31
|
Huang F, Tan EL, Yang P, Huang S, Ou-Yang L, Cao J, Wang T, Lei B. Self-weighted adaptive structure learning for ASD diagnosis via multi-template multi-center representation. Med Image Anal 2020; 63:101662. [PMID: 32442865 DOI: 10.1016/j.media.2020.101662] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/23/2019] [Revised: 01/13/2020] [Accepted: 01/31/2020] [Indexed: 11/25/2022]
Abstract
As a kind of neurodevelopmental disease, autism spectrum disorder (ASD) can cause severe social, communication, interaction, and behavioral challenges. To date, many imaging-based machine learning techniques have been proposed to address ASD diagnosis issues. However, most of these techniques are restricted to a single template or dataset from one imaging center. In this paper, we propose a novel multi-template multi-center ensemble classification scheme for automatic ASD diagnosis. Specifically, based on different pre-defined templates, we construct multiple functional connectivity (FC) brain networks for each subject using our proposed Pearson's correlation-based sparse low-rank representation. After extracting features from these FC networks, our self-weighted adaptive structure learning (SASL) model selects informative features for learning an optimal similarity matrix. For each template, the SASL method automatically assigns an optimal weight learned from the structural information without additional weights and parameters. Finally, an ensemble strategy based on the multi-template multi-center representations is applied to derive the final diagnosis results. Extensive experiments are conducted on the publicly available Autism Brain Imaging Data Exchange (ABIDE) database to demonstrate the efficacy of our proposed method. Experimental results verify that our proposed method boosts ASD diagnosis performance and outperforms state-of-the-art methods.
Affiliation(s)
- Fanglin Huang
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China
| | - Ee-Leng Tan
- School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore
| | - Peng Yang
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China
| | - Shan Huang
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Information Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jiuwen Cao
- Artificial Intelligence Institute, Hangzhou Dianzi University, Zhejiang 310010, China
| | - Tianfu Wang
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China.
| | - Baiying Lei
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China.
| |
|
32
|
Zhang ZC, Zhang XF, Wu M, Ou-Yang L, Zhao XM, Li XL. A graph regularized generalized matrix factorization model for predicting links in biomedical bipartite networks. Bioinformatics 2020; 36:3474-3481. [DOI: 10.1093/bioinformatics/btaa157] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 08/05/2019] [Revised: 02/05/2020] [Accepted: 03/03/2020] [Indexed: 12/13/2022] Open
Abstract
Motivation
Predicting potential links in biomedical bipartite networks can provide useful insights into the diagnosis and treatment of complex diseases and the discovery of novel drug targets. Computational methods have been proposed recently to predict potential links for various biomedical bipartite networks. However, existing methods usually rely on the coverage of known links, and may therefore encounter difficulties when dealing with new nodes that have no known link information.
Results
In this study, we propose a new link prediction method, named graph regularized generalized matrix factorization (GRGMF), to identify potential links in biomedical bipartite networks. First, we formulate a generalized matrix factorization model to exploit the latent patterns behind observed links. In particular, it can take into account the neighborhood information of each node when learning the latent representation for each node, and the neighborhood information of each node can be learned adaptively. Second, we introduce two graph regularization terms to draw support from affinity information of each node derived from external databases to enhance the learning of latent representations. We conduct extensive experiments on six real datasets. Experimental results show that GRGMF can achieve competitive performance on all these datasets, which demonstrates the effectiveness of GRGMF in predicting potential links in biomedical bipartite networks.
Availability and implementation
The package is available at https://github.com/happyalfred2016/GRGMF.
Contact
leouyang@szu.edu.cn
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zi-Chao Zhang
- Guangdong Key Laboratory of Intelligent Information Processing, Key Laboratory of Media Security, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Key Laboratory of Media Security, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen 518060, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Ministry of Education, 200433 China
| | - Xiao-Li Li
- Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
| |
|
33
|
Yuan R, Ou-Yang L, Hu X, Zhang XF. Identifying Gene Network Rewiring Using Robust Differential Graphical Model with Multivariate t-Distribution. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:712-718. [PMID: 30802872 DOI: 10.1109/tcbb.2019.2901473] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/09/2023]
Abstract
Identifying gene network rewiring under different biological conditions is important for understanding the mechanisms underlying complex diseases. Gaussian graphical models, which assume the data follow the multivariate normal distribution, are widely used to identify gene network rewiring. However, the normality assumption often fails in practice, since the data are frequently contaminated by extreme outliers. In this study, we propose a new robust differential graphical model to identify gene network rewiring between two conditions based on the multivariate t-distribution. The multivariate t-distribution is more robust to outliers than the normal distribution since it has heavy tails and allows values far from the mean. A fused lasso penalty is used to borrow information across conditions to improve the results. We develop an expectation maximization algorithm to solve the optimization model. Experimental results on simulated data show that our method outperforms the state-of-the-art methods. Our method is also applied to identify gene network rewiring between luminal A and basal-like subtypes of breast cancer, and gene network rewiring between the proneural and mesenchymal subtypes of glioblastoma. Several key genes which drive gene network rewiring are discovered.
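As an illustration of why a heavy-tailed model resists outliers, the minimal Python sketch below uses the standard E-step weight for a t-distributed observation, (nu + p) / (nu + squared standardized distance), in the univariate case (p = 1). It is not the authors' implementation: the function names, the fixed scale, the degrees of freedom and the toy data are all assumptions made for illustration only.

```python
def t_weights(xs, mu, sigma2, nu):
    """E-step weights for a univariate t model (p = 1).

    Each weight is (nu + 1) / (nu + d2), where d2 is the squared
    standardized distance of the observation from the current location.
    """
    return [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in xs]


def robust_location(xs, nu=3.0, sigma2=1.0, iters=50):
    """Iteratively reweighted mean; the scale is held fixed (a simplification)."""
    mu = sum(xs) / len(xs)  # start from the ordinary sample mean
    for _ in range(iters):
        w = t_weights(xs, mu, sigma2, nu)
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu


# Hypothetical toy data: five values near zero and one extreme outlier.
data = [-0.5, -0.2, 0.0, 0.1, 0.4, 25.0]
plain_mean = sum(data) / len(data)
robust_mean = robust_location(data)
print(plain_mean, robust_mean)
```

Observations far from the current location receive small weights, so the outlier barely moves the reweighted estimate, whereas it dominates the ordinary mean.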
|
34
|
Wang DD, Ou-Yang L, Xie H, Zhu M, Yan H. Predicting the impacts of mutations on protein-ligand binding affinity based on molecular dynamics simulations and machine learning methods. Comput Struct Biotechnol J 2020; 18:439-454. [PMID: 32153730 PMCID: PMC7052406 DOI: 10.1016/j.csbj.2020.02.007] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 09/06/2019] [Revised: 01/31/2020] [Accepted: 02/11/2020] [Indexed: 01/19/2023] Open
Abstract
Purpose Mutation-induced variation of protein-ligand binding affinity is the key to many genetic diseases and the emergence of drug resistance, and therefore predicting such mutation impacts is of great importance. In this work, we aim to predict the mutation impacts on protein-ligand binding affinity using efficient structure-based, computational methods. Methods Relying on consolidated databases of experimentally determined data, we characterize the affinity change upon mutation based on a number of local geometrical features and monitor such feature differences upon mutation during molecular dynamics (MD) simulations. The differences are quantified according to average difference, trajectory-wise distance or time-varying differences. Machine-learning methods are employed to predict the mutation impacts using the resulting conventional or time-series features. Predictions based on estimation of energy and based on investigation of molecular descriptors were conducted as benchmarks. Results Our method (machine-learning techniques using time-series features) outperformed the benchmark methods, especially in terms of the balanced F1 score. Particularly, deep-learning models led to the best prediction performance with distinct improvements in balanced F1 score and a sustained accuracy. Conclusion Our work highlights the effectiveness of the characterization of affinity change upon mutations. Furthermore, deep-learning techniques are well suited to handling the extracted time-series features. This study can lead to a deeper understanding of mutation-induced diseases and resistance, and further guide the development of innovative drug design.
Key Words
- CNN, convolutional neural network
- Deep learning
- HMM, hidden Markov model
- LSTM, long short-term memory
- Local geometrical features
- MD, molecular dynamics
- MM/GBSA, molecular mechanics/generalized born surface area
- MM/PBSA, molecular mechanics/Poisson-Boltzmann surface area
- Missense mutation
- Molecular dynamics (MD) simulations
- Mutation impact
- Protein-ligand binding affinity
- RF, random forest
- RMSD, root-mean-square deviation
- RNN, recurrent neural network
- SASA, solvent accessible surface area
- Time series features
- WTP, wildtype protein
- aacomp, amino acid composition descriptors
- const, constitutional descriptors
- ctd, composition transition and distribution descriptors
- kappa, Kappa shape indices
- paacomp, type 1 pseudo amino acid composition descriptors
- top, topological descriptors
Affiliation(s)
- Debby D. Wang
- Institute of Medical Information Engineering, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Rd, Shanghai 200093, China
- Corresponding author at: Institute of Medical Information Engineering, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Rd, Shanghai 200093, China.
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Electronics and Information Engineering, Shenzhen University, 3688 Nanhai Ave, Shenzhen 518060, China
- Corresponding author at: Institute of Medical Information Engineering, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Rd, Shanghai 200093, China.
| | - Haoran Xie
- Department of Computing and Decision Sciences, Lingnan University, 8 Castle Peak Rd, Tuen Mun, Hong Kong
| | - Mengxu Zhu
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
| |
|
35
|
Jin K, Ou-Yang L, Zhao XM, Yan H, Zhang XF. scTSSR: gene expression recovery for single-cell RNA sequencing using two-side sparse self-representation. Bioinformatics 2020; 36:3131-3138. [DOI: 10.1093/bioinformatics/btaa108] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/16/2019] [Revised: 01/19/2020] [Accepted: 02/12/2020] [Indexed: 11/13/2022] Open
Abstract
Motivation
Single-cell RNA sequencing (scRNA-seq) methods make it possible to reveal gene expression patterns at single-cell resolution. Due to technical defects, dropout events in scRNA-seq add noise to the gene-cell expression matrix and hinder downstream analysis. Therefore, it is important to recover the true gene expression levels before carrying out downstream analysis.
Results
In this article, we develop an imputation method, called scTSSR, to recover gene expression for scRNA-seq. Unlike most existing methods that impute dropout events by borrowing information across only genes or cells, scTSSR simultaneously leverages information from both similar genes and similar cells using a two-side sparse self-representation model. We demonstrate that scTSSR can effectively capture the Gini coefficients of genes and gene-to-gene correlations observed in single-molecule RNA fluorescence in situ hybridization (smRNA FISH). Down-sampling experiments indicate that scTSSR performs better than existing methods in recovering the true gene expression levels. We also show that scTSSR has a competitive performance in differential expression analysis, cell clustering and cell trajectory inference.
Availability and implementation
The R package is available at https://github.com/Zhangxf-ccnu/scTSSR.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ke Jin
- School of Mathematics and Statistics, Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan 430079, China
| | - Le Ou-Yang
- College of Information Engineering, Shenzhen University, Shenzhen 518060, China
| | - Xing-Ming Zhao
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China
| | - Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics, Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan 430079, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China
| |
|
36
|
Huang J, Wu M, Lu F, Ou-Yang L, Zhu Z. Predicting synthetic lethal interactions in human cancers using graph regularized self-representative matrix factorization. BMC Bioinformatics 2019; 20:657. [PMID: 31870274 PMCID: PMC6929405 DOI: 10.1186/s12859-019-3197-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 11/03/2019] [Accepted: 11/05/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Synthetic lethality has attracted a lot of attention in cancer therapeutics due to its utility in identifying new anticancer drug targets. Identifying synthetic lethal (SL) interactions is the key step towards the exploration of synthetic lethality in cancer treatment. However, biological experiments face many challenges when identifying synthetic lethal interactions. Thus, it is necessary to develop computational methods which could serve as useful complements to biological experiments. RESULTS In this paper, we propose a novel graph regularized self-representative matrix factorization (GRSMF) algorithm for synthetic lethal interaction prediction. GRSMF first learns the self-representations from the known SL interactions and further integrates the functional similarities among genes derived from Gene Ontology (GO). It can then effectively predict potential SL interactions by leveraging the information provided by known SL interactions and functional annotations of genes. Extensive experiments on the synthetic lethal interaction data downloaded from the SynLethDB database demonstrate the superiority of our GRSMF in predicting potential synthetic lethal interactions, compared with other competing methods. Moreover, case studies of novel interactions are conducted in this paper to further evaluate the effectiveness of GRSMF in synthetic lethal interaction prediction. CONCLUSIONS In this paper, we demonstrate that by adaptively exploiting the self-representation of original SL interaction data, and utilizing functional similarities among genes to enhance the learning of the self-representation matrix, our GRSMF could predict potential SL interactions more accurately than other state-of-the-art SL interaction prediction methods.
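The abstract does not state the form of the graph regularizer. A common choice in graph regularized factorization models is the Laplacian penalty tr(U^T L U) with L = D - W, which equals half the affinity-weighted sum of squared distances between latent vectors. The Python sketch below checks that identity on a toy example; the function names, the affinity matrix W and the latent matrices are hypothetical and are not taken from GRSMF.

```python
def laplacian_penalty(W, U):
    """Return tr(U^T L U) with L = D - W for a symmetric affinity matrix W."""
    n, k = len(U), len(U[0])
    total = 0.0
    for i in range(n):
        degree = sum(W[i])  # D_ii is the row sum of W
        for j in range(n):
            l_ij = (degree if i == j else 0.0) - W[i][j]
            total += l_ij * sum(U[i][c] * U[j][c] for c in range(k))
    return total


def pairwise_penalty(W, U):
    """Return 0.5 * sum_ij W_ij * ||u_i - u_j||^2 (same value as above)."""
    n = len(U)
    return 0.5 * sum(
        W[i][j] * sum((a - b) ** 2 for a, b in zip(U[i], U[j]))
        for i in range(n)
        for j in range(n)
    )


# Hypothetical affinities among three genes (symmetric, zero diagonal).
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
U_smooth = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # similar genes 0 and 1 agree
U_rough = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # similar genes 0 and 1 differ

print(laplacian_penalty(W, U_smooth), laplacian_penalty(W, U_rough))
```

The penalty is smaller when strongly connected nodes have similar latent vectors, which is how such a term lets external similarity information shape the learned representation.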
Affiliation(s)
- Jiang Huang
- College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, 1 Fusionopolis Way, Singapore, Singapore
| | - Fan Lu
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Electronics and Information Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China
| | - Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Electronics and Information Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China.
| | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China.
| |
Collapse
|
37
|
Ou-Yang L, Zhang XF, Zhao XM, Wang DD, Wang FL, Lei B, Yan H. Joint Learning of Multiple Differential Networks With Latent Variables. IEEE Trans Cybern 2019; 49:3494-3506. [PMID: 29994625 DOI: 10.1109/tcyb.2018.2845838] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
Graphical models have been widely used to learn the conditional dependence structures among random variables. In many controlled experiments, such as studies of disease or drug effectiveness, learning the structural changes of graphical models under two different conditions is of great importance. However, most existing graphical models are developed for estimating a single graph and rest on the tacit assumption that no relevant variables are missing, which wastes the common information provided by multiple heterogeneous data sets and underestimates the influence of latent (unobserved) relevant variables. In this paper, we propose a joint differential network analysis (JDNA) model to jointly estimate multiple differential networks with latent variables from multiple data sets. The JDNA model is built on a penalized D-trace loss function, with group lasso or generalized fused lasso penalties. We implement a proximal gradient-based alternating direction method of multipliers to solve the corresponding convex optimization problems. Extensive simulation experiments demonstrate that the JDNA model outperforms state-of-the-art methods in estimating the structural changes of graphical models. Moreover, a series of experiments on several real-world data sets consistently shows that the proposed JDNA model is effective in identifying differential networks under different conditions.
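To make the group lasso penalty mentioned above more concrete, the following is a minimal sketch of the block soft-thresholding step that proximal methods commonly use for such a penalty. It is not the JDNA implementation: the function name `group_soft_threshold`, the threshold value, and the toy edge values are all illustrative assumptions, and the D-trace loss and latent-variable terms are omitted.

```python
import math

def group_soft_threshold(values, threshold):
    """Proximal operator of threshold * ||v||_2 for one group of coefficients.

    Here a "group" is assumed to hold the same differential edge as
    estimated in several data sets (an illustrative choice).
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= threshold:
        # The whole group is set to zero together, so the edge is
        # dropped from every data set at once.
        return [0.0 for _ in values]
    scale = 1.0 - threshold / norm
    return [scale * v for v in values]

# Hypothetical edge values across three data sets.
strong_edge = group_soft_threshold([3.0, 4.0, 0.0], threshold=1.0)
weak_edge = group_soft_threshold([0.3, -0.4, 0.0], threshold=1.0)
```

In this sketch the strong edge is shrunk but kept, while the weak edge is removed in all three data sets at once, which is the kind of shared sparsity pattern a group penalty encourages.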
|
38
|
Wu N, Huang J, Zhang XF, Ou-Yang L, He S, Zhu Z, Xie W. Weighted Fused Pathway Graphical Lasso for Joint Estimation of Multiple Gene Networks. Front Genet 2019; 10:623. [PMID: 31396259 PMCID: PMC6662592 DOI: 10.3389/fgene.2019.00623] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/01/2019] [Accepted: 06/13/2019] [Indexed: 01/17/2023] Open
Abstract
Gene regulatory networks (GRNs) are often inferred with Gaussian graphical models, which identify the conditional dependence among genes by estimating the corresponding precision matrix. Classical Gaussian graphical models are usually designed for single network estimation and ignore existing knowledge such as pathway information. Therefore, they can neither make use of the common information shared by multiple networks nor use prior information to guide the estimation. In this paper, we propose a new weighted fused pathway graphical lasso (WFPGL) to jointly estimate multiple networks by incorporating prior knowledge derived from known pathways and gene interactions. Based on the assumption that two genes are less likely to be connected if they do not participate together in any pathway, a pathway-based constraint is included in our model. Moreover, we introduce a weighted fused lasso penalty to take into account prior gene interaction data and common information shared by multiple networks. Our model is optimized with the alternating direction method of multipliers (ADMM). Experiments on synthetic data demonstrate that our method outperforms five other state-of-the-art graphical models. We then apply our model to two real datasets. Hub genes in the identified state-specific networks show both shared and specific patterns, which indicates the effectiveness of our model in revealing the underlying mechanisms of complex diseases.
Affiliation(s)
- Nuosi Wu
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Jiang Huang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Xiao-Fei Zhang
- School of Mathematics and Statistics, Central China Normal University, Wuhan, China
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China
- Shan He
- School of Computer Science, University of Birmingham, Birmingham, United Kingdom
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Weixin Xie
- College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
|
39
|
Ou-Yang L, Huang J, Zhang XF, Li YR, Sun Y, He S, Zhu Z. LncRNA-Disease Association Prediction Using Two-Side Sparse Self-Representation. Front Genet 2019; 10:476. [PMID: 31191605 PMCID: PMC6546878 DOI: 10.3389/fgene.2019.00476] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/30/2018] [Accepted: 05/03/2019] [Indexed: 01/04/2023] Open
Abstract
Increasing evidence indicates the involvement of long non-coding RNAs (lncRNAs) in various biological processes. As mutations and abnormalities of lncRNAs are closely related to the progression of complex diseases, the identification of lncRNA-disease associations has become an important step toward the understanding and treatment of diseases. Since only a limited number of lncRNA-disease associations have been validated, an increasing number of computational approaches have been developed for predicting potential lncRNA-disease associations. However, predicting potential associations precisely through computational approaches remains challenging. In this study, we propose a novel two-side sparse self-representation (TSSR) algorithm for lncRNA-disease association prediction. By adaptively learning the self-representations of lncRNAs and diseases from known lncRNA-disease associations, and by leveraging both the known lncRNA-disease associations and the intra-associations among lncRNAs and among diseases derived from other existing databases, our model can effectively use the estimated representations of lncRNAs and diseases to predict potential lncRNA-disease associations. The experimental results on three real data sets demonstrate that TSSR significantly outperforms other competing methods. Moreover, to further evaluate the effectiveness of TSSR in predicting potential lncRNA-disease associations, case studies of melanoma, glioblastoma, and glioma are carried out. The results demonstrate that TSSR can effectively identify candidate lncRNAs associated with these three diseases.
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen, China; FJKLMAA (Fujian Key Laboratory of Mathematical Analysis and Applications), Fujian Normal University, Fuzhou, China
- Jiang Huang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Xiao-Fei Zhang
- School of Mathematics and Statistics and Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Yan-Ran Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Yiwen Sun
- School of Medicine, Shenzhen University, Shenzhen, China
- Shan He
- School of Computer Science, University of Birmingham, Birmingham, United Kingdom
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
|
40
|
Ata SK, Ou-Yang L, Fang Y, Kwoh CK, Wu M, Li XL. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC Syst Biol 2018; 12:138. [PMID: 30598097 PMCID: PMC6311944 DOI: 10.1186/s12918-018-0662-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND Predicting disease-causative genes (or simply, disease genes) plays a critical role in understanding the genetic basis of human diseases and in providing disease treatment guidelines. While various computational methods have been proposed for disease gene prediction, the increasing availability of biological information for genes provides strong motivation to leverage these valuable data sources and extract useful information for accurately predicting disease genes. RESULTS We present an integrative framework called N2VKO to predict disease genes. First, we learn node embeddings for genes from a protein-protein interaction (PPI) network by adapting the well-known representation learning method node2vec. Second, we combine the learned node embeddings with various biological annotations as a rich feature representation for genes, and subsequently build binary classification models for disease gene prediction. Finally, as the data for disease gene prediction are usually imbalanced (i.e. the number of causative genes for a specific disease is much smaller than the number of non-causative genes), we address this serious data imbalance issue by applying oversampling techniques to improve the prediction performance. Comprehensive experiments demonstrate that the proposed N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases. CONCLUSIONS In this study, we show that node embeddings learned from PPI networks work well for disease gene prediction, while integrating node embeddings with other biological annotations further improves the performance of classification models. Moreover, oversampling techniques for imbalance correction further enhance the prediction performance. In addition, a literature search of the predicted disease genes also shows the effectiveness of the proposed N2VKO framework for disease gene prediction.
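The imbalance-correction step described above can be illustrated with a small sketch. The abstract does not say which oversampling technique N2VKO uses, so plain random oversampling is assumed here purely for illustration; the function name `random_oversample`, the seed, and the toy feature vectors are hypothetical.

```python
import random

def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    # Every class is grown to the size of the largest class.
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical toy data: label 1 = disease gene, label 0 = non-disease gene.
# Each row stands in for a gene's combined embedding and annotation features.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.6, 0.4]]
y = [1, 1, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
```

After this step both classes contribute the same number of rows, so a binary classifier trained on `X_bal` and `y_bal` is less inclined to predict the majority class for every gene. Oversampling should be applied to the training split only, so that duplicated samples do not leak into evaluation data.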
Affiliation(s)
- Sezin Kircali Ata
- Department of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Le Ou-Yang
- Department of Electronic Engineering, College of Information Engineering, Shenzhen University, China, Singapore, Singapore
- Yuan Fang
- School of Information Systems, Singapore Management University, Singapore, Singapore
- Chee-Keong Kwoh
- Department of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Min Wu
- Data Analytics Department, Institute for Infocomm Research, Singapore, Singapore
- Xiao-Li Li
- Data Analytics Department, Institute for Infocomm Research, Singapore, Singapore
|
41
|
Xu T, Ou-Yang L, Hu X, Zhang XF. Identifying Gene Network Rewiring by Integrating Gene Expression and Gene Network Data. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:2079-2085. [PMID: 29994068 DOI: 10.1109/tcbb.2018.2809603] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
Exploring the rewiring pattern of gene regulatory networks between different pathological states is an important task in bioinformatics. Although a number of computational approaches have been developed to infer differential networks from high-throughput data, most of them focus only on gene expression data. The valuable static gene regulatory network data accumulated in recent biomedical research is neglected. In this study, we propose a new Gaussian graphical model-based method to infer differential networks by integrating gene expression and static gene regulatory network data. We first evaluate the empirical performance of our method by comparing it with state-of-the-art methods on simulated data. We also apply our method to The Cancer Genome Atlas data to identify gene network rewiring between ovarian cancers with different platinum responses, and rewiring between breast cancers of the luminal A and basal-like subtypes. Hub genes in the estimated differential networks rediscover known genes associated with platinum resistance in ovarian cancer and signatures of the breast cancer intrinsic subtypes.
|
42
|
Tu JJ, Ou-Yang L, Hu X, Zhang XF. Identifying gene network rewiring by combining gene expression and gene mutation data. IEEE/ACM Trans Comput Biol Bioinform 2018; 16:1042-1048. [PMID: 29993891 DOI: 10.1109/tcbb.2018.2834529] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
Understanding how gene dependency networks rewire between different disease states is an important task in genomic research. Although many computational methods have been proposed to undertake this task via differential network analysis, most of them are designed for a predefined data type. With the development of high-throughput technologies, gene activity measurements can be collected from different aspects (e.g., mRNA expression and DNA mutation). Different data types might share some common characteristics while also having unique properties. New methods are needed to explore the similarities and differences between differential networks estimated from different data types. In this study, we develop a new differential network inference model that identifies gene network rewiring by combining gene expression and gene mutation data. Similarities and differences between data types are learned via a group bridge penalty function. Simulation studies demonstrate that our method consistently outperforms the competing methods. We also apply our method to identify gene network rewiring associated with ovarian cancer platinum resistance. Some differential edges are common to both data types and some are unique to individual data types. Hub genes in the differential networks inferred by our method play important roles in ovarian cancer drug resistance.
|
43
|
Zhang XF, Ou-Yang L, Yang S, Hu X, Yan H. DiffGraph: an R package for identifying gene network rewiring using differential graphical models. Bioinformatics 2017; 34:1571-1573. [DOI: 10.1093/bioinformatics/btx836] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/04/2017] [Accepted: 12/21/2017] [Indexed: 01/28/2023] Open
Affiliation(s)
- Xiao-Fei Zhang
- Department of Statistics, School of Mathematics and Statistics, Central China Normal University, Wuhan, China
- Le Ou-Yang
- Department of Electronic Engineering, College of Information Engineering, Shenzhen University, Shenzhen, China
- Shuo Yang
- Department of Respiratory Medicine, Wuhan Number 1 Hospital, Wuhan, China
- Xiaohua Hu
- Department of Computer Science, School of Computer, Central China Normal University, Wuhan, China
- Department of Information Science, College of Computing and Informatics, Drexel University, Philadelphia, USA
- Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
|
44
|
Abstract
Background The accurate identification of protein complexes is important for the understanding of cellular organization. To date, computational methods for protein complex detection have mostly focused on mining clusters from protein-protein interaction (PPI) networks. However, PPI data collected by high-throughput experimental techniques are known to be quite noisy, and it is hard to achieve reliable prediction results by simply applying computational methods to PPI data. Behind protein interactions, there are protein domains that interact with each other. Therefore, based on domain-protein associations, the joint analysis of PPIs and domain-domain interactions (DDIs) has the potential to achieve better performance in protein complex detection. As traditional computational methods are designed to detect protein complexes from a single PPI network, it is necessary to design a new algorithm that can effectively use the information inherent in multiple heterogeneous networks. Results In this paper, we introduce a novel multi-network clustering algorithm to detect protein complexes from multiple heterogeneous networks. Unlike existing protein complex identification algorithms that focus on the analysis of a single PPI network, our model can jointly exploit the information inherent in PPI and DDI data to achieve more reliable prediction results. Extensive experimental results on real-world data sets demonstrate that our method can predict protein complexes more accurately than other state-of-the-art protein complex identification algorithms. Conclusions In this work, we demonstrate that the joint analysis of PPI and DDI networks can help to improve the accuracy of protein complex detection.
Affiliation(s)
- Le Ou-Yang
- College of Information Engineering & Shenzhen Key Laboratory of Media Security, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China
- Hong Yan
- College of Information Engineering & Shenzhen Key Laboratory of Media Security, Shenzhen University, Nanhai Ave 3688, Shenzhen, 518060, China; Department of Electronic Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, China
- Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, 430079, China
|
45
|
Guo R, Li YR, He S, Ou-Yang L, Sun Y, Zhu Z. RepLong: de novo repeat identification using long read sequencing data. Bioinformatics 2017; 34:1099-1107. [DOI: 10.1093/bioinformatics/btx717] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 07/06/2017] [Accepted: 11/04/2017] [Indexed: 11/12/2022] Open
Affiliation(s)
- Rui Guo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Yan-Ran Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
- Shan He
- School of Computer Science, University of Birmingham, Birmingham, UK
- Centre for Computational Biology, School of Biosciences, University of Birmingham, Birmingham, UK
- Le Ou-Yang
- College of Information Science, Shenzhen University, Shenzhen, China
- Yiwen Sun
- School of Medicine, Shenzhen University, Shenzhen, China
- Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
|
46
|
Ou-Yang L, Zhang XF, Wu M, Li XL. Node-based learning of differential networks from multi-platform gene expression data. Methods 2017; 129:41-49. [DOI: 10.1016/j.ymeth.2017.05.014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/15/2017] [Revised: 04/11/2017] [Accepted: 05/18/2017] [Indexed: 01/07/2023] Open
|
47
|
Zhang XF, Ou-Yang L, Yan H. Node-based differential network analysis in genomics. Comput Biol Chem 2017; 69:194-201. [DOI: 10.1016/j.compbiolchem.2017.03.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/24/2017] [Accepted: 03/27/2017] [Indexed: 12/26/2022]
|
48
|
Abstract
Exploring how the structure of a gene regulatory network differs between two different disease states is fundamental for understanding the biological mechanisms behind disease development and progression. Recently, with rapid advances in microarray technologies, gene expression profiles of the same patients can be collected from multiple microarray platforms. However, previous differential network analysis methods were usually developed based on a single type of platform, which could not utilize the common information shared across different platforms. In this study, we introduce a multi-view differential network analysis model to infer the differential network between two different patient groups based on gene expression profiles collected from multiple platforms. Unlike previous differential network analysis models that need to analyze each platform separately, our model can draw support from multiple data platforms to jointly estimate the differential networks and produce more accurate and reliable results. Our simulation studies demonstrate that our method consistently outperforms other available differential network analysis methods. We also applied our method to identify network rewiring associated with platinum resistance using TCGA ovarian cancer samples. The experimental results demonstrate that the hub genes in our identified differential networks on the PI3K/AKT/mTOR pathway play an important role in drug resistance.
Affiliation(s)
- Le Ou-Yang
- College of Information Engineering, Shenzhen University, Shenzhen, China and Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
- Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
- Xiao-Fei Zhang
- School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
|
49
|
Wu M, Ou-Yang L, Li XL. Protein Complex Detection via Effective Integration of Base Clustering Solutions and Co-Complex Affinity Scores. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:733-739. [PMID: 27071190 DOI: 10.1109/tcbb.2016.2552176] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 06/05/2023]
Abstract
With the increasing availability of protein interaction data, various computational methods have been developed to predict protein complexes. However, different computational methods have their own advantages and limitations. Ensemble clustering has thus been studied to minimize the potential bias and risk of individual methods and to generate prediction results with better coverage and accuracy. In this paper, we extend traditional ensemble clustering by taking into account co-complex affinity scores and present an Ensemble Hierarchical Clustering framework (EnsemHC) to detect protein complexes. First, we construct co-cluster matrices by integrating the clustering results with the co-complex evidence. Second, we sum the constructed co-cluster matrices to derive a final ensemble matrix via a novel iterative weighting scheme. Finally, we apply hierarchical clustering to generate protein complexes from the final ensemble matrix. Experimental results demonstrate that EnsemHC performs better than its base clustering methods and various existing integrative methods. In addition, we observed that integrating clusters and co-complex affinity scores from different data sources improves the prediction performance; for example, integrating the clusters from TAP data with co-complex affinities from binary PPI data achieved the best performance in our experiments.
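The three steps above (build co-cluster matrices, combine them into one ensemble matrix, then cluster hierarchically) can be sketched in a few lines. This is a simplified illustration rather than EnsemHC itself: it assumes uniform weights in place of the paper's iterative weighting scheme, leaves out the co-complex affinity scores, and uses a plain average-linkage merge with a hypothetical `min_similarity` stopping threshold. All function names and the toy clusterings are illustrative.

```python
def co_association(base_clusterings, weights=None):
    """Weighted fraction of base clusterings that co-cluster each pair of proteins."""
    n = len(base_clusterings[0])
    if weights is None:
        # Assumption: uniform weights instead of an iterative weighting scheme.
        weights = [1.0] * len(base_clusterings)
    total = sum(weights)
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(base_clusterings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w
    return [[value / total for value in row] for row in matrix]

def average_linkage(similarity, min_similarity):
    """Merge the most similar pair of clusters until no pair reaches min_similarity."""
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                score = sum(similarity[i][j] for i, j in pairs) / len(pairs)
                if score > best:
                    best, best_pair = score, (a, b)
        if best < min_similarity:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical base clusterings of six proteins (one label list per method).
base = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 2, 2],
    [0, 0, 0, 0, 1, 1],
]
ensemble = co_association(base)
complexes = average_linkage(ensemble, min_similarity=0.5)
```

In this toy input the three methods agree on proteins 0 to 2 and on proteins 4 and 5, but disagree about protein 3, so the ensemble keeps protein 3 as a singleton rather than forcing it into either group.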
|
50
|
Zhang XF, Ou-Yang L, Yan H. Incorporating prior information into differential network analysis using non-paranormal graphical models. Bioinformatics 2017; 33:2436-2445. [DOI: 10.1093/bioinformatics/btx208] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 11/04/2016] [Accepted: 04/05/2017] [Indexed: 02/02/2023] Open
Affiliation(s)
- Xiao-Fei Zhang
- Department of Statistics, School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Wuhan, China
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
- Le Ou-Yang
- Department of Electronic Engineering, College of Information Engineering, Shenzhen University, Shenzhen, China
- Hong Yan
- Department of Electronic Engineering, City University of Hong Kong, Hong Kong, China
|