1
Wang Y, Guan T, Zhou G, Zhao H, Gao J. SOJNMF: Identifying Multidimensional Molecular Regulatory Modules by Sparse Orthogonality-Regularized Joint Non-Negative Matrix Factorization Algorithm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3695-3703. [PMID: 34546925 DOI: 10.1109/tcbb.2021.3114146] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Cancer is not only a very aggressive but also a very diverse disease. Recent advances in high-throughput omics technologies have given biomedical researchers more opportunities to study the multi-level biological regulatory mechanisms of cancer. However, there are few methods that explore the underlying mechanism of cancer by identifying multidimensional molecular regulatory modules from multidimensional omics data. In this paper, we propose a sparse orthogonality-regularized joint non-negative matrix factorization (SOJNMF) algorithm that can integratively analyze multidimensional omics data. This method can not only identify multidimensional molecular regulatory modules, but also reduce the overlap rate of features among the multidimensional modules while ensuring the sparsity of the coefficient matrix after decomposition. Gene expression data, miRNA expression data and gene methylation data of liver cancer are integratively analyzed using the SOJNMF algorithm, yielding 238 multidimensional molecular regulatory modules. The results of a permutation test indicate that the different omics features within these modules are statistically significantly correlated. Meanwhile, the results of functional enrichment analysis show that these multidimensional modules are significantly related to the underlying mechanism of the occurrence and development of liver cancer.
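The abstract does not say how its permutation test was set up. As a rough illustration only, the sketch below assumes the simplest variant: the absolute Pearson correlation between two features of a module (for example a gene and a miRNA) is compared with the correlations obtained after shuffling one of them. The toy values and the names `pearson` and `permutation_pvalue` are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the sample pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression profiles over ten samples (not real data).
gene = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1, 9.0, 9.9]
mirna = [9.5, 9.1, 8.2, 7.0, 6.1, 5.5, 4.0, 3.2, 2.1, 1.0]
p_value = permutation_pvalue(gene, mirna)
print(f"permutation p-value: {p_value:.4f}")
```

A module-level test would presumably aggregate over many feature pairs; the single-pair version is kept here only because it shows the shuffling logic most directly.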
2
Tappu R, Haas J, Lehmann DH, Sedaghat-Hamedani F, Kayvanpour E, Keller A, Katus HA, Frey N, Meder B. Multi-omics assessment of dilated cardiomyopathy using non-negative matrix factorization. PLoS One 2022; 17:e0272093. [PMID: 35980883 PMCID: PMC9387871 DOI: 10.1371/journal.pone.0272093] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/14/2021] [Accepted: 07/11/2022] [Indexed: 11/19/2022] Open
Abstract
Dilated cardiomyopathy (DCM), a myocardial disease, is heterogeneous and often results in heart failure and sudden cardiac death. The unavailability of cardiac tissue has hindered the comprehensive exploration of gene regulatory networks and nodal players in DCM. In this study, we carried out an integrated analysis of transcriptome and methylome data from a cohort of DCM patients using non-negative matrix factorization to uncover underlying latent factors and covarying features between whole-transcriptome and epigenome omics datasets from tissue biopsies of living patients. DNA methylation data from Infinium HM450 and mRNA Illumina sequencing of n = 33 DCM and n = 24 control probands were filtered, analyzed and used as input for matrix factorization using the R NMF package. A Mann-Whitney U test showed that 4 out of 5 latent factors differ significantly between DCM and control probands (P<0.05). Characterization of the top 10% of features driving each latent factor showed a significant enrichment of biological processes known to be involved in DCM pathogenesis, including immune response (P = 3.97E-21), nucleic acid binding (P = 1.42E-18), extracellular matrix (P = 9.23E-14) and myofibrillar structure (P = 8.46E-12). Correlation network analysis revealed interaction of important sarcomeric genes like Nebulin, Tropomyosin alpha-3 and ERC-protein 2 with CpG methylation of ATPase Phospholipid Transporting 11A0, Solute Carrier Family 12 Member 7 and Leucine Rich Repeat Containing 14B, all with significant P values associated with correlation coefficients >0.7. Using matrix factorization, multi-omics data derived from human tissue samples can be integrated and novel interactions can be identified. The hypothesis-generating nature of such analysis could help to better understand the pathophysiology of complex traits such as DCM.
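To make the group comparison concrete, the sketch below computes a two-sided Mann-Whitney U test for one latent factor, assuming each proband has a single non-negative loading on that factor. It uses the normal approximation without tie or continuity correction, which is a simplification; in practice a library routine such as SciPy's `scipy.stats.mannwhitneyu` would normally be used. The loadings are invented for illustration and are not values from the study.

```python
import math


def mann_whitney_u(a, b):
    """Return (U for group a, two-sided p-value) via the normal approximation.

    Simplified sketch: midranks are used for ties, but the variance has no
    tie correction and no continuity correction is applied.
    """
    combined = sorted((value, group) for group, values in enumerate((a, b))
                      for value in values)
    n = len(combined)
    rank_sum_a = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p


# Hypothetical loadings of one latent factor (not data from the study).
dcm_loadings = [0.9, 1.1, 1.3, 1.4, 1.6, 1.8]
control_loadings = [0.2, 0.3, 0.5, 0.6, 0.7, 0.8]
u_stat, p_value = mann_whitney_u(dcm_loadings, control_loadings)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

With five latent factors the same test would be repeated per factor, and a multiple-testing correction may be worth considering.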
Affiliation(s)
- Rewati Tappu
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Jan Haas
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- David H. Lehmann
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- Farbod Sedaghat-Hamedani
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Elham Kayvanpour
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Andreas Keller
- Department of Clinical Bioinformatics, Medical Faculty, Saarland University, Saarbrücken, Germany
- Hugo A. Katus
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Norbert Frey
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Benjamin Meder
- Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Department of Genetics, Stanford University School of Medicine, Palo Alto, California, United States of America
3
Hamamoto R, Takasawa K, Machino H, Kobayashi K, Takahashi S, Bolatkan A, Shinkai N, Sakai A, Aoyama R, Yamada M, Asada K, Komatsu M, Okamoto K, Kameoka H, Kaneko S. Application of non-negative matrix factorization in oncology: one approach for establishing precision medicine. Brief Bioinform 2022; 23:6628783. [PMID: 35788277 PMCID: PMC9294421 DOI: 10.1093/bib/bbac246] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 04/01/2022] [Revised: 05/06/2022] [Accepted: 05/25/2022] [Indexed: 12/19/2022] Open
Abstract
The increase in the expectations of artificial intelligence (AI) technology has led to machine learning technology being actively used in the medical field. Non-negative matrix factorization (NMF) is a machine learning technique used for image analysis, speech recognition, and language processing; recently, it is being applied to medical research. Precision medicine, wherein important information is extracted from large-scale medical data to provide optimal medical care for every individual, is considered important in medical policies globally, and the application of machine learning techniques to this end is being handled in several ways. NMF is also introduced differently because of the characteristics of its algorithms. In this review, the importance of NMF in the field of medicine, with a focus on the field of oncology, is described by explaining the mathematical science of NMF and the characteristics of the algorithm, providing examples of how NMF can be used to establish precision medicine, and presenting the challenges of NMF. Finally, the direction regarding the effective use of NMF in the field of oncology is also discussed.
Affiliation(s)
- Rina Aoyama
- Showa University Graduate School of Medicine School of Medicine
- Ken Asada
- RIKEN Center for Advanced Intelligence Project
4
Sienkiewicz K, Chen J, Chatrath A, Lawson JT, Sheffield NC, Zhang L, Ratan A. Detecting molecular subtypes from multi-omics datasets using SUMO. CELL REPORTS METHODS 2022; 2:100152. [PMID: 35211690 PMCID: PMC8865426 DOI: 10.1016/j.crmeth.2021.100152] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 08/27/2021] [Accepted: 12/21/2021] [Indexed: 12/31/2022]
Abstract
We present a data integration framework that uses non-negative matrix factorization of patient-similarity networks to integrate continuous multi-omics datasets for molecular subtyping. It is demonstrated to have the capability to handle missing data without using imputation and to be consistently among the best in detecting subtypes with differential prognosis and enrichment of clinical associations in a large number of cancers. When applying the approach to data from individuals with lower-grade gliomas, we identify a subtype with a significantly worse prognosis. Tumors assigned to this subtype are hypomethylated genome wide with a gain of AP-1 occupancy in demethylated distal enhancers. The tumors are also enriched for somatic chromosome 7 (chr7) gain, chr10 loss, and other molecular events that have been suggested as diagnostic markers for "IDH wild type, with molecular features of glioblastoma" by the cIMPACT-NOW consortium but have yet to be included in the World Health Organization (WHO) guidelines.
Affiliation(s)
- Karolina Sienkiewicz
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Jinyu Chen
- Department of Mathematics and Computational Biology Program, National University of Singapore, Singapore 119076, Singapore
- Ajay Chatrath
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
- John T. Lawson
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Nathan C. Sheffield
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
- University of Virginia Cancer Center, Charlottesville, VA 22908, USA
- Louxin Zhang
- Department of Mathematics and Computational Biology Program, National University of Singapore, Singapore 119076, Singapore
- Aakrosh Ratan
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908, USA
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22908, USA
- University of Virginia Cancer Center, Charlottesville, VA 22908, USA
5
Qiu Y, Ching WK, Zou Q. Matrix factorization-based data fusion for the prediction of RNA-binding proteins and alternative splicing event associations during epithelial-mesenchymal transition. Brief Bioinform 2021; 22:6354719. [PMID: 34410342 DOI: 10.1093/bib/bbab332] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/07/2021] [Revised: 07/11/2021] [Accepted: 07/29/2021] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION The epithelial-mesenchymal transition (EMT) is a cellular-developmental process activated during tumor metastasis. Transcriptional regulatory networks controlling EMT are well studied; however, alternative RNA splicing also plays a critical regulatory role during this process. Unfortunately, a comprehensive understanding of alternative splicing (AS) and the RNA-binding proteins (RBPs) that regulate it during EMT is still lacking. Therefore, a great need exists to develop effective computational methods for predicting associations of RBPs and AS events. Dramatically increasing data sources that have direct and indirect information associated with RBPs and AS events have provided an ideal platform for inferring these associations. RESULTS In this study, we propose a novel method for RBP-AS target prediction based on weighted data fusion with sparse matrix tri-factorization (WDFSMF for short) that simultaneously decomposes heterogeneous data source matrices into low-rank matrices to reveal hidden associations. WDFSMF can select and integrate data sources by assigning different weights to those sources, and these weights can be assigned automatically. In addition, WDFSMF can identify significant RBP complexes regulating AS events and eliminate noise and outliers from the data. Our proposed method achieves an area under the receiver operating characteristic curve (AUC) of 90.78%, which shows that WDFSMF can effectively predict RBP-AS event associations with higher accuracy compared with previous methods. Furthermore, this study identifies significant RBPs as complexes for AS events during EMT and provides solid ground for further investigation into RNA regulation during EMT and metastasis. WDFSMF is a general data fusion framework, and as such it can also be adapted to predict associations between other biological entities.
Affiliation(s)
- Yushan Qiu
- College of Mathematics and Statistics, Shenzhen University, 518000 Guangdong, China
- Wai-Ki Ching
- Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
6
Shiga M, Seno S, Onizuka M, Matsuda H. SC-JNMF: single-cell clustering integrating multiple quantification methods based on joint non-negative matrix factorization. PeerJ 2021; 9:e12087. [PMID: 34532161 PMCID: PMC8404576 DOI: 10.7717/peerj.12087] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/10/2021] [Accepted: 08/07/2021] [Indexed: 11/20/2022] Open
Abstract
Single-cell RNA-sequencing is a rapidly evolving technology that enables us to understand biological processes at unprecedented resolution. Single-cell expression analysis requires a complex data processing pipeline, and the pipeline is divided into two main parts: The quantification part, which converts the sequence information into gene-cell matrix data; the analysis part, which analyzes the matrix data using statistics and/or machine learning techniques. In the analysis part, unsupervised cell clustering plays an important role in identifying cell types and discovering cell diversity and subpopulations. Identified cell clusters are also used for subsequent analysis, such as finding differentially expressed genes and inferring cell trajectories. However, single-cell clustering using gene expression profiles shows different results depending on the quantification methods. Clustering results are greatly affected by the quantification method used in the upstream process. In other words, even if the original RNA-sequence data is the same, gene expression profiles processed by different quantification methods will produce different clusters. In this article, we propose a robust and highly accurate clustering method based on joint non-negative matrix factorization (joint-NMF) by utilizing the information from multiple gene expression profiles quantified using different methods from the same RNA-sequence data. Our joint-NMF can extract common factors among multiple gene expression profiles by applying each NMF under the constraint that one of the factorized matrices is shared among multiple NMFs. The joint-NMF determines more robust and accurate cell clustering results by leveraging multiple quantification methods compared to conventional clustering methods, which use only a single gene expression profile. Additionally, we showed the usefulness of discovering marker genes with the extracted features using our method.
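The shared-factor constraint described in this abstract can be sketched with standard multiplicative updates. The code below assumes the plain objective sum_k ||X_k - W_k H||_F^2, where each X_k is a genes-by-cells matrix from one quantification method and H (rank by cells) is the matrix shared across all factorizations. Any regularization or weighting used by the published SC-JNMF method is not reproduced, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def joint_nmf(matrices, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize each X_k ~ W_k @ H with one H shared by all matrices."""
    rng = np.random.default_rng(seed)
    n_cells = matrices[0].shape[1]
    Ws = [rng.random((X.shape[0], rank)) for X in matrices]
    H = rng.random((rank, n_cells))
    for _ in range(n_iter):
        # Update each method-specific basis W_k with H held fixed.
        for k, X in enumerate(matrices):
            Ws[k] *= (X @ H.T) / (Ws[k] @ H @ H.T + eps)
        # Update the shared H using information pooled over all matrices.
        numer = sum(W.T @ X for W, X in zip(Ws, matrices))
        denom = sum(W.T @ W for W in Ws) @ H + eps
        H *= numer / denom
    return Ws, H


def joint_loss(matrices, Ws, H):
    """Sum of squared Frobenius reconstruction errors."""
    return sum(np.linalg.norm(X - W @ H) ** 2 for X, W in zip(matrices, Ws))


# Synthetic example: two "quantifications" of the same 40 cells.
data_rng = np.random.default_rng(1)
H_true = data_rng.random((3, 40))
mats = [data_rng.random((50, 3)) @ H_true, data_rng.random((60, 3)) @ H_true]

loss_start = joint_loss(mats, *joint_nmf(mats, rank=3, n_iter=0))
Ws, H = joint_nmf(mats, rank=3, n_iter=200)
loss_end = joint_loss(mats, Ws, H)

# One simple way to read clusters off H: the dominant factor per cell.
labels = H.argmax(axis=0)
print(loss_start, loss_end, labels[:10])
```

Taking the arg-max over H is only one of several possible ways to derive cluster labels; clustering the columns of H with another algorithm is an equally reasonable choice.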
Affiliation(s)
- Mikio Shiga
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Shigeto Seno
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Makoto Onizuka
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
- Hideo Matsuda
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
7
Jiang X, Chen M, Song W, Lin GN. Label propagation-based semi-supervised feature selection on decoding clinical phenotypes with RNA-seq data. BMC Med Genomics 2021; 14:141. [PMID: 34465339 PMCID: PMC8406783 DOI: 10.1186/s12920-021-00985-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2020] [Accepted: 05/14/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Clinically, behavioral, cognitive, and mental functions are affected during neurodegenerative disease progression. To date, the molecular pathogenesis of these complex diseases is still unclear. With the rapid development of sequencing technologies, it is possible to decode the molecular mechanisms corresponding to different clinical phenotypes at the genome-wide transcriptomic level using computational methods. Our previous studies have shown that it is difficult to distinguish disease genes from non-disease genes. Therefore, to precisely explore the molecular pathogenesis underlying complex clinical phenotypes, it is better to identify biomarkers corresponding to different disease stages or clinical phenotypes. In this study, we designed a label propagation-based semi-supervised feature selection approach (LPFS) to prioritize disease-associated genes corresponding to different disease stages or clinical phenotypes. METHODS In this study, for the first time, we put label propagation clustering and feature selection into one framework and proposed a label propagation-based semi-supervised feature selection approach. LPFS prioritizes disease genes related to different disease stages or phenotypes by alternating between label propagation clustering based on a sample network and feature selection with gene expression profiles. GO and KEGG pathway enrichment analyses and gene functional analysis were then carried out to explore the molecular mechanisms of specific disease phenotypes, and thus to decode the changes in individual behavioral and mental characteristics during neurodegenerative disease progression. RESULTS Extensive experiments were conducted to verify the performance of LPFS with Huntington's disease gene expression data. Experimental results showed that LPFS performs better than the state-of-the-art methods. GO and KEGG enrichment analysis of key gene sets showed that the TGF-beta signaling pathway, cytokine-cytokine receptor interaction, immune response, and inflammatory response were gradually affected during Huntington's disease progression. In addition, we found that the expression of SLC4A11, ZFP474, AMBP, TOP2A, PBK, CCDC33, APSL, DLGAP5, and Al662270 changed markedly as the disease developed. CONCLUSIONS In this study, we designed a label propagation-based semi-supervised feature selection model to precisely select key genes of different disease phenotypes. We conducted experiments using the model with Huntington's disease mouse gene expression data to decode the disease mechanisms. We found that many cell types, including astrocytes, microglia, and GABAergic neurons, could be involved in the pathological process.
Affiliation(s)
- Xue Jiang
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Miao Chen
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weichen Song
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Guan Ning Lin
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai, 200030 China
8
Jiang X, Pan W, Chen M, Wang W, Song W, Lin GN. Integrative enrichment analysis of gene expression based on an artificial neuron. BMC Med Genomics 2021; 14:173. [PMID: 34433483 PMCID: PMC8386081 DOI: 10.1186/s12920-021-00988-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2020] [Accepted: 05/18/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Huntington's disease is a chronic progressive neurodegenerative disease with complex pathogenic mechanisms. To date, the pathogenesis of Huntington's disease is still not fully understood, and there is no effective treatment. The rapid development of high-throughput sequencing technologies makes it possible to explore the molecular mechanisms at the transcriptome level. Our previous studies on Huntington's disease have shown that it is difficult to distinguish disease-associated genes from non-disease genes. Meanwhile, recent progress in biomedicine shows that the molecular origin of chronic complex diseases may not lie in the diseased tissue, and that genes differentially expressed between different tissues may help to reveal the molecular origin of chronic diseases. Therefore, developing integrative computational analysis methods for multi-tissue gene expression data, and exploring the relationship between the disease and genes differentially expressed in different tissues, can greatly accelerate the molecular discovery process. METHODS To analyze intra- and inter-tissue differentially expressed genes, we designed an integrative enrichment analysis method based on an artificial neuron (IEAAN). First, we calculated the differential expression scores of each gene, which are treated as features of that gene, using a fold-change approach with intra- and inter-tissue gene expression data. Then, we passed the weighted sum of all the differential expression scores through a sigmoid function to obtain a differential expression enrichment score. Finally, we ranked the genes according to the enrichment score. Top-ranking genes are taken to be the potential disease-associated genes. RESULTS In this study, we conducted extensive experiments to analyze intra- and inter-tissue differentially expressed genes. Experimental results showed that genes differentially expressed between different tissues are more likely to be Huntington's disease-associated genes. Five disease-associated genes were selected in this study, two of which have been reported to be implicated in Huntington's disease. CONCLUSIONS We proposed a novel integrative enrichment analysis method based on an artificial neuron (IEAAN), which displays better prediction precision for disease-associated genes than the state-of-the-art statistics-based methods. Our comprehensive evaluation suggests that genes differentially expressed between striatum and liver tissues of healthy individuals are more likely to be Huntington's disease-associated genes.
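The three METHODS steps (fold-change scores, a weighted sum passed through a sigmoid, then ranking) can be sketched as a single artificial neuron. Everything concrete below is an assumption made for illustration: the use of absolute log2 fold change with a pseudocount, the weights, the zero bias, and the toy gene names and mean expression values. None of it is taken from the paper.

```python
import math


def abs_log2_fold_change(mean_a, mean_b, pseudocount=1.0):
    """Absolute log2 fold change between two mean expression levels."""
    return abs(math.log2((mean_a + pseudocount) / (mean_b + pseudocount)))


def enrichment_score(de_scores, weights, bias=0.0):
    """Single artificial neuron: sigmoid of the weighted sum of the inputs."""
    z = sum(w * s for w, s in zip(weights, de_scores)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical mean expression pairs per gene for three comparisons:
# (intra-tissue 1, intra-tissue 2, inter-tissue). Not real data.
comparisons = {
    "geneA": [(8.0, 2.0), (7.0, 7.0), (15.0, 1.0)],
    "geneB": [(5.0, 5.0), (4.0, 4.0), (6.0, 6.0)],
    "geneC": [(3.0, 1.0), (9.0, 4.0), (3.0, 3.0)],
}
weights = [0.5, 0.5, 1.0]  # assumed: the inter-tissue comparison weighted higher

scores = {
    gene: enrichment_score([abs_log2_fold_change(a, b) for a, b in pairs], weights)
    for gene, pairs in comparisons.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A gene with no differential expression in any comparison gets a score of exactly 0.5 under a zero bias, so scores near 0.5 mark the uninformative end of the ranking in this sketch.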
Affiliation(s)
- Xue Jiang
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weihao Pan
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Miao Chen
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weidi Wang
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Weichen Song
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Guan Ning Lin
- Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai, 200030 China
9
Zhang J, Liu L, Xu T, Zhang W, Li J, Rao N, Le TD. Time to infer miRNA sponge modules. WILEY INTERDISCIPLINARY REVIEWS-RNA 2021; 13:e1686. [PMID: 34342388 DOI: 10.1002/wrna.1686] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/06/2021] [Revised: 07/14/2021] [Accepted: 07/14/2021] [Indexed: 01/01/2023]
Abstract
Inferring competing endogenous RNA (ceRNA) or microRNA (miRNA) sponge modules is a challenging and meaningful task for revealing ceRNA regulation mechanism at the module level. Modules in this context refer to groups of miRNA sponges which have mutual competitions and act as functional units for achieving biological processes. The recent development of computational methods based on heterogeneous data provides a novel way to discern the competitive effects of miRNA sponges on human complex diseases. This article aims to provide a comprehensive perspective of miRNA sponge module discovery methods. We first review the publicly available databases of cancer-related miRNA sponges, as the miRNA sponges involved in human cancers contribute to the discovery of cancer-associated modules. Then we review the existing computational methods for inferring miRNA sponge modules. Furthermore, we conduct an assessment on the performance of the module discovery methods with the pan-cancer dataset, and the comparison study indicates that it is useful to infer biologically meaningful miRNA sponge modules by directly mapping heterogeneous data to the competitive modules. Finally, we discuss the future directions and associated challenges in developing in silico methods to infer miRNA sponge modules. This article is categorized under: RNA Interactions with Proteins and Other Molecules > Small Molecule-RNA Interactions Regulatory RNAs/RNAi/Riboswitches > Regulatory RNAs.
Affiliation(s)
- Junpeng Zhang
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; School of Engineering, Dali University, Dali, Yunnan, China
- Lin Liu
- UniSA STEM, University of South Australia, Mawson Lakes, South Australia, Australia
- Taosheng Xu
- Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, Anhui, China
- Wu Zhang
- School of Agriculture and Biological Sciences, Dali University, Dali, Yunnan, China
- Jiuyong Li
- UniSA STEM, University of South Australia, Mawson Lakes, South Australia, Australia
- Nini Rao
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Thuc Duy Le
- UniSA STEM, University of South Australia, Mawson Lakes, South Australia, Australia
10
Preprocessing of Public RNA-Sequencing Datasets to Facilitate Downstream Analyses of Human Diseases. DATA 2021. [DOI: 10.3390/data6070075] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022] Open
Abstract
Publicly available RNA-sequencing (RNA-seq) data are a rich resource for elucidating the mechanisms of human disease; however, preprocessing these data requires considerable bioinformatic expertise and computational infrastructure. Analyzing multiple datasets with a consistent computational workflow increases the accuracy of downstream meta-analyses. This collection of datasets represents the human intracellular transcriptional response to disorders and diseases such as acute lymphoblastic leukemia (ALL), B-cell lymphomas, chronic obstructive pulmonary disease (COPD), colorectal cancer, lupus erythematosus; as well as infection with pathogens including Borrelia burgdorferi, hantavirus, influenza A virus, Middle East respiratory syndrome coronavirus (MERS-CoV), Streptococcus pneumoniae, respiratory syncytial virus (RSV), severe acute respiratory syndrome coronavirus (SARS-CoV), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We calculated the statistically significant differentially expressed genes and Gene Ontology terms for all datasets. In addition, a subset of the datasets also includes results from splice variant analyses, intracellular signaling pathway enrichments as well as read mapping and quantification. All analyses were performed using well-established algorithms and are provided to facilitate future data mining activities, wet lab studies, and to accelerate collaboration and discovery.
11
Frusque G, Borgnat P, Gonçalves P, Jung J. Semi-automatic Extraction of Functional Dynamic Networks Describing Patient's Epileptic Seizures. Front Neurol 2020; 11:579725. [PMID: 33362688 PMCID: PMC7759641 DOI: 10.3389/fneur.2020.579725] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/03/2020] [Accepted: 09/08/2020] [Indexed: 11/24/2022] Open
Abstract
Intracranial electroencephalography (EEG) studies using stereotactic EEG (SEEG) have shown that during seizures, epileptic activity spreads across several anatomical regions from the seizure onset zone toward remote brain areas. A full and objective characterization of this patient-specific time-varying network is crucial for optimal surgical treatment. Functional connectivity (FC) analysis of SEEG signals recorded during seizures makes it possible to describe the statistical relations between all pairs of recorded signals. However, extracting meaningful information from those large datasets is time-consuming and requires considerable expertise. In the present study, we first propose a novel method named Brain-wide Time-varying Network Decomposition (BTND) to characterize the dynamic epileptogenic networks activated during seizures in individual patients recorded with SEEG electrodes. The method provides a number of pathological FC subgraphs with their temporal course of activation. The method can be applied to several seizures of the same patient to extract reproducible subgraphs. Second, we compare the activated subgraphs obtained by the BTND method with visual interpretation of SEEG signals recorded in 27 seizures from nine different patients. Overall, we found that activated subgraphs corresponded to brain regions involved during the course of the seizures and that their time course was highly consistent with classical visual interpretation. We believe that the proposed method can complement the visual analysis of SEEG signals recorded during seizures by highlighting and characterizing the most significant parts of epileptic networks with their activation dynamics.
Affiliation(s)
- Gaëtan Frusque
- Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, Lyon, France
| | - Pierre Borgnat
- Univ Lyon, CNRS, ENS de Lyon, UCB Lyon 1, Laboratoire de Physique, UMR 5672, Lyon, France
| | - Paulo Gonçalves
- Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, Lyon, France
| | - Julien Jung
- National Institute of Health and Medical Research U1028/National Center for Scientific Research, Mixed Unit of Research 5292, Lyon Neuroscience Research Center, Lyon, France.,Department of Functional Neurology and Epileptology, Member of the ERN EpiCARE Lyon University Hospital and Lyon 1 University, Lyon, France
|
12
|
Application of Multiblock Analysis on Small Metabolomic Multi-Tissue Dataset. Metabolites 2020; 10:metabo10070295. [PMID: 32709053 PMCID: PMC7407932 DOI: 10.3390/metabo10070295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2020] [Revised: 07/14/2020] [Accepted: 07/15/2020] [Indexed: 11/16/2022] Open
Abstract
Data integration has been proven to provide valuable information. The information extracted using data integration in the form of multiblock analysis can pinpoint both common and unique trends in the different blocks. When working with small multiblock datasets the number of possible integration methods is drastically reduced. To investigate the application of multiblock analysis in cases where one has a small number of samples and a lack of statistical power, we studied a small metabolomic multiblock dataset containing six blocks (i.e., tissue types), only including common metabolites. We used a single model multiblock analysis method called the joint and unique multiblock analysis (JUMBA) and compared it to a commonly used method, concatenated principal component analysis (PCA). These methods were used to detect trends in the dataset and identify underlying factors responsible for metabolic variations. Using JUMBA, we were able to interpret the extracted components and link them to relevant biological properties. JUMBA shows how the observations are related to one another, the stability of these relationships, and to what extent each of the blocks contributes to the components. These results indicate that multiblock methods can be useful even with a small number of samples.
|
13
|
Hu J, Zeng T, Xia Q, Huang L, Zhang Y, Zhang C, Zeng Y, Liu H, Zhang S, Huang G, Wan W, Ding Y, Hu F, Yang C, Chen L, Wang W. Identification of Key Genes for the Ultrahigh Yield of Rice Using Dynamic Cross-tissue Network Analysis. GENOMICS, PROTEOMICS & BIOINFORMATICS 2020; 18:256-270. [PMID: 32736037 PMCID: PMC7801251 DOI: 10.1016/j.gpb.2019.11.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 10/19/2018] [Revised: 08/26/2019] [Accepted: 11/08/2019] [Indexed: 11/29/2022]
Abstract
Significantly increasing crop yield is a major and worldwide challenge for food supply and security. It is well-known that rice cultivated at Taoyuan in Yunnan of China can produce the highest yield worldwide. Yet, the gene regulatory mechanism underpinning this ultrahigh yield has been a mystery. Here, we systematically collected the transcriptome data for seven key tissues at different developmental stages using rice cultivated both at Taoyuan as the case group and at another regular rice planting place, Jinghong, as the control group. We identified the top 24 candidate high-yield genes with their network modules from these well-designed datasets by developing a novel computational systems biology method, i.e., dynamic cross-tissue (DCT) network analysis. We used one of the candidate genes, OsSPL4, whose function was previously unknown, for gene editing experimental validation of the high yield, and confirmed that OsSPL4 significantly affects panicle branching and increases the rice yield. This study, which included extensive field phenotyping, cross-tissue systems biology analyses, and functional validation, uncovered the key genes and gene regulatory networks underpinning the ultrahigh yield of rice. The DCT method could be applied to other plant or animal systems if different phenotypes are observed under various environments with common genome sequences across the examined samples. DCT can be downloaded from https://github.com/ztpub/DCT.
Affiliation(s)
- Jihong Hu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan 430072, China
| | - Tao Zeng
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Institute of Brain-Intelligence Technology, Zhangjiang Laboratory, Shanghai 201210, China
| | - Qiongmei Xia
- Institute of Food Crop of Yunnan Academy of Agricultural Sciences, Kunming 650205, China
| | - Liyu Huang
- School of Agriculture, Yunnan University, Kunming 650500, China
| | - Yesheng Zhang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; BGI-Baoshan, Baoshan 678004, China
| | - Chuanchao Zhang
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Yan Zeng
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Hui Liu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China
| | - Shilai Zhang
- School of Agriculture, Yunnan University, Kunming 650500, China
| | - Guangfu Huang
- School of Agriculture, Yunnan University, Kunming 650500, China
| | - Wenting Wan
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi'an 710072, China
| | - Yi Ding
- State Key Laboratory of Hybrid Rice, College of Life Sciences, Wuhan University, Wuhan 430072, China
| | - Fengyi Hu
- School of Agriculture, Yunnan University, Kunming 650500, China.
| | - Congdang Yang
- Institute of Food Crop of Yunnan Academy of Agricultural Sciences, Kunming 650205, China.
| | - Luonan Chen
- CAS Key Laboratory of Systems Biology, Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China; Institute of Brain-Intelligence Technology, Zhangjiang Laboratory, Shanghai 201210, China; School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China.
| | - Wen Wang
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China; Center for Ecological and Environmental Sciences, Northwestern Polytechnical University, Xi'an 710072, China.
|
14
|
Wang J, Yang Z, Domeniconi C, Zhang X, Yu G. Cooperative driver pathway discovery via fusion of multi-relational data of genes, miRNAs and pathways. Brief Bioinform 2020; 22:1984-1999. [PMID: 32103253 DOI: 10.1093/bib/bbz167] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/04/2019] [Revised: 12/13/2019] [Accepted: 12/29/2019] [Indexed: 12/19/2022] Open
Abstract
Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focuses on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer. In this study, we proposed a new approach called CDPathway to discover cooperative driver pathways. First, CDPathway introduces a driver impact quantification function to quantify the driver weight of each gene. CDPathway assumes that genes with larger weights contribute more to the occurrence of the target disease and identifies them as candidate driver genes. Next, it constructs a heterogeneous network composed of gene, miRNA and pathway nodes based on the known intra(inter)-relations between them and assigns the quantified driver weights to gene-pathway and gene-miRNA relational edges. To transfer driver impacts of genes to pathway interaction pairs, CDPathway collaboratively factorizes the weighted adjacency matrices of the heterogeneous network to explore the latent relations between genes, miRNAs and pathways. After this, it reconstructs the pathway interaction network and identifies the pathway pairs with maximal interactive and driver weights as cooperative driver pathways. Experimental results on the breast, uterine corpus endometrial carcinoma and ovarian cancer data from The Cancer Genome Atlas show that CDPathway can effectively identify candidate driver genes [area under the receiver operating characteristic curve (AUROC) of ≥0.9] and reconstruct the pathway interaction network (AUROC of >0.9), and it uncovers many more known (potential) driver genes than other competitive methods. In addition, CDPathway identifies 150% more driver pathways and 60% more potential cooperative driver pathways than the competing methods.
The code of CDPathway is available at http://mlda.swu.edu.cn/codes.php?name=CDPathway.
Affiliation(s)
- Jun Wang
- Professor of the School of Software, Shandong University
| | - Ziying Yang
- Professor of the School of Software, Shandong University
| | | | - Xiangliang Zhang
- Computational Bioscience Research Center (CBRC), Computer Science, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, SA
| | - Guoxian Yu
- Computational Bioscience Research Center (CBRC), Computer Science, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, SA.,Professor of the School of Software, Shandong University and Computational Bioscience Research Center
|
15
|
Luo Y, Mao C, Yang Y, Wang F, Ahmad FS, Arnett D, Irvin MR, Shah SJ. Integrating hypertension phenotype and genotype with hybrid non-negative matrix factorization. Bioinformatics 2020; 35:1395-1403. [PMID: 30239588 DOI: 10.1093/bioinformatics/bty804] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/08/2018] [Revised: 08/20/2018] [Accepted: 09/13/2018] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION Hypertension is a heterogeneous syndrome in need of improved subtyping using phenotypic and genetic measurements with the goal of identifying subtypes of patients who share similar pathophysiologic mechanisms and may respond more uniformly to targeted treatments. Existing machine learning approaches often face challenges in integrating phenotype and genotype information and presenting to clinicians an interpretable model. We aim to provide informed patient stratification based on phenotype and genotype features. RESULTS In this article, we present a hybrid non-negative matrix factorization (HNMF) method to integrate phenotype and genotype information for patient stratification. HNMF simultaneously approximates the phenotypic and genetic feature matrices using different appropriate loss functions, and generates patient subtypes, phenotypic groups and genetic groups. Unlike previous methods, HNMF approximates the phenotypic matrix under Frobenius loss, and the genetic matrix under Kullback-Leibler (KL) loss. We propose an alternating projected gradient method to solve the approximation problem. Simulation shows HNMF converges fast and accurately to the true factor matrices. On a real-world clinical dataset, we used the patient factor matrix as features and examined the association of these features with indices of cardiac mechanics. We compared HNMF with six different models using phenotype or genotype features alone, with or without NMF, or using joint NMF with only one type of loss. We also compared HNMF with 3 recently published methods for integrative clustering analysis, including iClusterBayes, Bayesian joint analysis and JIVE. HNMF significantly outperforms all comparison models. HNMF also reveals intuitive phenotype-genotype interactions that characterize cardiac abnormalities. AVAILABILITY AND IMPLEMENTATION Our code is publicly available on GitHub at https://github.com/yuanluo/hnmf.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
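Illustrative note on the hybrid loss described in the abstract above: the sketch below evaluates an objective in which one shared patient factor reconstructs the phenotype matrix under squared Frobenius loss and the genotype matrix under a generalized KL divergence. It is not the authors' implementation. The names `U`, `V_pheno`, `V_geno`, the `weight` trade-off parameter, and the particular generalized-KL form are assumptions made for illustration; the published method may define and balance the two terms differently.

```python
import math

def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frobenius_loss(X, Xhat):
    """Squared Frobenius distance between X and its reconstruction."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Xhat) for x, y in zip(rx, ry))

def generalized_kl_loss(X, Xhat, eps=1e-12):
    """Generalized KL divergence D(X || Xhat) for non-negative matrices."""
    total = 0.0
    for rx, ry in zip(X, Xhat):
        for x, y in zip(rx, ry):
            if x > 0:  # the term x*log(x/y) is taken as 0 when x == 0
                total += x * math.log(x / (y + eps))
            total += y - x
    return total

def hybrid_objective(X_pheno, X_geno, U, V_pheno, V_geno, weight=1.0):
    """Shared patient factor U (patients x k) reconstructs both views.

    `weight` is a hypothetical knob balancing the two loss terms.
    """
    return (frobenius_loss(X_pheno, matmul(U, V_pheno))
            + weight * generalized_kl_loss(X_geno, matmul(U, V_geno)))
```

An optimizer such as the alternating projected gradient method mentioned in the abstract would minimize this quantity over non-negative `U`, `V_pheno` and `V_geno`; the sketch only evaluates it.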
Affiliation(s)
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Chengsheng Mao
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Yiben Yang
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Fei Wang
- Department of Healthcare Policy & Research, Weill Cornell Medicine, Cornell University New York, NY, USA
| | - Faraz S Ahmad
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Donna Arnett
- Department of Epidemiology, College of Public Health, University of Kentucky, Lexington, KY, USA
| | - Marguerite R Irvin
- Department of Epidemiology, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Sanjiv J Shah
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
|
16
|
Wang Y, Yu G, Wang J, Fu G, Guo M, Domeniconi C. Weighted matrix factorization on multi-relational data for LncRNA-disease association prediction. Methods 2020; 173:32-43. [DOI: 10.1016/j.ymeth.2019.06.015] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 02/28/2019] [Revised: 06/01/2019] [Accepted: 06/13/2019] [Indexed: 02/07/2023] Open
|
17
|
Güvenç Paltun B, Mamitsuka H, Kaski S. Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches. Brief Bioinform 2019; 22:346-359. [PMID: 31838491 PMCID: PMC7820853 DOI: 10.1093/bib/bbz153] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 07/09/2019] [Revised: 11/01/2019] [Accepted: 11/04/2019] [Indexed: 12/17/2022] Open
Abstract
Predicting the response of cancer cell lines to specific drugs is one of the central problems in personalized medicine, where the cell lines show diverse characteristics. Researchers have developed a variety of computational methods to discover associations between drugs and cell lines, and improved drug sensitivity analyses by integrating heterogeneous biological data. However, choosing informative data sources and methods that can incorporate multiple sources efficiently is the challenging part of successful analysis in personalized medicine. The reason is that finding decisive factors of cancer and developing methods that can overcome the problems of integrating data, such as differences in data structures and data complexities, are difficult. In this review, we summarize recent advances in data integration-based machine learning for drug response prediction, by categorizing methods as matrix factorization-based, kernel-based and network-based methods. We also present a short description of relevant databases used as a benchmark in drug response prediction analyses, followed by providing a brief discussion of challenges faced in integrating and interpreting data from multiple sources. Finally, we address the advantages of combining multiple heterogeneous data sources on drug sensitivity analysis by showing an experimental comparison. Contact: betul.guvenc@aalto.fi
Affiliation(s)
- Betül Güvenç Paltun
- Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Helsinki, Finland
| | - Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
| | - Samuel Kaski
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
|
18
|
Kim Y, Bismeijer T, Zwart W, Wessels LFA, Vis DJ. Genomic data integration by WON-PARAFAC identifies interpretable factors for predicting drug-sensitivity in vivo. Nat Commun 2019; 10:5034. [PMID: 31695042 PMCID: PMC6834616 DOI: 10.1038/s41467-019-13027-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 07/26/2018] [Accepted: 10/10/2019] [Indexed: 01/20/2023] Open
Abstract
Integrative analyses that summarize and link molecular data to treatment sensitivity are crucial to capture the biological complexity which is essential to further precision medicine. We introduce Weighted Orthogonal Nonnegative parallel factor analysis (WON-PARAFAC), a data integration method that identifies sparse and interpretable factors. WON-PARAFAC summarizes the GDSC1000 cell line compendium in 130 factors. We interpret the factors based on their association with recurrent molecular alterations, pathway enrichment, cancer type, and drug-response. Crucially, the cell line derived factors capture the majority of the relevant biological variation in Patient-Derived Xenograft (PDX) models, strongly suggesting our factors capture invariant and generalizable aspects of cancer biology. Furthermore, drug response in cell lines is better and more consistently translated to PDXs using factor-based predictors as compared to raw feature-based predictors. WON-PARAFAC efficiently summarizes and integrates multiway high-dimensional genomic data and enhances translatability of drug response prediction from cell lines to patient-derived xenografts.
Affiliation(s)
- Yongsoo Kim
- Division of Oncogenomics, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands.,Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands.,Department of Pathology, VU University Medical Center, Amsterdam, The Netherlands
| | - Tycho Bismeijer
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Wilbert Zwart
- Division of Oncogenomics, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands.,Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands.
| | - Lodewyk F A Wessels
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands.,Faculty of EEMCS, Delft University of Technology, Delft, The Netherlands.
| | - Daniel J Vis
- Division of Molecular Carcinogenesis, Oncode Institute, The Netherlands Cancer Institute, Amsterdam, The Netherlands.
|
19
|
Jiang X, Zhang H, Zhang Z, Quan X. Flexible Non-Negative Matrix Factorization to Unravel Disease-Related Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1948-1957. [PMID: 29993985 DOI: 10.1109/tcbb.2018.2823746] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/08/2023]
Abstract
Recently, non-negative matrix factorization (NMF) has been shown to perform well in the analysis of omics data. NMF assumes that the expression level of one gene is a linear additive composition of metagenes. The elements in the metagene matrix represent the regulation effects and are restricted to non-negativity. However, according to the real biological meaning, there are two kinds of regulation effects, i.e., up-regulation and down-regulation. Few methods based on NMF have considered this biological meaning. Therefore, we designed a flexible non-negative matrix factorization (FNMF) algorithm by further considering the biological meaning of gene expression data. It allows negative numbers in the metagene matrix, and negative numbers represent down-regulation effects. We separated gene expression data into disease-driven gene expression and background gene expression. Subsequently, we computed disease-driven gene relative expression, and a ranked list of genes was obtained. The top ranked genes are considered to be involved in some disease-related biological processes. Experimental results on two real-world gene expression datasets demonstrate the feasibility and effectiveness of FNMF. Compared with conventional disease-related gene identification algorithms, FNMF has superior performance in analyzing gene expression data of diseases with complex pathology.
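Illustrative note on the baseline the abstract above starts from: the sketch below is standard NMF with Lee-Seung multiplicative updates for the Frobenius loss, in which every factor entry stays non-negative. It is not FNMF, which according to the abstract relaxes the non-negativity of the metagene matrix. The matrix names `W` and `H`, the genes-by-samples orientation, and the iteration count are assumptions chosen for illustration.

```python
import random

def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor non-negative X (genes x samples) as W @ H with W, H >= 0.

    Each row of X is then an additive combination of the `rank` rows of H,
    weighted by the non-negative entries of the matching row of W.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H

def frobenius_error(X, W, H):
    """Frobenius norm of the reconstruction residual X - W @ H."""
    R = matmul(W, H)
    return sum((x - r) ** 2 for rx, rr in zip(X, R) for x, r in zip(rx, rr)) ** 0.5
```

Because the updates only multiply by non-negative ratios, factors initialized positive can never become negative, which is exactly the restriction FNMF is described as lifting.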
|
20
|
Fu G, Wang J, Domeniconi C, Yu G. Matrix factorization-based data fusion for the prediction of lncRNA-disease associations. Bioinformatics 2019; 34:1529-1537. [PMID: 29228285 DOI: 10.1093/bioinformatics/btx794] [Citation(s) in RCA: 126] [Impact Index Per Article: 25.2] [Received: 09/02/2017] [Accepted: 12/05/2017] [Indexed: 12/21/2022] Open
Abstract
Motivation Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be. Results To accurately identify lncRNA-disease associations, we propose a Matrix Factorization based LncRNA-Disease Association prediction model (MFLDA in short). MFLDA decomposes data matrices of heterogeneous data sources into low-rank matrices via matrix tri-factorization to explore and exploit their intrinsic and shared structure. MFLDA can select and integrate the data sources by assigning different weights to them. An iterative solution is further introduced to simultaneously optimize the weights and low-rank matrices. Next, MFLDA uses the optimized low-rank matrices to reconstruct the lncRNA-disease association matrix and thus to identify potential associations. In 5-fold cross validation experiments to identify verified lncRNA-disease associations, MFLDA achieves an area under the receiver operating characteristic curve (AUC) of 0.7408, at least 3% higher than those given by state-of-the-art data fusion based computational models. An empirical study on identifying masked lncRNA-disease associations again shows that MFLDA can identify potential associations more accurately than competing models. A case study on identifying lncRNAs associated with breast, lung and stomach cancers shows that 38 out of 45 (84%) associations predicted by MFLDA are supported by recent biomedical literature and further proves the capability of MFLDA in identifying novel lncRNA-disease associations.
MFLDA is a general data fusion framework, and as such it can be adopted to predict associations between other biological entities. Availability and implementation The source code for MFLDA is available at: http://mlda.swu.edu.cn/codes.php?name=MFLDA. Contact gxyu@swu.edu.cn. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guangyuan Fu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
| | - Carlotta Domeniconi
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing 400715, China
|
21
|
Siangphoe U, Archer KJ, Mukhopadhyay ND. Classical and Bayesian random-effects meta-analysis models with sample quality weights in gene expression studies. BMC Bioinformatics 2019; 20:18. [PMID: 30626315 PMCID: PMC6327440 DOI: 10.1186/s12859-018-2491-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/28/2018] [Accepted: 11/12/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Random-effects (RE) models are commonly applied to account for heterogeneity in effect sizes in gene expression meta-analysis. The degree of heterogeneity may differ due to inconsistencies in sample quality. High heterogeneity can arise in meta-analyses containing poor quality samples. We applied sample-quality weights to adjust the study heterogeneity in the DerSimonian and Laird (DSL) and two-step DSL (DSLR2) RE models and the Bayesian random-effects (BRE) models with unweighted and weighted data, Gibbs and Metropolis-Hastings (MH) sampling algorithms, weighted common effect, and weighted between-study variance. We evaluated the performance of the models through simulations and illustrated application of the methods using Alzheimer's gene expression datasets. RESULTS Sample quality adjusting within study variance (wP6) models provided an appropriate reduction of differentially expressed (DE) genes compared to other weighted functions in classical RE models. The BRE model with a uniform(0,1) prior was appropriate for detecting DE genes as compared to the models with other prior distributions. The precision of DE gene detection in the heterogeneous data was increased with the DSLR2wP6 weighted model compared to the DSLwP6 weighted model. Among the BRE weighted models, the wP6 weighted- and unweighted-data models and both Gibbs- and MH-based models performed similarly. The wP6 weighted common-effect model performed similarly to the unweighted model in the homogeneous data, but performed worse in the heterogeneous data. The wP6 weighted data were appropriate for detecting DE genes with high precision, while the wP6 weighted between-study variance models were appropriate for detecting DE genes with high overall accuracy. Without the weight, when the number of genes in microarray increased, the DSLR2 performed stably, while the overall accuracy of the BRE model was reduced.
When applying the weighted models in the Alzheimer's gene expression data, the number of DE genes decreased in all metadata sets with the DSLR2wP6 weighted and the wP6 weighted between-study variance models. Four hundred and forty-six DE genes identified by the wP6 weighted between-study variance model could be potentially down-regulated genes that may contribute to good classification of Alzheimer's samples. CONCLUSIONS The application of sample quality weights can increase precision and accuracy of the classical RE and BRE models; however, the performance of the models varied depending on data features, levels of sample quality, and adjustment of parameter estimates.
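Illustrative note on the classical model named in the abstract above: the sketch below is the standard, unweighted DerSimonian and Laird random-effects estimator for a single gene. It does not include the sample-quality weights (such as wP6) or the two-step DSLR2 variant studied in the paper, whose definitions are not given in the abstract. The function name and argument names are hypothetical.

```python
def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with the DerSimonian-Laird estimator.

    effects:   per-study effect estimates y_i (for example, for one gene)
    variances: per-study within-study variances v_i (all > 0)
    Returns (pooled_effect, pooled_variance, tau2), where tau2 is the
    estimated between-study variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Method-of-moments estimate, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    return pooled, 1.0 / sw_star, tau2
```

When the studies agree, `tau2` is truncated to zero and the result coincides with the fixed-effect pooled estimate; larger heterogeneity inflates `tau2` and flattens the weights across studies.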
Affiliation(s)
- Uma Siangphoe
- Office of Biostatistics, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, Maryland USA
| | - Kellie J. Archer
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, Ohio USA
| | - Nitai D. Mukhopadhyay
- Department of Biostatistics, Virginia Commonwealth University, Richmond, Virginia USA
|
22
|
Yu N, Gao YL, Liu JX, Shang J, Zhu R, Dai LY. Co-differential Gene Selection and Clustering Based on Graph Regularized Multi-View NMF in Cancer Genomic Data. Genes (Basel) 2018; 9:E586. [PMID: 30487464 PMCID: PMC6315625 DOI: 10.3390/genes9120586] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 10/25/2018] [Revised: 11/13/2018] [Accepted: 11/26/2018] [Indexed: 12/19/2022] Open
Abstract
Cancer genomic data contain views from different sources that provide complementary information about genetic activity. This provides a new avenue for cancer research. Feature selection and multi-view clustering are hot topics in bioinformatics, and they can make full use of complementary information to improve performance. In this paper, a novel integrated model called Multi-view Non-negative Matrix Factorization (MvNMF) is proposed for the selection of common differential genes (co-differential genes) and multi-view clustering. In order to encode the geometric information in the multi-view genomic data, graph regularized MvNMF (GMvNMF) is further proposed by applying the graph regularization constraint in the objective function. GMvNMF can not only obtain the potential shared feature structure and shared cluster group structure, but also capture the manifold structure of multi-view data. The validity of the proposed GMvNMF method was tested on four multi-view genomic datasets. Experimental results showed that the GMvNMF method has better performance than other representative methods.
Affiliation(s)
- Na Yu
- School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China.
| | - Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao 276826, China.
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China.
| | - Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China.
| | - Rong Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China.
| | - Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China.
|
23
|
Xie XP, Xie YF, Liu YT, Wang HQ. Adaptively capturing the heterogeneity of expression for cancer biomarker identification. BMC Bioinformatics 2018; 19:401. [PMID: 30390627 PMCID: PMC6215657 DOI: 10.1186/s12859-018-2437-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/23/2018] [Accepted: 10/15/2018] [Indexed: 11/25/2022] Open
Abstract
Background Identifying cancer biomarkers from transcriptomics data is of importance to cancer research. However, transcriptomics data are often complex and heterogeneous, which complicates the identification of cancer biomarkers in practice. Currently, the heterogeneity still remains a challenge for detecting subtle but consistent changes of gene expression in cancer cells. Results In this paper, we propose to adaptively capture the heterogeneity of expression across samples in a gene regulation space instead of in a gene expression space. Specifically, we transform gene expression profiles into gene regulation profiles and mathematically formulate gene regulation probabilities (GRPs)-based statistics for characterizing differential expression of genes between tumor and normal tissues. Finally, an unbiased estimator (aGRP) of GRPs is devised that can interrogate and adaptively capture the heterogeneity of gene expression. We also derived an asymptotic significance analysis procedure for the new statistic. Since no parameter needs to be preset, aGRP is easy to use for researchers without a computer programming background. We evaluated the proposed method on both simulated data and real-world data and compared it with previous methods. Experimental results demonstrated the superior performance of the proposed method in exploring the heterogeneity of expression for capturing subtle but consistent alterations of gene expression in cancer. Conclusions Expression heterogeneity largely influences the performance of cancer biomarker identification from transcriptomics data. Models are needed that efficiently deal with the expression heterogeneity. The proposed method can be a standalone tool due to its capacity of adaptively capturing the sample heterogeneity and its simplicity of use. Software availability The source code of aGRP can be downloaded from https://github.com/hqwang126/aGRP.
Electronic supplementary material The online version of this article (10.1186/s12859-018-2437-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xin-Ping Xie
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, 230022, Anhui, China
- Yu-Feng Xie
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, 230022, Anhui, China
  - Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, 350 Shushanhu Road, P.O. Box 1130, Hefei, 230031, Anhui, China
  - Present address: School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710100, China
- Yi-Tong Liu
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, 230022, Anhui, China
  - Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, 350 Shushanhu Road, P.O. Box 1130, Hefei, 230031, Anhui, China
- Hong-Qiang Wang
  - Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, 350 Shushanhu Road, P.O. Box 1130, Hefei, 230031, Anhui, China
|
24
|
Guo X, Jiang X, Xu J, Quan X, Wu M, Zhang H. Ensemble Consensus-Guided Unsupervised Feature Selection to Identify Huntington's Disease-Associated Genes. Genes (Basel) 2018; 9:genes9070350. [PMID: 30002337 PMCID: PMC6071299 DOI: 10.3390/genes9070350] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/30/2018] [Revised: 07/06/2018] [Accepted: 07/09/2018] [Indexed: 12/20/2022] Open
Abstract
Due to the complexity of the pathological mechanisms of neurodegenerative diseases, traditional differentially expressed gene selection methods cannot detect disease-associated genes accurately. Recent studies have shown that consensus-guided unsupervised feature selection (CGUFS) performs well in feature selection for identifying disease-associated genes. Because the random initialization of the feature selection matrix in CGUFS makes the final disease-associated gene set unstable, we propose an ensemble method based on CGUFS, namely ensemble consensus-guided unsupervised feature selection (ECGUFS), to further improve the accuracy of disease-associated gene identification and the stability of the feature gene sets. We also propose a bagging integration strategy to integrate the results of CGUFS. We conducted experiments with Huntington's disease RNA sequencing (RNA-Seq) data and obtained a final feature gene set containing 287 disease-associated genes. Enrichment analysis of these genes showed that the postsynaptic density, the postsynaptic membrane, the synapse, and the cell junction are all affected during the disease's progression. ECGUFS also greatly improved the accuracy of disease-associated gene prediction and the stability of the disease-associated gene set. We classified the labeled samples using a linear support vector machine with 10-fold cross-validation. The average accuracy was 0.9, which suggests the effectiveness of the feature gene set.
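The abstract does not spell out how the bagging integration step combines CGUFS runs. As a rough illustration only, one common way to stabilize a randomly initialized selector is to keep the genes that are selected in a large enough share of runs. In the sketch below, `aggregate_selections`, the `min_fraction` threshold, and the placeholder gene labels are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def aggregate_selections(runs, min_fraction=0.5):
    """Combine gene sets from repeated selector runs by selection frequency.

    runs         -- list of iterables, one selected gene set per run
    min_fraction -- hypothetical cutoff: keep genes chosen in at least this
                    share of runs (not a value from the paper)
    Returns (kept_genes, frequency_by_gene).
    """
    if not runs:
        raise ValueError("need at least one run")

    # Count each gene at most once per run.
    counts = Counter(gene for selected in runs for gene in set(selected))
    n_runs = len(runs)
    freq = {gene: count / n_runs for gene, count in counts.items()}

    # Most frequently selected genes first; ties broken by gene label.
    kept = sorted(
        (gene for gene, f in freq.items() if f >= min_fraction),
        key=lambda gene: (-freq[gene], gene),
    )
    return kept, freq


# Three runs of a selector with different random initializations.
runs = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g4"}]
stable_genes, frequencies = aggregate_selections(runs)
```

Raising `min_fraction` trades sensitivity for stability: fewer genes survive, but those that do are less dependent on any single initialization.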
Affiliation(s)
- Xia Guo
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
- Xue Jiang
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
- Jing Xu
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
- Xiongwen Quan
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
- Min Wu
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
- Han Zhang
  - College of Computer and Control Engineering, Nankai University, Tianjin 300350, China
|
25
|
Jiang X, Zhang H, Duan F, Quan X. Identify Huntington's disease associated genes based on restricted Boltzmann machine with RNA-seq data. BMC Bioinformatics 2017; 18:447. [PMID: 29020921 PMCID: PMC5637347 DOI: 10.1186/s12859-017-1859-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 05/30/2017] [Accepted: 10/02/2017] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: Predicting disease-associated genes is helpful for understanding the molecular mechanisms of disease progression. Since the pathological mechanisms of neurodegenerative diseases are very complex, traditional statistics-based methods are not suitable for identifying key genes related to disease development. Recent studies have shown that computational models with deep structure can automatically learn the features of biological data, which is useful for exploring the characteristics of gene expression during disease progression. RESULTS: In this paper, we propose a deep learning approach based on the restricted Boltzmann machine, namely the stacked restricted Boltzmann machine (SRBM), to analyze the RNA-seq data of Huntington's disease. Based on the SRBM, we also design a novel framework to screen the key genes during Huntington's disease development. In this work, we assume that the effects of regulatory factors can be captured by the hierarchical structure and narrow hidden layers of the SRBM. First, we select disease-associated factors with different time period datasets according to the differentially activated neurons in hidden layers. Then, we select disease-associated genes according to the changes of the gene energy in SRBM at different time periods. CONCLUSIONS: The experimental results demonstrate that SRBM can detect the important information for differential analysis of time series gene expression datasets. The identification accuracy of the disease-associated genes is improved to some extent using the novel framework. Moreover, the prediction precision of disease-associated genes for top ranking genes using SRBM is effectively improved compared with that of state-of-the-art methods.
Affiliation(s)
- Xue Jiang
  - College of Computer and Control Engineering, Nankai University, Tongyan Road, Tianjin, 300350, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tongyan Road, Tianjin, 300350, China
- Han Zhang
  - College of Computer and Control Engineering, Nankai University, Tongyan Road, Tianjin, 300350, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tongyan Road, Tianjin, 300350, China
- Feng Duan
  - College of Computer and Control Engineering, Nankai University, Tongyan Road, Tianjin, 300350, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tongyan Road, Tianjin, 300350, China
- Xiongwen Quan
  - College of Computer and Control Engineering, Nankai University, Tongyan Road, Tianjin, 300350, China
  - Tianjin Key Laboratory of Intelligent Robotics, Nankai University, Tongyan Road, Tianjin, 300350, China
|
26
|
Xie XP, Xie YF, Wang HQ. A regulation probability model-based meta-analysis of multiple transcriptomics data sets for cancer biomarker identification. BMC Bioinformatics 2017; 18:375. [PMID: 28830341 PMCID: PMC5568075 DOI: 10.1186/s12859-017-1794-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2017] [Accepted: 08/15/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Large-scale accumulation of omics data poses a pressing challenge of integrative analysis of multiple data sets in bioinformatics. An open question in such integrative analysis is how to pinpoint consistent but subtle gene activity patterns across studies. Study heterogeneity needs to be addressed carefully for this goal. RESULTS: This paper proposes a regulation probability model-based meta-analysis, jGRP, for identifying differentially expressed genes (DEGs). The method integrates multiple transcriptomics data sets in a gene regulatory space instead of in a gene expression space, which makes it easy to capture and manage data heterogeneity across studies from different laboratories or platforms. Specifically, we transform gene expression profiles into a united gene regulation profile across studies by mathematically defining two gene regulation events between two conditions and estimating their probabilities of occurrence in a sample. Finally, a novel differential expression statistic is established based on the gene regulation profiles, realizing accurate and flexible identification of DEGs in gene regulation space. We evaluated the proposed method on simulation data and real-world cancer datasets and showed the effectiveness and efficiency of jGRP in identifying DEGs in the context of meta-analysis. CONCLUSIONS: Data heterogeneity largely influences the performance of meta-analysis for DEG identification. Existing meta-analysis methods were found to exhibit very different degrees of sensitivity to study heterogeneity. The proposed method, jGRP, can be a standalone tool due to its united framework and controllable way of dealing with study heterogeneity.
Affiliation(s)
- Xin-Ping Xie
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, Anhui 230022 China
- Yu-Feng Xie
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, Anhui 230022 China
  - Cancer Hospital, CAS, Hefei, Anhui 230031 China
- Hong-Qiang Wang
  - Cancer Hospital, CAS, Hefei, Anhui 230031 China
  - MICB Lab., Hefei Institutes of Physical Science, CAS, Hefei, 230031 China
|
27
|
Xie XP, Gan B, Yang W, Wang HQ. ctPath: Demixing pathway crosstalk effect from transcriptomics data for differential pathway identification. J Biomed Inform 2017; 73:104-114. [PMID: 28756161 DOI: 10.1016/j.jbi.2017.07.019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/05/2017] [Revised: 07/25/2017] [Accepted: 07/25/2017] [Indexed: 12/17/2022]
Abstract
Identifying differentially expressed pathways (DEPs) plays an important role in understanding tumor etiology and promoting clinical treatment of cancer and other diseases. By assuming gene expression to be a sparse non-negative linear combination of hidden pathway signals, we propose a pathway crosstalk-based transcriptomics data analysis method (ctPath) for identifying differentially expressed pathways. Biologically, pathways of different functions work in concert at the systems level. The proposed method interrogates the crosstalk between pathways and discovers hidden pathway signals by mapping high-dimensional transcriptomics data into a low-dimensional pathway space. The resulting pathway signals reflect the activity level of pathways after removing the pathway crosstalk effect and allow a robust identification of DEPs from inherently complex and noisy transcriptomics data. ctPath can also correct incomplete and inaccurate pathway annotations, which frequently occur in public repositories. Experimental results on both simulation data and real-world cancer data demonstrate the superior performance of ctPath over other popular approaches. R code for ctPath is available for non-commercial use at http://micblab.iim.ac.cn/Download/.
Affiliation(s)
- Xin-Ping Xie
  - School of Mathematics and Physics, Anhui Jianzhu University, Hefei, Anhui, China
- Bin Gan
  - Biological Molecular Information System Lab., Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China
- Wulin Yang
  - Center for Medical Physics and Technology, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China
  - Cancer Hospital, CAS, Hefei, Anhui, China
- Hong-Qiang Wang
  - Biological Molecular Information System Lab., Institute of Intelligent Machines, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China
  - Center for Medical Physics and Technology, Hefei Institutes of Physical Science, CAS, Hefei, Anhui, China
  - Cancer Hospital, CAS, Hefei, Anhui, China
|
28
|
Siangphoe U, Archer KJ. Estimation of random effects and identifying heterogeneous genes in meta-analysis of gene expression studies. Brief Bioinform 2017; 18:602-618. [PMID: 27345525 DOI: 10.1093/bib/bbw050] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/11/2016] [Indexed: 11/12/2022] Open
Abstract
Random-effects meta-analysis models, which combine effect sizes from individual studies, are commonly applied to high-dimensional gene expression data. However, unknown study heterogeneity can arise from inconsistencies in sample quality and experimental conditions. High heterogeneity of effect sizes can reduce the statistical power of the models. In this study, we describe three hypothesis-testing frameworks for meta-analysis of microarray data, and review several existing meta-analytic techniques that have been used in the genomic setting. These include P-value-based methods, rank-based methods and effect-size-based methods. We then discuss limitations of some of these methods and describe random-effects-based methods in detail. We introduce two methods for estimating the inter-study variance in random-effects meta-analytic models and another method for identifying heterogeneous genes in gene expression data. We compared various methods with the standard and existing meta-analytic techniques in the genomic framework. We demonstrate our results through a series of simulations and an application to Alzheimer's gene expression data.
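For orientation, the inter-study variance in a random-effects model is often estimated with the DerSimonian-Laird method of moments. The sketch below shows that standard estimator for a single gene; it is not one of the two estimators introduced in the paper, and the function name and argument layout are illustrative assumptions.

```python
from math import sqrt


def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes for one gene with a random-effects model.

    effects   -- per-study effect sizes (e.g. standardized mean differences)
    variances -- the corresponding within-study sampling variances
    Returns (pooled_effect, standard_error, tau2, q_statistic).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted dispersion of study effects around the fixed mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of the inter-study variance, truncated at 0.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau2 to every within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    return pooled, sqrt(1.0 / sum_w_star), tau2, q
```

A large Q relative to its degrees of freedom (k - 1) yields a positive `tau2`, which widens the standard error of the pooled effect; this is the mechanism by which high heterogeneity reduces statistical power.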
|
29
|
Differentially Coexpressed Disease Gene Identification Based on Gene Coexpression Network. BIOMED RESEARCH INTERNATIONAL 2016; 2016:3962761. [PMID: 28042568 PMCID: PMC5155124 DOI: 10.1155/2016/3962761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/23/2016] [Accepted: 10/26/2016] [Indexed: 11/17/2022]
Abstract
Screening disease-related genes by analyzing gene expression data has become a popular theme. Traditional disease-related gene selection methods focus on identifying differentially expressed genes between case samples and a control group. These traditional methods may not fully consider the changes of interactions between genes at different cell states and the dynamic processes of gene expression levels during disease progression. However, in order to understand the mechanism of disease, it is important to explore the dynamic changes of interactions between genes in biological networks at different cell states. In this study, we designed a novel framework to identify disease-related genes and developed a differentially coexpressed disease-related gene identification method based on gene coexpression network (DCGN) to screen differentially coexpressed genes. We first constructed phase-specific gene coexpression networks using time-series gene expression data and defined the concept of differential coexpression of genes in a coexpression network. Then, we designed two metrics to measure the value of gene differential coexpression according to the change of local topological structures between different phase-specific networks. Finally, we conducted a meta-analysis of gene differential coexpression based on the rank-product method. Experimental results on real-world gene expression data sets demonstrated the feasibility and effectiveness of DCGN and its superior performance over other popular disease-related gene selection methods.
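The rank-product step mentioned at the end of the abstract can be sketched as follows. This is the generic rank-product statistic (the geometric mean of a gene's ranks across several score tables); which tables DCGN actually combines, and how it handles ties, are not stated here, so the input layout and the tie-breaking rule below are assumptions.

```python
def rank_product(score_tables):
    """Combine several per-gene score tables with the rank-product statistic.

    score_tables -- list of dicts mapping gene -> score, where a higher score
                    means stronger differential coexpression (assumed layout)
    Returns a dict gene -> geometric mean of its ranks (smaller = stronger).
    """
    if not score_tables:
        raise ValueError("need at least one score table")

    # Only genes present in every table can be ranked consistently.
    genes = set(score_tables[0])
    for table in score_tables[1:]:
        genes &= set(table)

    products = {gene: 1.0 for gene in genes}
    for table in score_tables:
        # Rank 1 goes to the highest score; ties are broken by gene label.
        ordered = sorted(genes, key=lambda gene: (-table[gene], gene))
        for rank, gene in enumerate(ordered, start=1):
            products[gene] *= rank

    k = len(score_tables)
    return {gene: value ** (1.0 / k) for gene, value in products.items()}


# Two hypothetical metrics scored on the same three genes.
metric_a = {"A": 0.9, "B": 0.5, "C": 0.1}
metric_b = {"A": 0.8, "B": 0.2, "C": 0.4}
combined = rank_product([metric_a, metric_b])
```

Because it works on ranks rather than raw scores, the statistic does not require the two metrics to share a scale.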
|
30
|
Meng C, Zeleznik OA, Thallinger GG, Kuster B, Gholami AM, Culhane AC. Dimension reduction techniques for the integrative analysis of multi-omics data. Brief Bioinform 2016; 17:628-41. [PMID: 26969681 PMCID: PMC4945831 DOI: 10.1093/bib/bbv108] [Citation(s) in RCA: 196] [Impact Index Per Article: 24.5] [Received: 06/08/2015] [Revised: 10/26/2015] [Indexed: 01/16/2023] Open
Abstract
State-of-the-art next-generation sequencing, transcriptomics, proteomics and other high-throughput 'omics' technologies enable the efficient generation of large experimental data sets. These data may yield unprecedented knowledge about molecular pathways in cells and their role in disease. Dimension reduction approaches have been widely used in exploratory analysis of single omics data sets. This review will focus on dimension reduction approaches for simultaneous exploratory analyses of multiple data sets. These methods extract the linear relationships that best explain the correlated structure across data sets, the variability both within and between variables (or observations) and may highlight data issues such as batch effects or outliers. We explore dimension reduction techniques as one of the emerging approaches for data integration, and how these can be applied to increase our understanding of biological systems in normal physiological function and disease.
|
31
|
Jia Z, Zhang X, Guan N, Bo X, Barnes MR, Luo Z. Gene Ranking of RNA-Seq Data via Discriminant Non-Negative Matrix Factorization. PLoS One 2015; 10:e0137782. [PMID: 26348772 PMCID: PMC4562600 DOI: 10.1371/journal.pone.0137782] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 03/03/2015] [Accepted: 07/28/2015] [Indexed: 02/06/2023] Open
Abstract
RNA-sequencing is rapidly becoming the method of choice for studying the full complexity of transcriptomes; however, with increasing dimensionality, accurate gene ranking is becoming increasingly challenging. This paper proposes an accurate and sensitive gene ranking method that implements discriminant non-negative matrix factorization (DNMF) for RNA-seq data. To the best of our knowledge, this is the first work to explore the utility of DNMF for gene ranking. By incorporating Fisher’s discriminant criterion and setting the reduced dimension to two, DNMF learns two factors to approximate the original gene expression data, abstracting the up-regulated and down-regulated metagenes by using the sample label information. The first factor denotes all the genes’ weights of two metagenes as the additive combination of all genes, while the second learned factor represents the expression values of two metagenes. In the gene ranking stage, all the genes are ranked in descending order according to the differential values of the metagene weights. Leveraging the nature of NMF and Fisher’s criterion, DNMF can robustly boost the gene ranking performance. The Area Under the Curve analysis of differential expression analysis on two benchmarking tests of four RNA-seq data sets with similar phenotypes showed that our proposed DNMF-based gene ranking method outperforms other widely used methods. Moreover, the Gene Set Enrichment Analysis also showed that DNMF outperforms the other methods. DNMF is also computationally efficient, substantially outperforming all other benchmarked methods. Consequently, we suggest DNMF is an effective method for the analysis of differential gene expression and gene ranking for RNA-seq data.
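To make the two-factor idea concrete, here is a minimal sketch of plain rank-2 NMF with Lee-Seung multiplicative updates, followed by ranking genes on the gap between their two metagene weights. It deliberately omits the Fisher discriminant term and the use of sample labels, so it is not DNMF itself; the function names, the genes-by-samples layout, and the absolute-difference ranking rule are simplifying assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))


def nmf_rank2(x, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative genes-by-samples matrix X as W (genes x 2) times H (2 x samples)."""
    rng = random.Random(seed)
    n_genes, n_samples = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(2)] for _ in range(n_genes)]
    h = [[rng.random() + 0.1 for _ in range(n_samples)] for _ in range(2)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_samples)]
             for i in range(2)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(2)]
             for i in range(n_genes)]
    return w, h


def rank_genes(w):
    """Gene indices ordered by descending gap between the two metagene weights."""
    return sorted(range(len(w)), key=lambda g: -abs(w[g][0] - w[g][1]))
```

Without label information the two metagenes are interchangeable and unscaled, which is one reason a supervised variant such as DNMF is needed before the ranking can be read as up- versus down-regulation.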
Affiliation(s)
- Zhilong Jia
  - Department of Chemistry and Biology, College of Science, National University of Defense Technology, Changsha, Hunan, P.R. China
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Xiang Zhang
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
- Naiyang Guan
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
- Xiaochen Bo
  - Beijing Institute of Radiation Medicine, Beijing, P.R. China
- Michael R. Barnes
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - * E-mail: (MRB); (ZL)
- Zhigang Luo
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
  - * E-mail: (MRB); (ZL)
|