1
Tappu R, Haas J, Lehmann DH, Sedaghat-Hamedani F, Kayvanpour E, Keller A, Katus HA, Frey N, Meder B. Multi-omics assessment of dilated cardiomyopathy using non-negative matrix factorization. PLoS One 2022; 17:e0272093. [PMID: 35980883 PMCID: PMC9387871 DOI: 10.1371/journal.pone.0272093] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/14/2021] [Accepted: 07/11/2022] [Indexed: 11/19/2022]
Abstract
Dilated cardiomyopathy (DCM), a myocardial disease, is heterogeneous and often results in heart failure and sudden cardiac death. Unavailability of cardiac tissue has hindered the comprehensive exploration of gene regulatory networks and nodal players in DCM. In this study, we carried out an integrated analysis of transcriptome and methylome data from a cohort of DCM patients using non-negative matrix factorization, to uncover underlying latent factors and covarying features between whole-transcriptome and epigenome omics datasets from tissue biopsies of living patients. DNA methylation data from Infinium HM450 and mRNA Illumina sequencing of n = 33 DCM and n = 24 control probands were filtered, analyzed and used as input for matrix factorization using the R NMF package. A Mann-Whitney U test showed that 4 out of 5 latent factors differ significantly between DCM and control probands (P<0.05). Characterization of the top 10% of features driving each latent factor showed a significant enrichment of biological processes known to be involved in DCM pathogenesis, including immune response (P = 3.97E-21), nucleic acid binding (P = 1.42E-18), extracellular matrix (P = 9.23E-14) and myofibrillar structure (P = 8.46E-12). Correlation network analysis revealed interactions of important sarcomeric genes like Nebulin, Tropomyosin alpha-3 and ERC-protein 2 with CpG methylation of ATPase Phospholipid Transporting 11A0, Solute Carrier Family 12 Member 7 and Leucine Rich Repeat Containing 14B, all with significant P values associated with correlation coefficients >0.7. Using matrix factorization, multi-omics data derived from human tissue samples can be integrated and novel interactions can be identified. The hypothesis-generating nature of such analysis could help to better understand the pathophysiology of complex traits such as DCM.
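The workflow summarized in this abstract (factorize, test each latent factor between groups, pick the top 10% of features per factor) can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the study used the R NMF package, whereas this sketch swaps in Python with a plain multiplicative-update NMF. The data are random stand-ins, the feature counts are hypothetical, and stacking the two omics matrices row-wise is an assumption about how the inputs might be combined.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative-update NMF for V ~ W @ H under a Frobenius loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Random stand-in data; 200 and 300 features are hypothetical sizes.
rng = np.random.default_rng(1)
n_dcm, n_ctrl, rank = 33, 24, 5
expr = rng.random((200, n_dcm + n_ctrl))  # expression features x samples
meth = rng.random((300, n_dcm + n_ctrl))  # methylation features x samples

# Assumption: both non-negative matrices are stacked over shared samples.
V = np.vstack([expr, meth])
W, H = nmf(V, rank)

# Compare each latent factor's sample scores between the two groups.
is_dcm = np.arange(n_dcm + n_ctrl) < n_dcm
p_values = [
    mannwhitneyu(H[k, is_dcm], H[k, ~is_dcm], alternative="two-sided").pvalue
    for k in range(rank)
]

# Indices of the top 10% of features by weight for each latent factor.
n_top = int(0.1 * W.shape[0])
top_features = {k: np.argsort(W[:, k])[::-1][:n_top] for k in range(rank)}
```

With random input no factor is expected to separate the groups; the point is only the shape of the analysis.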
Affiliation(s)
- Rewati Tappu
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Jan Haas
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- David H. Lehmann
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
- Farbod Sedaghat-Hamedani
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Elham Kayvanpour
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Andreas Keller
  - Department of Clinical Bioinformatics, Medical Faculty, Saarland University, Saarbrücken, Germany
- Hugo A. Katus
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Norbert Frey
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
- Benjamin Meder
  - Institute for Cardiomyopathies Heidelberg (ICH), Heart Center Heidelberg, University of Heidelberg, Heidelberg, Germany
  - DZHK (German Center for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Mannheim, Germany
  - Department of Genetics, Stanford University School of Medicine, Palo Alto, California, United States of America
2
Abstract
Background: Matrix factorization is a well-established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space, enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining. Results: We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100 times faster than its single-processor counterpart. Conclusions: A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.
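The equivalence between block-wise and serial tri-factorization claimed in this abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the common unconstrained multiplicative updates for X ≈ U S Vᵀ under a Frobenius loss, and the block boundaries and the function names `serial_step` and `blockwise_step` are hypothetical. Each product in the update splits into sums over disjoint submatrices, so the per-block work could be farmed out to separate workers.

```python
import numpy as np

EPS = 1e-9


def serial_step(X, U, S, V):
    """One round of multiplicative updates for X ~ U @ S @ V.T."""
    U = U * (X @ V @ S.T) / (U @ S @ (V.T @ V) @ S.T + EPS)
    V = V * (X.T @ U @ S) / (V @ S.T @ (U.T @ U) @ S + EPS)
    S = S * (U.T @ X @ V) / ((U.T @ U) @ S @ (V.T @ V) + EPS)
    return U, S, V


def blockwise_step(Xb, Ub, S, Vb):
    """Same round, computed from blocks Xb[i][j], Ub[i] and Vb[j]."""
    I, J = len(Ub), len(Vb)
    VtV = sum(Vj.T @ Vj for Vj in Vb)  # small k2 x k2 reduction
    Ub = [
        Ub[i] * (sum(Xb[i][j] @ Vb[j] for j in range(J)) @ S.T)
        / (Ub[i] @ S @ VtV @ S.T + EPS)
        for i in range(I)
    ]
    UtU = sum(Ui.T @ Ui for Ui in Ub)  # small k1 x k1 reduction
    Vb = [
        Vb[j] * (sum(Xb[i][j].T @ Ub[i] for i in range(I)) @ S)
        / (Vb[j] @ S.T @ UtU @ S + EPS)
        for j in range(J)
    ]
    VtV = sum(Vj.T @ Vj for Vj in Vb)
    UtXV = sum(Ub[i].T @ Xb[i][j] @ Vb[j] for i in range(I) for j in range(J))
    S = S * UtXV / (UtU @ S @ VtV + EPS)
    return Ub, S, Vb


rng = np.random.default_rng(0)
X = rng.random((60, 40))
U, S, V = rng.random((60, 4)), rng.random((4, 3)), rng.random((40, 3))

# Hypothetical partition into a 3 x 2 grid of disjoint submatrices.
rows, cols = [0, 20, 45, 60], [0, 15, 40]
row_spans, col_spans = list(zip(rows, rows[1:])), list(zip(cols, cols[1:]))
Xb = [[X[r0:r1, c0:c1] for c0, c1 in col_spans] for r0, r1 in row_spans]
Ub = [U[r0:r1] for r0, r1 in row_spans]
Vb = [V[c0:c1] for c0, c1 in col_spans]

U1, S1, V1 = serial_step(X, U, S, V)
Ub1, S1b, Vb1 = blockwise_step(Xb, Ub, S, Vb)
same = (
    np.allclose(U1, np.vstack(Ub1))
    and np.allclose(S1, S1b)
    and np.allclose(V1, np.vstack(Vb1))
)
```

In this sketch only the small factor-sized matrices (`VtV`, `UtU`, `UtXV`) need to be reduced across blocks. That is one plausible reason a block-wise scheme parallelizes well, though the communication pattern of the published system is not described in the abstract.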
3
Auerbach SS. In vivo Signatures of Genotoxic and Non-genotoxic Chemicals. Toxicogenomics in Predictive Carcinogenicity 2016. [DOI: 10.1039/9781782624059-00113] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/28/2023]
Abstract
This chapter reviews the findings from a broad array of in vivo genomic studies with the goal of identifying a general signature of genotoxicity (GSG) that is indicative of exposure to genotoxic agents (i.e. agents that are active in the bacterial mutagenesis test and/or the in vivo micronucleus test). While the GSG has largely emerged from systematic studies of rat and mouse liver, its response is evident across a broad collection of genotoxic treatments that cover a variety of tissues and species. Pathway-based characterization of the GSG indicates that it is enriched with genes that are regulated by p53. In addition to the GSG, another pan-tissue signature related to bone marrow suppression (a common effect of genotoxic agent exposure) is reviewed. Overall, these signatures are quite effective in identifying genotoxic agents; however, there are situations where false positive findings can occur, for example when necrotizing doses of non-genotoxic soft electrophiles (e.g. thioacetamide) are used. For this reason, specific suggestions for best practices in the creation and application of in vivo genomic signatures are reviewed.
Affiliation(s)
- Scott S. Auerbach
  - Toxicoinformatic Group, Biomolecular Screening Branch, Division of the National Toxicology Program, National Institute of Environmental Health Sciences, PO Box 12233, MD K2-17, Research Triangle Park, NC 27709, USA
4
Zhang X, Guan N, Jia Z, Qiu X, Luo Z. Semi-Supervised Projective Non-Negative Matrix Factorization for Cancer Classification. PLoS One 2015; 10:e0138814. [PMID: 26394323 PMCID: PMC4579132 DOI: 10.1371/journal.pone.0138814] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 03/04/2015] [Accepted: 09/03/2015] [Indexed: 01/23/2023]
Abstract
Advances in DNA microarray technologies have made gene expression profiles a significant candidate in identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples in the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification for two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.
Affiliation(s)
- Xiang Zhang
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
- Naiyang Guan
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
  - * E-mail: (NG); (ZL)
- Zhilong Jia
  - Department of Chemistry and Biology, College of Science, National University of Defense Technology, Changsha, Hunan, China
- Xiaogang Qiu
  - College of Information System and Management, National University of Defense Technology, Changsha, Hunan, 410073 China
- Zhigang Luo
  - College of Computer, National University of Defense Technology, Changsha 410073, China
  - National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China
  - * E-mail: (NG); (ZL)
5
Jia Z, Zhang X, Guan N, Bo X, Barnes MR, Luo Z. Gene Ranking of RNA-Seq Data via Discriminant Non-Negative Matrix Factorization. PLoS One 2015; 10:e0137782. [PMID: 26348772 PMCID: PMC4562600 DOI: 10.1371/journal.pone.0137782] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 03/03/2015] [Accepted: 07/28/2015] [Indexed: 02/06/2023]
Abstract
RNA-sequencing is rapidly becoming the method of choice for studying the full complexity of transcriptomes; however, with increasing dimensionality, accurate gene ranking is becoming increasingly challenging. This paper proposes an accurate and sensitive gene ranking method that implements discriminant non-negative matrix factorization (DNMF) for RNA-seq data. To the best of our knowledge, this is the first work to explore the utility of DNMF for gene ranking. By incorporating Fisher's discriminant criterion and setting the reduced dimension to two, DNMF learns two factors to approximate the original gene expression data, abstracting the up-regulated or down-regulated metagene by using the sample label information. The first factor denotes all the genes' weights of two metagenes as the additive combination of all genes, while the second learned factor represents the expression values of two metagenes. In the gene ranking stage, all the genes are ranked in descending order according to the differential values of the metagene weights. Leveraging the nature of NMF and Fisher's criterion, DNMF can robustly boost the gene ranking performance. The Area Under the Curve analysis of differential expression analysis on two benchmarking tests of four RNA-seq data sets with similar phenotypes showed that our proposed DNMF-based gene ranking method outperforms other widely used methods. Moreover, the Gene Set Enrichment Analysis also showed that DNMF outperforms the others. DNMF is also computationally efficient, substantially outperforming all other benchmarked methods. Consequently, we suggest DNMF is an effective method for the analysis of differential gene expression and gene ranking for RNA-seq data.
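The ranking step described in this abstract (two metagenes, genes ordered by the difference of their two weights) can be sketched as below. This is an illustration under loud assumptions. It uses plain rank-2 NMF as a stand-in and omits the Fisher discriminant term, so the two metagenes are not tied to sample labels as they would be in DNMF. The count matrix is random and its size is hypothetical.

```python
import numpy as np


def nmf(V, rank, n_iter=300, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF (stand-in, no discriminant term)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Random stand-in expression matrix: 100 genes x 12 samples (hypothetical).
rng = np.random.default_rng(42)
counts = rng.poisson(lam=20, size=(100, 12)).astype(float)

# W holds each gene's weight in the two metagenes; H holds the metagene
# expression values per sample.
W, H = nmf(counts, rank=2)

# Rank genes in descending order of the difference between the two weights.
weight_diff = W[:, 0] - W[:, 1]
ranking = np.argsort(-weight_diff)
```

`ranking[0]` is the index of the gene whose weight favors the first metagene most strongly. With a label-aware factorization, the two ends of this ordering would correspond to up- and down-regulated genes.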
Affiliation(s)
- Zhilong Jia
  - Department of Chemistry and Biology, College of Science, National University of Defense Technology, Changsha, Hunan, P.R. China
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- Xiang Zhang
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
- Naiyang Guan
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
- Xiaochen Bo
  - Beijing Institute of Radiation Medicine, Beijing, P.R. China
- Michael R. Barnes
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
  - * E-mail: (MRB); (ZL)
- Zhigang Luo
  - Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Changsha, Hunan, P.R. China
  - * E-mail: (MRB); (ZL)
6
Žitnik M, Zupan B. Data Imputation in Epistatic MAPs by Network-Guided Matrix Completion. J Comput Biol 2015; 22:595-608. [PMID: 25658751 DOI: 10.1089/cmb.2014.0158] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 01/05/2023]
Abstract
Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values. We introduce a new interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction. In a study with four different E-MAP data assays and considered protein-protein interaction and gene ontology similarity networks, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in original E-MAP assays, a task that could not be considered by current imputation approaches.
Affiliation(s)
- Marinka Žitnik
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Blaž Zupan
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas
7
Guan N, Zhang X, Luo Z, Tao D, Yang X. Discriminant projective non-negative matrix factorization. PLoS One 2014; 8:e83291. [PMID: 24376680 PMCID: PMC3869764 DOI: 10.1371/journal.pone.0083291] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 07/26/2013] [Accepted: 11/12/2013] [Indexed: 11/24/2022]
Abstract
Projective non-negative matrix factorization (PNMF) projects high-dimensional non-negative examples $X$ onto a lower-dimensional subspace spanned by a non-negative basis $W$ and considers $W^{T}X$ as their coefficients, i.e., $X \approx WW^{T}X$. Since PNMF learns the natural parts-based representation $W$ of $X$, it has been widely used in many fields such as pattern recognition and computer vision. However, PNMF does not perform well in classification tasks because it completely ignores the label information of the dataset. This paper proposes a Discriminant PNMF method (DPNMF) to overcome this deficiency. In particular, DPNMF applies Fisher's criterion to PNMF to utilize the label information. Similar to PNMF, DPNMF learns a single non-negative basis matrix and has a lower computational burden than NMF. In contrast to PNMF, DPNMF maximizes the distance between centers of any two classes of examples while minimizing the distance between any two examples of the same class in the lower-dimensional subspace, and thus has more discriminant power. We develop a multiplicative update rule to solve DPNMF and prove its convergence. Experimental results on four popular face image datasets confirm its effectiveness compared with the representative NMF and PNMF algorithms.
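For reference, the projection described in this abstract can be written as an optimization problem. This is a sketch that assumes a squared Frobenius-norm reconstruction error, which the abstract does not state. The dimension symbols m, n and r are introduced here for illustration only, and the Fisher-criterion term that DPNMF adds is not reproduced.

```latex
\min_{W \ge 0} \; \bigl\lVert X - W W^{T} X \bigr\rVert_F^{2},
\qquad
X \in \mathbb{R}_{\ge 0}^{m \times n}, \quad
W \in \mathbb{R}_{\ge 0}^{m \times r}, \quad
r \ll m .
```

Here the columns of X are the n examples, W spans the r-dimensional subspace, and the coefficients W^T X form an r × n matrix. Only W is learned, which is why a single basis matrix suffices, as the abstract notes.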
Affiliation(s)
- Naiyang Guan
  - National Laboratory for Parallel and Distributed Processing, School of Computer Science, National University of Defense Technology, Changsha, Hunan, China
- Xiang Zhang
  - National Laboratory for Parallel and Distributed Processing, School of Computer Science, National University of Defense Technology, Changsha, Hunan, China
- Zhigang Luo
  - National Laboratory for Parallel and Distributed Processing, School of Computer Science, National University of Defense Technology, Changsha, Hunan, China
  - * E-mail: (ZL); (DT)
- Dacheng Tao
  - Centre for Quantum Computation and Intelligent Systems and the Faculty of Engineering and Information Technology, University of Technology, Sydney, Sydney, New South Wales, Australia
  - * E-mail: (ZL); (DT)
- Xuejun Yang
  - State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, Hunan, China
8
de Campos CP, Rancoita PMV, Kwee I, Zucca E, Zaffalon M, Bertoni F. Discovering subgroups of patients from DNA copy number data using NMF on compacted matrices. PLoS One 2013; 8:e79720. [PMID: 24278162 PMCID: PMC3835832 DOI: 10.1371/journal.pone.0079720] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/19/2013] [Accepted: 10/04/2013] [Indexed: 01/28/2023]
Abstract
In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decisions. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biological/clinical relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
Affiliation(s)
- Cassio P. de Campos
  - Dalle Molle Institute for Artificial Intelligence (IDSIA), Manno, Switzerland
  - Lymphoma and Genomics Research Program, Institute of Oncology Research (IOR), Bellinzona, Switzerland
  - * E-mail:
- Paola M. V. Rancoita
  - Dalle Molle Institute for Artificial Intelligence (IDSIA), Manno, Switzerland
  - Lymphoma and Genomics Research Program, Institute of Oncology Research (IOR), Bellinzona, Switzerland
  - University Centre of Statistics for Biomedical Sciences (CUSSB), Vita-Salute San Raffaele University, Milan, Italy
- Ivo Kwee
  - Dalle Molle Institute for Artificial Intelligence (IDSIA), Manno, Switzerland
  - Lymphoma and Genomics Research Program, Institute of Oncology Research (IOR), Bellinzona, Switzerland
  - Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Emanuele Zucca
  - Lymphoma Unit, Oncology Institute of Southern Switzerland (IOSI), Bellinzona, Switzerland
- Marco Zaffalon
  - Dalle Molle Institute for Artificial Intelligence (IDSIA), Manno, Switzerland
- Francesco Bertoni
  - Lymphoma and Genomics Research Program, Institute of Oncology Research (IOR), Bellinzona, Switzerland
  - Lymphoma Unit, Oncology Institute of Southern Switzerland (IOSI), Bellinzona, Switzerland
9
Rotival M, Petretto E. Leveraging gene co-expression networks to pinpoint the regulation of complex traits and disease, with a focus on cardiovascular traits. Brief Funct Genomics 2013; 13:66-78. [PMID: 23960099 DOI: 10.1093/bfgp/elt030] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Indexed: 01/08/2023]
Abstract
Over the past decade, the number of genome-scale transcriptional datasets in publicly available databases has climbed to nearly one million, providing an unprecedented opportunity for extensive analyses of gene co-expression networks. In systems-genetic studies of complex diseases researchers increasingly focus on groups of highly interconnected genes within complex transcriptional networks (referred to as clusters, modules or subnetworks) to uncover specific molecular processes that can inform functional disease mechanisms and pathological pathways. Here, we outline the basic paradigms underlying gene co-expression network analysis and critically review the most commonly used computational methods. Finally, we discuss specific applications of network-based approaches to the study of cardiovascular traits, which highlight the power of integrated analyses of networks, genetic and gene-regulation data to elucidate the complex mechanisms underlying cardiovascular disease.
Affiliation(s)
- Maxime Rotival
  - MRC-Clinical Sciences Centre, Hammersmith Hospital Campus, Imperial College Centre for Translational and Experimental Medicine (ICTEM Building), Du Cane Road, London, W12 0NN UK. Tel.: + 44-020-8383-1468; Fax: +44-208-383-8577;