1
Rasmussen M, Fredsøe J, Salachan PV, Blanke MPL, Larsen SH, Ulhøi BP, Jensen JB, Borre M, Sørensen KD. Stroma-specific gene expression signature identifies prostate cancer subtype with high recurrence risk. NPJ Precis Oncol 2024; 8:48. [PMID: 38395986 PMCID: PMC10891092 DOI: 10.1038/s41698-024-00540-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2023] [Accepted: 02/02/2024] [Indexed: 02/25/2024] Open
Abstract
Current prognostic tools cannot clearly distinguish indolent from aggressive prostate cancer (PC). We hypothesized that analyzing the individual contributions of epithelial and stromal components in localized PC (LPC) could improve risk stratification, as stromal subtypes may have been overlooked due to the emphasis on malignant epithelial cells. Hence, we derived molecular subtypes of PC using gene expression analysis of LPC samples from prostatectomy patients (cohort 1, n = 127) and validated these subtypes in two independent prostatectomy cohorts (cohort 2, n = 406; cohort 3, n = 126). Stroma- and epithelium-specific signatures were established from laser-capture microdissection data, and non-negative matrix factorization was used to identify subtypes based on these signatures. Subtypes were functionally characterized by gene set and cell type enrichment analyses, and survival analysis was conducted. Three epithelial (E1-E3) and three stromal (S1-S3) PC subtypes were identified. While subtyping based on epithelial signatures showed inconsistent associations with biochemical recurrence (BCR), subtyping by stromal signatures was significantly associated with BCR in all three cohorts, with subtype S3 indicating high BCR risk. Compared to subtype S1, subtype S3 exhibited distinct features, including significantly decreased cell polarity and myogenesis and significantly increased infiltration of M2-polarized macrophages and CD8+ T-cells. For patients clinically classified as CAPRA-S intermediate risk, S3 improved prediction of BCR. This study demonstrates the potential of stromal signatures in the identification of clinically relevant PC subtypes and further indicates that stromal characterization may enhance risk stratification in LPC and may be particularly promising in cases with high prognostic ambiguity based on clinical parameters.
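The subtype-discovery step mentioned in this abstract (non-negative matrix factorization on a signature-restricted expression matrix) can be illustrated with a minimal sketch. This is not the authors' pipeline: the multiplicative-update rule, the function names `nmf` and `assign_subtypes`, and the toy matrix are all assumptions chosen for illustration, and real analyses typically rely on an established NMF library and consensus clustering over many runs.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iters=500, seed=0, eps=1e-9):
    """Factor a non-negative genes x samples matrix V into W (genes x k)
    and H (k x samples) with Lee-Seung multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def assign_subtypes(h):
    """Label each sample (column of H) with the index of its dominant factor."""
    return [max(range(len(h)), key=lambda r: h[r][j]) for j in range(len(h[0]))]


# Toy data (hypothetical): genes 0-1 are high in samples 0-2, genes 2-3 in samples 3-5.
toy = [[5, 5, 5, 0, 0, 0],
       [4, 4, 4, 0, 0, 0],
       [0, 0, 0, 3, 3, 3],
       [0, 0, 0, 6, 6, 6]]
w, h = nmf(toy, k=2)
labels = assign_subtypes(h)
```

On this toy matrix the first three samples should share one label and the last three the other; which label index goes with which group depends on the random initialization.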
Affiliation(s)
- Martin Rasmussen
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Jacob Fredsøe
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Paul Vinu Salachan
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Marcus Pii Lunau Blanke
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Stine Hesselby Larsen
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Jørgen Bjerggaard Jensen
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
  - Department of Urology, Gødstrup Hospital, Herning, Denmark
- Michael Borre
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
  - Department of Urology, Aarhus University Hospital (AUH), Aarhus, Denmark
- Karina Dalsgaard Sørensen
  - Department of Molecular Medicine, Aarhus University Hospital (AUH), Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
2
Zhao L, Cunningham CM, Andruska AM, Schimmel K, Ali MK, Kim D, Gu S, Chang JL, Spiekerkoetter E, Nicolls MR. Rat microbial biogeography and age-dependent lactic acid bacteria in healthy lungs. Lab Anim (NY) 2024; 53:43-55. [PMID: 38297075 PMCID: PMC10834367 DOI: 10.1038/s41684-023-01322-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2022] [Accepted: 12/21/2023] [Indexed: 02/02/2024]
Abstract
The laboratory rat emerges as a useful tool for studying the interaction between the host and its microbiome. To advance principles relevant to the human microbiome, we systematically investigated and defined the multitissue microbial biogeography of healthy Fischer 344 rats across their lifespan. Microbial community profiling data were extracted and integrated with host transcriptomic data from the Sequencing Quality Control consortium. Unsupervised machine learning, correlation, taxonomic diversity and abundance analyses were performed to determine and characterize the rat microbial biogeography and identify four intertissue microbial heterogeneity patterns (P1-P4). We found that the 11 body habitats harbored a greater diversity of microbes than previously suspected. Lactic acid bacteria (LAB) abundance progressively declined in lungs from breastfed newborn to adolescence/adult, and was below detectable levels in elderly rats. Bioinformatics analyses indicate that the abundance of LAB may be modulated by the lung-immune axis. The presence and levels of LAB in lungs were further evaluated by PCR in two validation datasets. The lung, testes, thymus, kidney, adrenal and muscle niches were found to have age-dependent alterations in microbial abundance. The 357 microbial signatures were positively correlated with host genes in cell proliferation (P1), DNA damage repair (P2) and DNA transcription (P3). Our study established a link between the metabolic properties of LAB with lung microbiota maturation and development. Breastfeeding and environmental exposure influence microbiome composition and host health and longevity. The inferred rat microbial biogeography and pattern-specific microbial signatures could be useful for microbiome therapeutic approaches to human health and life quality enhancement.
Affiliation(s)
- Lan Zhao
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Christine M Cunningham
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Adam M Andruska
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Katharina Schimmel
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Md Khadem Ali
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Dongeon Kim
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Shenbiao Gu
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Jason L Chang
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Edda Spiekerkoetter
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
- Mark R Nicolls
  - Department of Medicine, Division of Pulmonary, Allergy, and Critical Care Medicine, Stanford, CA, USA
  - VA Palo Alto Health Care System, Palo Alto, CA, USA
  - Vera Moulton Wall Center for Pulmonary Vascular Disease, Stanford, CA, USA
3
Zhao L, Cunningham CM, Andruska AM, Schimmel K, Ali MK, Kim D, Gu S, Chang JL, Spiekerkoetter E, Nicolls MR. Rat microbial biogeography and age-dependent lactic acid bacteria in healthy lungs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.19.541527. [PMID: 37293045 PMCID: PMC10245737 DOI: 10.1101/2023.05.19.541527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
The laboratory rat emerges as a useful tool for studying the interaction between the host and its microbiome. To advance principles relevant to the human microbiome, we systematically investigated and defined a multi-tissue, full-lifespan microbial biogeography for healthy Fischer 344 rats. Microbial community profiling data were extracted and integrated with host transcriptomic data from the Sequencing Quality Control (SEQC) consortium. Unsupervised machine learning, Spearman's correlation, taxonomic diversity, and abundance analyses were performed to determine and characterize the rat microbial biogeography and to identify four inter-tissue microbial heterogeneity patterns (P1-P4). The 11 body habitats harbor a greater diversity of microbes than previously suspected. Lactic acid bacteria (LAB) abundance progressively declined in lungs from breastfed newborn to adolescence/adult and was below detectable levels in elderly rats. The presence and levels of LAB in lungs were further evaluated by PCR in the two validation datasets. The lung, testes, thymus, kidney, adrenal, and muscle niches were found to have age-dependent alterations in microbial abundance. P1 is dominated by lung samples. P2 contains the largest sample size and is enriched for environmental species. Liver and muscle samples were mostly classified into P3. Archaea species were exclusively enriched in P4. The 357 pattern-specific microbial signatures were positively correlated with host genes in cell migration and proliferation (P1), DNA damage repair and synaptic transmission (P2), and DNA transcription and cell cycle (P3). Our study established a link between the metabolic properties of LAB and lung microbiota maturation and development. Breastfeeding and environmental exposure influence microbiome composition and host health and longevity. The inferred rat microbial biogeography and pattern-specific microbial signatures would be useful for microbiome therapeutic approaches to human health and good quality of life.
4
Seyler LM, Kraus EA, McLean C, Spear JR, Templeton AS, Schrenk MO. An untargeted exometabolomics approach to characterize dissolved organic matter in groundwater of the Samail Ophiolite. Front Microbiol 2023; 14:1093372. [PMID: 36970670 PMCID: PMC10033605 DOI: 10.3389/fmicb.2023.1093372] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/08/2022] [Accepted: 01/23/2023] [Indexed: 03/11/2023] Open
Abstract
The process of serpentinization supports life on Earth and gives rise to the habitability of other worlds in our Solar System. While numerous studies have provided clues to the survival strategies of microbial communities in serpentinizing environments on the modern Earth, characterizing microbial activity in such environments remains challenging due to low biomass and extreme conditions. Here, we used an untargeted metabolomics approach to characterize dissolved organic matter in groundwater in the Samail Ophiolite, the largest and best characterized example of actively serpentinizing uplifted ocean crust and mantle. We found that dissolved organic matter composition is strongly correlated with both fluid type and microbial community composition, and that the fluids that were most influenced by serpentinization contained the greatest number of unique compounds, none of which could be identified using the current metabolite databases. Using metabolomics in conjunction with metagenomic data, we detected numerous products and intermediates of microbial metabolic processes and identified potential biosignatures of microbial activity, including pigments, porphyrins, quinones, fatty acids, and metabolites involved in methanogenesis. Metabolomics techniques like the ones used in this study may be used to further our understanding of life in serpentinizing environments, and aid in the identification of biosignatures that can be used to search for life in serpentinizing systems on other worlds.
Affiliation(s)
- Lauren M. Seyler
  - Department of Earth and Environmental Sciences, Michigan State University, East Lansing, MI, United States
  - Biology Program, School of Natural Sciences and Mathematics, Stockton University, Galloway, NJ, United States
  - Blue Marble Space Institute of Science, Seattle, WA, United States
  - *Correspondence: Lauren M. Seyler,
- Emily A. Kraus
  - Department of Civil and Environmental Engineering, Colorado School of Mines, Golden, CO, United States
  - Department of Environmental Engineering, University of Colorado, Boulder, Boulder, CO, United States
- Craig McLean
  - Massachusetts Institute of Technology, Cambridge, MA, United States
- John R. Spear
  - Department of Civil and Environmental Engineering, Colorado School of Mines, Golden, CO, United States
- Alexis S. Templeton
  - Department of Geological Sciences, University of Colorado, Boulder, Boulder, CO, United States
- Matthew O. Schrenk
  - Department of Earth and Environmental Sciences, Michigan State University, East Lansing, MI, United States
  - Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, United States
5
Liu X, Yu T, Zhao X, Long C, Han R, Su Z, Li G. ARBic: an all-round biclustering algorithm for analyzing gene expression data. NAR Genom Bioinform 2023; 5:lqad009. [PMID: 36733402 PMCID: PMC9887595 DOI: 10.1093/nargab/lqad009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 01/09/2023] [Accepted: 01/17/2023] [Indexed: 02/04/2023] Open
Abstract
Identifying significant biclusters of genes with specific expression patterns is an effective approach to reveal functionally correlated genes in gene expression data. However, none of the existing algorithms can simultaneously identify both broader and narrower biclusters, because they fail to balance effectiveness and efficiency. We introduce ARBic, an algorithm capable of accurately identifying significant biclusters of any shape, including broader, narrower and square ones, in any large-scale gene expression dataset. ARBic was designed by integrating column-based and row-based strategies into a single biclustering procedure. The column-based strategy, borrowed from RecBic, a recently published biclustering tool, extracts narrower biclusters, while the row-based strategy, which iteratively finds the longest path in a specific directed graph, extracts broader ones. When tested against seven other salient biclustering algorithms on simulated datasets, ARBic achieves on average at least 29% higher recovery, relevance and [Formula: see text] scores than the best existing tool. In addition, ARBic substantially outperforms all tools on real datasets and is more robust to noise, bicluster shapes and dataset types.
Affiliation(s)
- Xiangyu Liu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Jinan 250100, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Jinan 250100, China
- Xiaoyu Zhao
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Jinan 250100, China
- Chaoyi Long
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Jinan 250100, China
- Renmin Han
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Jinan 250100, China
- Zhengchang Su
  - Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Guojun Li
  - To whom correspondence should be addressed. Tel: +86 532 5863 1923; Fax: +86 532 5863 1929;
6
Robust semi-supervised data representation and imputation by correntropy based constraint nonnegative matrix factorization. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03884-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
7
Ko YJ, Kim S, Pan CH, Park K. Identification of Functional Microbial Modules Through Network-Based Analysis of Meta-Microbial Features Using Matrix Factorization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2851-2862. [PMID: 34329170 DOI: 10.1109/tcbb.2021.3100893] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
As the microbiome is composed of a variety of microbial interactions, it is imperative in microbiome research to identify a microbial sub-community that collectively conducts a specific function. However, current methodologies have been highly limited to analyzing conditional abundance changes of individual microorganisms without considering group-wise collective microbial features. To overcome this limitation, we developed a network-based method using nonnegative matrix factorization (NMF) to identify functional meta-microbial features (MMFs) that, as a group, better discriminate specific environmental conditions of samples using microbiome data. As proof of concept, large-scale human microbiome data collected from different body sites were used to identify body site-specific MMFs by applying NMF. The statistical test for MMFs led us to identify highly discriminative MMFs on sample classes, called synergistic MMFs (SYMMFs). Finally, we constructed a SYMMF-based microbial interaction network (SYMMF-net) by integrating all of the SYMMF information. Network analysis revealed core microbial modules closely related to critical sample properties. Similar results were also found when the method was applied to various disease-associated microbiome data. The developed method interprets high-dimensional microbiome data by identifying functional microbial modules on sample properties and intuitively representing their systematic relationships via a microbial network.
8
Barkley D, Moncada R, Pour M, Liberman DA, Dryg I, Werba G, Wang W, Baron M, Rao A, Xia B, França GS, Weil A, Delair DF, Hajdu C, Lund AW, Osman I, Yanai I. Cancer cell states recur across tumor types and form specific interactions with the tumor microenvironment. Nat Genet 2022; 54:1192-1201. [PMID: 35931863 PMCID: PMC9886402 DOI: 10.1038/s41588-022-01141-9] [Citation(s) in RCA: 112] [Impact Index Per Article: 56.0] [Received: 05/20/2021] [Accepted: 06/22/2022] [Indexed: 02/01/2023]
Abstract
Transcriptional heterogeneity among malignant cells of a tumor has been studied in individual cancer types and shown to be organized into cancer cell states; however, it remains unclear to what extent these states span tumor types, constituting general features of cancer. Here, we perform a pan-cancer single-cell RNA-sequencing analysis across 15 cancer types and identify a catalog of gene modules whose expression defines recurrent cancer cell states including 'stress', 'interferon response', 'epithelial-mesenchymal transition', 'metal response', 'basal' and 'ciliated'. Spatial transcriptomic analysis linked the interferon response in cancer cells to T cells and macrophages in the tumor microenvironment. Using mouse models, we further found that induction of the interferon response module varies by tumor location and is diminished upon elimination of lymphocytes. Our work provides a framework for studying how cancer cell states interact with the tumor microenvironment to form organized systems capable of immune evasion, drug resistance and metastasis.
Affiliation(s)
- Dalia Barkley
  - Institute for Computational Medicine, New York, NY, USA
- Maayan Pour
  - Institute for Computational Medicine, New York, NY, USA
- Ian Dryg
  - Department of Dermatology, NYU School of Medicine, New York, NY, USA
- Gregor Werba
  - Department of Surgery, NYU School of Medicine, New York, NY, USA
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
- Wei Wang
  - Department of Surgery, NYU School of Medicine, New York, NY, USA
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
- Maayan Baron
  - Institute for Computational Medicine, New York, NY, USA
- Anjali Rao
  - Institute for Computational Medicine, New York, NY, USA
- Bo Xia
  - Institute for Computational Medicine, New York, NY, USA
- Alejandro Weil
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
- Cristina Hajdu
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
- Amanda W. Lund
  - Department of Dermatology, NYU School of Medicine, New York, NY, USA
  - Department of Surgery, NYU School of Medicine, New York, NY, USA
  - Perlmutter Cancer Center NYU School of Medicine, New York, NY, USA
- Iman Osman
  - Department of Dermatology, NYU School of Medicine, New York, NY, USA
  - Department of Pathology, NYU School of Medicine, New York, NY, USA
  - Perlmutter Cancer Center NYU School of Medicine, New York, NY, USA
- Itai Yanai
  - Institute for Computational Medicine, New York, NY, USA
  - Perlmutter Cancer Center NYU School of Medicine, New York, NY, USA
  - Corresponding author:
9
Castanho EN, Aidos H, Madeira SC. Biclustering fMRI time series: a comparative study. BMC Bioinformatics 2022; 23:192. [PMID: 35606701 PMCID: PMC9126639 DOI: 10.1186/s12859-022-04733-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/27/2021] [Accepted: 05/13/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The effectiveness of biclustering, the simultaneous clustering of rows and columns in a data matrix, was shown in gene expression data analysis. Several researchers recognize its potential in other research areas. Nevertheless, the last two decades have witnessed the development of a significant number of biclustering algorithms targeting gene expression data analysis and a lack of consistent studies exploring the capacities of biclustering outside this traditional application domain. RESULTS This work evaluates the potential use of biclustering in fMRI time series data, targeting the Region × Time dimensions by comparing seven state-of-the-art biclustering and three traditional clustering algorithms on artificial and real data. It further proposes a methodology for biclustering evaluation beyond gene expression data analysis. The results, which cover different search strategies on both artificial and real fMRI time series, showed the superiority of exhaustive biclustering approaches, which obtained the most homogeneous biclusters. However, their high computational costs are a challenge, and further work is needed for the efficient use of biclustering in fMRI data analysis. CONCLUSIONS This work pinpoints avenues for the use of biclustering in spatio-temporal data analysis, in particular neuroscience applications. The proposed evaluation methodology showed evidence of the effectiveness of biclustering in finding local patterns in fMRI time series data. Further work is needed regarding scalability to promote the application in real scenarios.
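The abstract ranks approaches by how homogeneous their biclusters are but does not name the measure used. As a hedged illustration only, the sketch below computes one widely used homogeneity score, the mean squared residue of Cheng and Church; whether the study used this particular score is an assumption, and the function name and toy matrix are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by row and column indices.

    A value of 0 means every entry is explained by a row effect plus a
    column effect (a perfectly coherent additive pattern); larger values
    mean a less homogeneous bicluster.
    """
    sub = [[matrix[r][c] for c in cols] for r in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Toy Region x Time matrix (hypothetical values): the top-left 3 x 3 block
# follows an additive pattern, the remaining entries are noise.
data = [[1, 2, 3, 9],
        [2, 3, 4, 0],
        [5, 6, 7, 4],
        [0, 8, 1, 3]]
coherent = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
whole = mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
```

Here `coherent` is zero up to floating-point error, while `whole` is strictly positive, which is the sense in which the sub-block is the more homogeneous bicluster.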
Affiliation(s)
- Helena Aidos
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
- Sara C Madeira
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
10
Karagiannaki I, Gourlia K, Lagani V, Pantazis Y, Tsamardinos I. Learning biologically-interpretable latent representations for gene expression data: Pathway Activity Score Learning Algorithm. Mach Learn 2022; 112:4257-4287. [PMID: 37900054 PMCID: PMC10600308 DOI: 10.1007/s10994-022-06158-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Revised: 11/12/2021] [Accepted: 02/19/2022] [Indexed: 11/24/2022]
Abstract
Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, lower-dimensional representations that retain the useful biological information do exist. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways (genesets in general) and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a fairly straightforward biological interpretation. PASL is shown to outperform the state-of-the-art method (PLIER) in predictive performance on two collections of breast cancer and leukemia gene expression datasets. PASL is also trained on a large corpus of 50000 gene expression samples to construct a universal dictionary of features across different tissues and pathologies. The dictionary is validated on 35643 held-out samples for reconstruction error. It is then applied to 165 held-out datasets spanning a diverse range of diseases. The AutoML tool JADBio is employed to show that the predictive information in the PASL-created feature space is retained after the transformation. The code is available at https://github.com/mensxmachina/PASL.
Affiliation(s)
- Ioulia Karagiannaki
  - Institute of Electronic Structure and Laser, Foundation for Research and Technology-Hellas (IESL-FORTH), Heraklion, Greece
- Vincenzo Lagani
  - Institute of Chemical Biology, Ilia State University, Tbilisi, 0162 Georgia
  - JADBio, Gnosis Data Analysis PC, Heraklion, Crete Greece
- Yannis Pantazis
  - Institute of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas, Heraklion, Greece
- Ioannis Tsamardinos
  - Department of Computer Science, University of Crete, Heraklion, Greece
  - JADBio, Gnosis Data Analysis PC, Heraklion, Crete Greece
  - Institute of Applied and Computational Mathematics, Foundation for Research and Technology - Hellas, Heraklion, Greece
11
Zhao L, Cho WC, Luo JL. Exploring the patient-microbiome interaction patterns for pan-cancer. Comput Struct Biotechnol J 2022; 20:3068-3079. [PMID: 35782745 PMCID: PMC9233187 DOI: 10.1016/j.csbj.2022.06.012] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/04/2022] [Revised: 06/06/2022] [Accepted: 06/06/2022] [Indexed: 11/03/2022] Open
Abstract
Cancer subtype-specific sets of microbiomes create pan-cancer heterogeneity at the microbial level. Approximately 60% of the untreated cancer patients have experienced microbial composition changes in their tumor tissues. Colorectal cancer (CRC) was largely composed of two subtypes (S4 and S6) driven by different microbial profiles. The identified seven pan-cancer subtypes with 424 subtype-specific microbial signatures will help us find new therapeutic targets and better treatment strategies for cancer patients.
Microbes play important roles in human health and disease. Immunocompromised cancer patients are more vulnerable to microbial infections. Regions of hypoxia and the acidic tumor microenvironment shape microbial community diversity and abundance. Each cancer has its own microbiome, making cancer-specific sets of microbiomes. High-throughput profiling technologies provide a culture-free approach for microbial profiling in tumor samples. Microbial compositional data were extracted and examined from the TCGA unmapped transcriptome data. Biclustering, correlation, and statistical analyses were performed to determine the seven patient-microbe interaction patterns. Each of these two-dimensional patterns consists of a group of microbial species that shows significant over-representation in one of the 7 pan-cancer subtypes (S1-S7). Approximately 60% of the untreated cancer patients have experienced tissue microbial composition and functional changes between subtypes and normal controls. Among these changes, subtype S5 showed a loss of microbial diversity as well as impaired immune functions. S1, S2, and S3 were enriched with microbial signatures derived from the Gammaproteobacteria, Actinobacteria and Betaproteobacteria, respectively. Colorectal cancer (CRC) was largely composed of two subtypes, namely S4 and S6, driven by different microbial profiles. S4 patients had increased microbial load and were enriched with CRC-related oncogenic pathways. S6 CRC patients, together with other cancer patients classified into the S6 subtype, made up almost 40% of all cases; this subtype not only resembled the normal controls' microbiota but also retained the original “normal-like” functions. Lastly, S7 was a rare and understudied subtype. Our study investigated the pan-cancer heterogeneity at the microbial level.
The identified seven pan-cancer subtypes with 424 subtype-specific microbial signatures will help us find new therapeutic targets and better treatment strategies for cancer patients.
12
Liu R, Liu L, Zhou Y. m6Adecom: analysis of m6A profile matrix based on graph regularized non-negative matrix factorization. Methods 2022; 203:322-327. [DOI: 10.1016/j.ymeth.2022.01.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/21/2021] [Revised: 01/12/2022] [Accepted: 01/21/2022] [Indexed: 01/07/2023] Open
13
Gene Expression Analysis through Parallel Non-Negative Matrix Factorization. COMPUTATION 2021. [DOI: 10.3390/computation9100106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Gene expression analysis is a principal tool for explaining the behavior of genes in an organism exposed to different experimental conditions. Many clustering algorithms have been proposed in the state of the art. However, the amount of biological data is overwhelming, and its high-dimensional structure exceeds the capacity of most current computational architectures. Computational time and memory consumption therefore become decisive factors in choosing a clustering algorithm. We propose a clustering algorithm based on Non-negative Matrix Factorization and K-means that reduces data dimensionality while preserving the biological context and prioritizing gene selection; it is implemented in parallel GPU-based environments through the CUDA library. A well-known dataset is used in our tests, and the quality of the results is measured through the Rand and Accuracy Index. The results show a speedup of 6.22× compared to the sequential version. The algorithm is competitive in the analysis of biological datasets and is invariant with respect to the number of classes and the size of the gene expression matrix.
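The Rand index mentioned in this abstract as a quality measure can be computed directly from two label vectors. The sketch below is a generic, unadjusted Rand index written for illustration; it is not taken from the paper's code, and the function name is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings place the two
    samples together, or both place them apart. Cluster names do not
    need to match between the two labelings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Same partition with swapped cluster names: perfect agreement.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Crossed partition: only 2 of the 6 pairs agree.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on pairwise co-membership, it can compare a clustering against known class labels without first matching cluster identifiers.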
|
14
|
Zhang S, Li X, Lin Q, Wong KC. Nature-Inspired Compressed Sensing for Transcriptomic Profiling From Random Composite Measurements. IEEE Transactions on Cybernetics 2021; 51:4476-4487. [PMID: 31751263 DOI: 10.1109/tcyb.2019.2951402] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/10/2023]
Abstract
Transcriptomic profiling is a high-throughput approach to measure gene expression levels under different experimental conditions at different time points. With the development of related technologies such as single-cell RNA-Seq, the dimensionality of gene expression data has increased to hundreds of thousands or more for high-resolution insights. There is a long-standing challenge in exploiting the relations between transcriptomic profiles and random composite measurements. To address it, we propose a mathematical framework, termed DECS, based on differential evolution (global search) with the help of compressed sensing (local search). Exploiting the inherent sparse nature of gene expression data, DECS can learn the sparse module dictionaries and levels from low-dimensional random composite measurements to reconstruct the high-dimensional gene expression data with a large increase in dimensionality (e.g., 200×). Several experiments were conducted to compare DECS with three benchmark methods, demonstrating that DECS outperforms the benchmark methods and can recover most of the gene expression patterns. The underlying reasons are discussed and illustrated by revealing the related mechanistic insights through extensive benchmarks on nine GSE datasets and their sensitivity analysis.
|
15
|
Chung Y, Lee H. Correlation between Alzheimer's disease and type 2 diabetes using non-negative matrix factorization. Sci Rep 2021; 11:15265. [PMID: 34315930 PMCID: PMC8316581 DOI: 10.1038/s41598-021-94048-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/20/2021] [Accepted: 06/24/2021] [Indexed: 02/07/2023] Open
Abstract
Alzheimer's disease (AD) is a complex and heterogeneous disease that can be affected by various genetic factors. Although the cause of AD is not yet known and there is no treatment to cure this disease, its progression can be delayed. AD has recently been recognized as a brain-specific type of diabetes called type 3 diabetes. Several studies have shown that people with type 2 diabetes (T2D) have a higher risk of developing AD. Therefore, it is important to identify subgroups of patients with AD that may be more likely to be associated with T2D. We here describe a new approach to identify the correlation between AD and T2D at the genetic level. Subgroups of AD and T2D were each generated using a non-negative matrix factorization (NMF) approach, which generated clusters containing subsets of genes and samples. In the gene cluster that was generated by conventional gene clustering method from NMF, we selected genes with significant differences in the corresponding sample cluster by Kruskal-Wallis and Dunn-test. Subsequently, we extracted differentially expressed gene (DEG) subgroups, and candidate genes with the same regulation direction can be extracted at the intersection of two disease DEG subgroups. Finally, we identified 241 candidate genes that represent common features related to both AD and T2D, and based on pathway analysis we propose that these genes play a role in the common pathological features of AD and T2D. Moreover, in the prediction of AD using logistic regression analysis with an independent AD dataset, the candidate genes obtained better prediction performance than DEGs. In conclusion, our study revealed a subgroup of patients with AD that are associated with T2D and candidate genes associated between AD and T2D, which can help in providing personalized and suitable treatments.
Affiliation(s)
- Yeonwoo Chung
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea
| | - Hyunju Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Korea.
| | | |
|
16
|
Liu Z, Lu T, Wang L, Liu L, Li L, Han X. Comprehensive Molecular Analyses of a Novel Mutational Signature Classification System with Regard to Prognosis, Genomic Alterations, and Immune Landscape in Glioma. Front Mol Biosci 2021; 8:682084. [PMID: 34307451 PMCID: PMC8293748 DOI: 10.3389/fmolb.2021.682084] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 03/17/2021] [Accepted: 06/28/2021] [Indexed: 11/26/2022] Open
Abstract
Background: Glioma is the most common malignant brain tumor, with a complex carcinogenic process and poor prognosis. The current molecular classification cannot fully elucidate the molecular diversity of glioma. Methods: Using broad public datasets, we performed cluster analysis based on the mutational signatures and further investigated the multidimensional heterogeneity of the novel glioma molecular subtypes. The clinical significance and immune landscape of the four clusters were also investigated. A nomogram was developed using the mutational clusters and clinical characteristics. Results: Four heterogeneous clusters were identified, termed C1, C2, C3, and C4. These clusters presented distinct molecular features: C1 was characterized by signature 1, PTEN mutation, chromosome 7 amplification and chromosome 10 deletion; C2 was characterized by signature 8 and FLG mutation; C3 was characterized by signatures 3 and 13, ATRX and TP53 mutations, and 11p15.1, 11p15.5, and 13q14.2 deletions; and C4 was characterized by signature 16, IDH1 mutation and chromosome 1p and 19q deletions. These clusters also varied in biological functions and immune status. We underlined the potential immune escape mechanisms: abundant stromal and immunosuppressive cell infiltration and immune checkpoints (ICPs) blockade in C1; lack of immune cells, low immunogenicity and antigen presentation defect in C2 and C4; and ICPs blockade in C3. Moreover, C4 had a better prognosis, and C1 and C3 were more likely to benefit from immunotherapy. A nomogram with excellent performance was also developed for assessing the prognosis of patients with glioma. Conclusion: Our results can improve the understanding of molecular features and facilitate the precise treatment and clinical management of glioma.
Affiliation(s)
- Zaoqu Liu
- Department of Interventional Radiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China.,Interventional Institute of Zhengzhou University, Zhengzhou, China.,Interventional Treatment and Clinical Research Center of Henan Province, Zhengzhou, China
| | - Taoyuan Lu
- Department of Cerebrovascular Disease, Zhengzhou University People's Hospital, Zhengzhou, China
| | - Libo Wang
- Department of Hepatobiliary and Pancreatic Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Long Liu
- Department of Hepatobiliary and Pancreatic Surgery, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Lifeng Li
- Internet Medical and System Applications of National Engineering Laboratory, Zhengzhou, China
| | - Xinwei Han
- Department of Interventional Radiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China.,Interventional Institute of Zhengzhou University, Zhengzhou, China.,Interventional Treatment and Clinical Research Center of Henan Province, Zhengzhou, China
| |
|
17
|
Zhang C, Zhang S. Bayesian Joint Matrix Decomposition for Data Integration with Heterogeneous Noise. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:1184-1196. [PMID: 31603812 DOI: 10.1109/tpami.2019.2946370] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/10/2023]
Abstract
Matrix decomposition is a popular and fundamental approach in machine learning and data mining. It has been successfully applied in various fields. Most matrix decomposition methods focus on decomposing a data matrix from a single source. However, data commonly come from different sources with heterogeneous noise. Some matrix decomposition methods have been extended for such multi-view data integration and pattern discovery, but only a few were designed to explicitly consider the heterogeneity of noise in such multi-view data. To this end, in this article, we propose a Bayesian joint matrix decomposition framework (BJMD), which models the heterogeneity of noise by the Gaussian distribution in a Bayesian framework. We develop two algorithms to solve this model: one is a variational Bayesian inference algorithm, which makes full use of the posterior distribution; the other is a maximum a posteriori algorithm, which is more scalable and can be easily parallelized. Extensive experiments on synthetic and real-world datasets demonstrate that BJMD is superior to or competitive with the state-of-the-art methods.
|
18
|
Lee Y, Bogdanoff D, Wang Y, Hartoularos GC, Woo JM, Mowery CT, Nisonoff HM, Lee DS, Sun Y, Lee J, Mehdizadeh S, Cantlon J, Shifrut E, Ngyuen DN, Roth TL, Song YS, Marson A, Chow ED, Ye CJ. XYZeq: Spatially resolved single-cell RNA sequencing reveals expression heterogeneity in the tumor microenvironment. Science Advances 2021; 7:eabg4755. [PMID: 33883145 PMCID: PMC8059935 DOI: 10.1126/sciadv.abg4755] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Received: 01/08/2021] [Accepted: 03/04/2021] [Indexed: 05/07/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) of tissues has revealed remarkable heterogeneity of cell types and states but does not provide information on the spatial organization of cells. To better understand how individual cells function within an anatomical space, we developed XYZeq, a workflow that encodes spatial metadata into scRNA-seq libraries. We used XYZeq to profile mouse tumor models to capture spatially barcoded transcriptomes from tens of thousands of cells. Analyses of these data revealed the spatial distribution of distinct cell types and a cell migration-associated transcriptomic program in tumor-associated mesenchymal stem cells (MSCs). Furthermore, we identify localized expression of tumor suppressor genes by MSCs that vary with proximity to the tumor core. We demonstrate that XYZeq can be used to map the transcriptome and spatial localization of individual cells in situ to reveal how cell composition and cell states can be affected by location within complex pathological tissue.
Affiliation(s)
- Youjin Lee
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA.
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
| | - Derek Bogdanoff
- Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA 94158, USA
- Center for Advanced Technology, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Yutong Wang
- Graduate Group in Biostatistics, University of California, Berkeley, CA 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
| | - George C Hartoularos
- Graduate Program in Biological and Medical Informatics, University of California, San Francisco, San Francisco, CA 94158, USA
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA 94143, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Jonathan M Woo
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
| | - Cody T Mowery
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
- Medical Scientist Training Program, University of California, San Francisco, San Francisco, CA 94143, USA
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Hunter M Nisonoff
- Center for Computational Biology, University of California, Berkeley, CA 94720, USA
| | - David S Lee
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA 94143, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Yang Sun
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA 94143, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
| | - James Lee
- Division of Hematology and Oncology, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Sadaf Mehdizadeh
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
| | | | - Eric Shifrut
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
| | - David N Ngyuen
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
- Department of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Theodore L Roth
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Medical Scientist Training Program, University of California, San Francisco, San Francisco, CA 94143, USA
- Biomedical Sciences Graduate Program, University of California, San Francisco, San Francisco, CA 94143, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA 94720, USA
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
| | - Alexander Marson
- Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA 94143, USA.
- Diabetes Center, University of California, San Francisco, San Francisco, CA 94143, USA
- Innovative Genomics Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- J. David Gladstone Institutes, San Francisco, CA 94158, USA
- Department of Medicine, University of California, San Francisco, San Francisco, CA 94143, USA
- Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
- UCSF Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, USA
- Parker Institute for Cancer Immunotherapy, University of California, San Francisco, San Francisco, CA 94129, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA 94158, USA
| | - Eric D Chow
- Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA 94158, USA.
- Center for Advanced Technology, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA 94143, USA.
- Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
- Parker Institute for Cancer Immunotherapy, University of California, San Francisco, San Francisco, CA 94129, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA
- Gladstone-UCSF Institute of Genomic Immunology, San Francisco, CA 94158, USA
- Institute of Computational Health Sciences, University of California, San Francisco, San Francisco, CA 94143, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA 94158, USA
| |
|
19
|
Liu X, Li D, Liu J, Su Z, Li G. RecBic: a fast and accurate algorithm recognizing trend-preserving biclusters. Bioinformatics 2021; 36:5054-5060. [PMID: 32653907 DOI: 10.1093/bioinformatics/btaa630] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/19/2019] [Revised: 06/24/2020] [Accepted: 07/06/2020] [Indexed: 01/09/2023] Open
Abstract
Motivation: Biclustering has emerged as a powerful approach to identifying functional patterns in complex biological data. However, existing tools are limited in the accuracy and efficiency with which they recognize various kinds of complex biclusters submerged in ever-larger datasets. We introduce a novel fast and highly accurate algorithm, RecBic, to identify various forms of complex biclusters in gene expression datasets. Results: We designed RecBic to identify various trend-preserving biclusters, particularly those with narrow shapes, i.e. clusters where the number of genes is larger than the number of conditions/samples. Given a gene expression matrix, RecBic starts with a column seed and grows it into a full-sized bicluster by simply and repeatedly comparing real numbers. When tested on simulated datasets in which the elements of implanted trend-preserving biclusters and those of the background matrix have the same distribution, RecBic was able to identify the implanted biclusters in a nearly perfect manner, outperforming all the compared salient tools in terms of accuracy and robustness to noise and overlaps between the clusters. Moreover, RecBic also showed superiority in identifying functionally related genes in real gene expression datasets. Availability and implementation: Code, sample input data and usage instructions are available at the following websites. Code: https://github.com/holyzews/RecBic/tree/master/RecBic/. Data: http://doi.org/10.5281/zenodo.3842717. Supplementary information: Supplementary data are available at Bioinformatics online.
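To make the notion of a trend-preserving bicluster more concrete, the sketch below checks whether a chosen set of rows (genes) orders a chosen set of columns (conditions) identically, using nothing but comparisons of real numbers. This is a simplified, same-direction definition adopted here for illustration; it is not the RecBic algorithm, which grows biclusters from column seeds, and the function names and toy matrix are hypothetical.

```python
def column_order(row, cols):
    """Return the selected columns sorted by the row's values.

    Ties are broken by column index so the ordering is deterministic.
    """
    return tuple(sorted(cols, key=lambda c: (row[c], c)))


def is_trend_preserving(matrix, rows, cols):
    """True if every selected row ranks the selected columns in the same order."""
    reference = column_order(matrix[rows[0]], cols)
    return all(column_order(matrix[r], cols) == reference for r in rows[1:])


# Toy expression matrix (hypothetical): rows are genes, columns are conditions.
expression = [
    [1.0, 3.0, 2.0, 9.0],
    [10.0, 30.0, 20.0, 0.5],
    [5.0, 4.0, 6.0, 1.0],
]

# Genes 0 and 1 differ in scale but rise and fall together on conditions 0-2.
print(is_trend_preserving(expression, rows=[0, 1], cols=[0, 1, 2]))
# Adding gene 2, or condition 3, breaks the shared trend.
print(is_trend_preserving(expression, rows=[0, 1, 2], cols=[0, 1, 2]))
print(is_trend_preserving(expression, rows=[0, 1], cols=[0, 1, 2, 3]))
```

Because only the ordering of values matters, genes 0 and 1 form a valid bicluster here even though their expression levels differ by a factor of ten.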
Affiliation(s)
- Xiangyu Liu
- Research Center for Mathematics and Interdisciplinary Sciences.,School of Mathematics, Shandong University, Jinan 250100, China
| | - Di Li
- Research Center for Mathematics and Interdisciplinary Sciences.,School of Mathematics, Shandong University, Jinan 250100, China
| | - Juntao Liu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences.,School of Mathematics, Shandong University, Jinan 250100, China
| |
|
20
|
Lemsara A, Ouadfel S, Fröhlich H. PathME: pathway based multi-modal sparse autoencoders for clustering of patient-level multi-omics data. BMC Bioinformatics 2020; 21:146. [PMID: 32299344 PMCID: PMC7161108 DOI: 10.1186/s12859-020-3465-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 11/13/2019] [Accepted: 03/23/2020] [Indexed: 02/08/2023] Open
Abstract
Background: Recent years have witnessed an increasing interest in multi-omics data, because these data allow for a better understanding of complex diseases such as cancer on a molecular system level. In addition, multi-omics data increase the chance to robustly identify molecular patient sub-groups and hence open the door towards a better personalized treatment of diseases. Several methods have been proposed for unsupervised clustering of multi-omics data. However, a number of challenges remain, such as the magnitude of features and the large difference in dimensionality across different omics data sources. Results: We propose a multi-modal sparse denoising autoencoder framework coupled with sparse non-negative matrix factorization to robustly cluster patients based on multi-omics data. The proposed model specifically leverages pathway information to effectively reduce the dimensionality of omics data into a pathway and patient specific score profile. In consequence, our method allows us to understand which pathway is a feature of which particular patient cluster. Moreover, recently proposed machine learning techniques allow us to disentangle the specific impact of each individual omics feature on a pathway score. We applied our method to cluster patients in several cancer datasets using gene expression, miRNA expression, DNA methylation and CNVs, demonstrating the possibility to obtain biologically plausible disease subtypes characterized by specific molecular features. Comparison against several competing methods showed a competitive clustering performance. In addition, post-hoc analysis of somatic mutations and clinical data provided supporting evidence and interpretation of the identified clusters. Conclusions: Our suggested multi-modal sparse denoising autoencoder approach allows for an effective and interpretable integration of multi-omics data on pathway level while addressing the high dimensional character of omics data. Patient specific pathway score profiles derived from our model allow for a robust identification of disease subgroups.
Affiliation(s)
- Amina Lemsara
- Computer Science Department, University of Constantine 2, 25016, Constantine, Algeria
| | - Salima Ouadfel
- Computer Science Department, University of Constantine 2, 25016, Constantine, Algeria
| | - Holger Fröhlich
- Bonn-Aachen International Center for IT, University of Bonn, 53115 Bonn, Germany; Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), 53754 Sankt Augustin, Germany
| |
|
21
|
Moncada R, Barkley D, Wagner F, Chiodin M, Devlin JC, Baron M, Hajdu CH, Simeone DM, Yanai I. Integrating microarray-based spatial transcriptomics and single-cell RNA-seq reveals tissue architecture in pancreatic ductal adenocarcinomas. Nat Biotechnol 2020; 38:333-342. [PMID: 31932730 DOI: 10.1038/s41587-019-0392-8] [Citation(s) in RCA: 426] [Impact Index Per Article: 106.5] [Received: 04/07/2019] [Accepted: 12/11/2019] [Indexed: 12/12/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) enables the systematic identification of cell populations in a tissue, but characterizing their spatial organization remains challenging. We combine a microarray-based spatial transcriptomics method that reveals spatial patterns of gene expression using an array of spots, each capturing the transcriptomes of multiple adjacent cells, with scRNA-Seq generated from the same sample. To annotate the precise cellular composition of distinct tissue regions, we introduce a method for multimodal intersection analysis. Applying multimodal intersection analysis to primary pancreatic tumors, we find that subpopulations of ductal cells, macrophages, dendritic cells and cancer cells have spatially restricted enrichments, as well as distinct coenrichments with other cell types. Furthermore, we identify colocalization of inflammatory fibroblasts and cancer cells expressing a stress-response gene module. Our approach for mapping the architecture of scRNA-seq-defined subpopulations can be applied to reveal the interactions inherent to complex tissues.
Affiliation(s)
- Reuben Moncada
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | - Dalia Barkley
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | - Florian Wagner
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | - Marta Chiodin
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | - Joseph C Devlin
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | - Maayan Baron
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA
| | | | - Diane M Simeone
- Department of Pathology, NYU Langone Health, New York, NY, USA
- Department of Surgery, NYU Langone Health, New York, NY, USA
- Perlmutter Cancer Center, NYU Langone Health, New York, NY, USA
| | - Itai Yanai
- Institute for Computational Medicine, NYU Langone Health, New York, NY, USA.
- Department of Biochemistry and Molecular Pharmacology, NYU Langone Health, New York, NY, USA.
| |
|
22
|
Appice A, Tsoumakas G, Manolopoulos Y, Matwin S. Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data. Discovery Science 2020. [PMCID: PMC7556388 DOI: 10.1007/978-3-030-61527-7_17] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/01/2022]
Abstract
Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (i.e., high-dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL's latent space has a relatively straightforward biological interpretation. As a use case, PASL is applied to two collections of breast cancer and leukemia gene expression datasets. We show that PASL retains the predictive information for disease classification on new, unseen datasets and outperforms PLIER, a recently proposed competing method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.
|
23
|
Wu MJ, Gao YL, Liu JX, Zhu R, Wang J. Principal Component Analysis Based on Graph Laplacian and Double Sparse Constraints for Feature Selection and Sample Clustering on Multi-View Data. Hum Hered 2019; 84:47-58. [PMID: 31466072 DOI: 10.1159/000501653] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/18/2019] [Accepted: 06/23/2019] [Indexed: 11/19/2022] Open
Abstract
Principal component analysis (PCA) is a widely used method for evaluating low-dimensional data. Some variants of PCA have been proposed to improve the interpretation of the principal components (PCs). One of the most common is sparse PCA, which aims to find a sparse basis that is more interpretable than the dense basis of PCA. However, the performance of these improved methods is still far from satisfactory because the data still contain redundant PCs. In this paper, a novel method called PCA based on graph Laplacian and double sparse constraints (GDSPCA) is proposed to improve the interpretation of the PCs and consider the internal geometry of the data. In detail, GDSPCA uses L2,1-norm and L1-norm regularization terms simultaneously to enforce sparsity in the matrix by filtering out redundant and irrelevant PCs: the L2,1-norm regularization term produces row sparsity, while the L1-norm regularization term enforces element sparsity. In this way, the new PCs in the low-dimensional subspace can be interpreted more easily. Meanwhile, GDSPCA integrates the graph Laplacian into PCA to explore the geometric structure hidden in the data. A simple and effective optimization solution is provided. Extensive experiments on multi-view biological data demonstrate the feasibility and effectiveness of the proposed approach.
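The difference between the two regularizers named in this abstract can be seen in a few lines. The sketch below computes the L1-norm (sum of absolute entries) and the L2,1-norm (sum of the Euclidean norms of the rows) of a small matrix, using the common row-wise convention for L2,1; the helper names and the example matrix are assumptions for illustration and are not taken from the GDSPCA paper.

```python
import math


def l1_norm(matrix):
    """Sum of absolute values of all entries; penalizes each element separately."""
    return sum(abs(x) for row in matrix for x in row)


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows; penalizes each row as a group."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


# Hypothetical loading matrix: the second row is entirely zero (row sparsity),
# the third row has a single zero entry (element sparsity).
Q = [
    [3.0, -4.0],
    [0.0, 0.0],
    [1.0, 0.0],
]

print(l1_norm(Q))   # 3 + 4 + 0 + 0 + 1 + 0 = 8.0
print(l21_norm(Q))  # 5 + 0 + 1 = 6.0
```

A row contributes nothing to the L2,1-norm only when all of its entries are zero, which is why that penalty tends to switch whole rows off, whereas the L1-norm rewards every individual zero.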
Affiliation(s)
- Ming-Juan Wu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
| | - Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Rong Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| | - Juan Wang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
| |
|
24
|
Woo J, Winterhoff BJ, Starr TK, Aliferis C, Wang J. De novo prediction of cell-type complexity in single-cell RNA-seq and tumor microenvironments. Life Sci Alliance 2019; 2:e201900443. [PMID: 31266885 PMCID: PMC6607449 DOI: 10.26508/lsa.201900443] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/24/2019] [Accepted: 06/24/2019] [Indexed: 12/30/2022] Open
Abstract
This study describes a computational method for determining statistical support to varying levels of heterogeneity provided by single-cell RNA-sequencing data with applications to tumor samples. Recent single-cell transcriptomic studies revealed new insights into cell-type heterogeneities in cellular microenvironments unavailable from bulk studies. A significant drawback of currently available algorithms is the need to use empirical parameters or rely on indirect quality measures to estimate the degree of complexity, i.e., the number of subgroups present in the sample. We fill this gap with a single-cell data analysis procedure allowing for unambiguous assessments of the depth of heterogeneity in subclonal compositions supported by data. Our approach combines nonnegative matrix factorization, which takes advantage of the sparse and nonnegative nature of single-cell RNA count data, with Bayesian model comparison enabling de novo prediction of the depth of heterogeneity. We show that the method predicts the correct number of subgroups using simulated data, primary blood mononuclear cell, and pancreatic cell data. We applied our approach to a collection of single-cell tumor samples and found two qualitatively distinct classes of cell-type heterogeneity in cancer microenvironments.
Affiliation(s)
- Jun Woo
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
| | - Boris J Winterhoff
- Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA; Department of Obstetrics, Gynecology and Women's Health, University of Minnesota, Minneapolis, MN, USA
| | - Timothy K Starr
- Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA; Department of Obstetrics, Gynecology and Women's Health, University of Minnesota, Minneapolis, MN, USA
| | - Constantin Aliferis
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
| | - Jinhua Wang
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA; Masonic Cancer Center, University of Minnesota, Minneapolis, MN, USA
| |
|
25
|
Winham SJ, Larson NB, Armasu SM, Fogarty ZC, Larson MC, McCauley BM, Wang C, Lawrenson K, Gayther S, Cunningham JM, Fridley BL, Goode EL. Molecular signatures of X chromosome inactivation and associations with clinical outcomes in epithelial ovarian cancer. Hum Mol Genet 2019; 28:1331-1342. [PMID: 30576442 DOI: 10.1093/hmg/ddy444] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/12/2018] [Revised: 10/12/2018] [Accepted: 12/14/2018] [Indexed: 12/19/2022] Open
Abstract
X chromosome inactivation (XCI) is a key epigenetic gene expression regulatory process, which may play a role in women's cancer. In particular tissues, some genes are known to escape XCI, yet patterns of XCI in ovarian cancer (OC) and their clinical associations are largely unknown. To examine XCI in OC, we integrated germline genotype with tumor copy number, gene expression and DNA methylation information from 99 OC patients. Approximately 10% of genes showed different XCI status (either escaping or being subject to XCI) compared with the studies of other tissues. Many of these genes are known oncogenes or tumor suppressors (e.g. DDX3X, TRAPPC2 and TCEANC). We also observed strong association between cis promoter DNA methylation and allele-specific expression imbalance (P = 2.0 × 10⁻¹⁰). Cluster analyses of the integrated data identified two molecular subgroups of OC patients representing those with regulated (N = 47) and dysregulated (N = 52) XCI. This XCI cluster membership was associated with expression of X inactive specific transcript (P = 0.002), a known driver of XCI, as well as age, grade, stage, tumor histology and extent of residual disease following surgical debulking. Patients with dysregulated XCI (N = 52) had shorter time to recurrence (HR = 2.34, P = 0.001) and overall survival time (HR = 1.87, P = 0.02) than those with regulated XCI, although results were attenuated after covariate adjustment. Similar findings were observed when restricted to high-grade serous tumors. We found evidence of a unique OC XCI profile, suggesting that XCI may play an important role in OC biology. Additional studies to examine somatic changes with paired tumor-normal tissue are needed.
Affiliation(s)
- Stacey J Winham
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Nicholas B Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Sebastian M Armasu
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Zachary C Fogarty
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Melissa C Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Brian M McCauley
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Chen Wang
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Kate Lawrenson
- Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Center for Bioinformatics and Functional Genomics, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Simon Gayther
- Center for Bioinformatics and Functional Genomics, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
| | - Julie M Cunningham
- Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA
| | - Brooke L Fridley
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
| | - Ellen L Goode
- Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| |
|
26
|
Esposito F, Gillis N, Del Buono N. Orthogonal joint sparse NMF for microarray data analysis. J Math Biol 2019; 79:223-247. [PMID: 31004215 DOI: 10.1007/s00285-019-01355-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/22/2018] [Revised: 03/29/2019] [Indexed: 12/20/2022]
Abstract
The 3D microarrays, generally known as gene-sample-time microarrays, couple the information on different time points collected by 2D microarrays that measure gene expression levels among different samples. Their analysis is useful in several biomedical applications, like monitoring dose or drug treatment responses of patients over time in pharmacogenomics studies. Many statistical and data analysis tools have been used to extract useful information. In particular, nonnegative matrix factorization (NMF), with its natural nonnegativity constraints, has demonstrated its ability to extract from 2D microarrays relevant information on specific genes involved in the particular biological process. In this paper, we propose a new NMF model, namely Orthogonal Joint Sparse NMF, to extract relevant information from 3D microarrays containing the time evolution of a 2D microarray, by adding additional constraints to enforce important biological properties useful for further biological analysis. We develop multiplicative update rules that decrease the objective function monotonically, and compare our approach to state-of-the-art NMF algorithms on both synthetic and real data sets.
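As a rough illustration of what "multiplicative update rules that decrease the objective function monotonically" means, the sketch below implements the classical Lee-Seung updates for plain Frobenius-norm NMF. It is not the Orthogonal Joint Sparse NMF model of the paper: the orthogonality and sparsity constraints are omitted, and the toy matrix `V`, the helper names, and the small `eps` guard are assumptions made only for this example.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of the residual V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Classical multiplicative updates for min ||V - WH||^2 with W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start keeps every later iterate nonnegative.
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(V, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frobenius_error(V, W, H))
    return W, H, errors

# Toy nonnegative "genes x samples" matrix with two visible blocks (made up).
V = [[5, 4, 0, 0, 1],
     [4, 5, 0, 1, 0],
     [0, 0, 3, 4, 5],
     [0, 1, 4, 3, 5]]
W, H, errors = nmf_multiplicative(V, rank=2)
print("first error:", errors[0], "last error:", errors[-1])
```

The `errors` list makes the monotone behaviour visible: each entry should be no larger than the previous one, up to floating-point noise.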
Affiliation(s)
- Flavia Esposito
- Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy; INDAM Research Group GNCS, Roma, Italy
| | - Nicolas Gillis
- Department of Mathematics and Operational Research, Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
| | - Nicoletta Del Buono
- Department of Mathematics, University of Bari Aldo Moro, via E. Orabona 4, 70125, Bari, Italy; INDAM Research Group GNCS, Roma, Italy
| |
|
27
|
Laplacian regularized low-rank representation for cancer samples clustering. Comput Biol Chem 2018; 78:504-509. [PMID: 30528509 DOI: 10.1016/j.compbiolchem.2018.11.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/25/2018] [Accepted: 11/07/2018] [Indexed: 12/18/2022]
Abstract
Clustering of cancer samples based on biomolecular data has become an important tool for cancer classification. The recognition of cancer types is of great importance for cancer treatment. In this paper, in order to improve the accuracy of cancer recognition, we propose to use Laplacian regularized Low-Rank Representation (LLRR) to cluster the cancer samples based on genomic data. In the LLRR method, the high-dimensional genomic data are approximately treated as samples extracted from a combination of several low-rank subspaces. The purpose of the LLRR method is to seek the lowest-rank representation matrix based on a dictionary. Because a manifold-based Laplacian regularization is introduced into LLRR, it can capture not only the global geometric structure, as the Low-Rank Representation (LRR) method does, but also the intrinsic local structure of high-dimensional observation data. Moreover, in LLRR the original data themselves are selected as the dictionary, so the lowest-rank representation is effectively an expression of similarity between the samples. Therefore, according to the low-rank representation matrix, samples with high similarity are considered to come from the same subspace and are grouped into one class. The experimental results on real genomic data illustrate that the LLRR method, compared with LRR and MLLRR, is more robust to noise, has a better ability to learn the inherent subspace structure of data, and achieves remarkable performance in the clustering of cancer samples.
|
28
|
Carmona-Sáez P, Varela N, Luque MJ, Toro-Domínguez D, Martorell-Marugan J, Alarcón-Riquelme ME, Marañón C. Metagene projection characterizes GEN2.2 and CAL-1 as relevant human plasmacytoid dendritic cell models. Bioinformatics 2018; 33:3691-3695. [PMID: 28961902 DOI: 10.1093/bioinformatics/btx502] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/13/2016] [Accepted: 08/06/2017] [Indexed: 12/24/2022] Open
Abstract
Motivation: Plasmacytoid dendritic cells (pDC) play a major role in the regulation of adaptive and innate immunity. Human pDC are difficult to isolate from peripheral blood and do not survive in culture, making the study of their biology challenging. Recently, two leukemic counterparts of pDC, CAL-1 and GEN2.2, have been proposed as representative models of human pDC. Nevertheless, their relationship with pDC has been established only by means of particular functional and phenotypic similarities. With the aim of characterizing GEN2.2 and CAL-1 in the context of the main circulating immune cell populations, we have performed microarray gene expression profiling of GEN2.2 and carried out an integrated analysis using publicly available gene expression datasets of CAL-1 and the main circulating primary leukocyte lineages. Results: Our results show that GEN2.2 and CAL-1 share common gene expression programs with primary pDC, clustering apart from the rest of circulating hematopoietic lineages. We have also identified common differentially expressed genes that can be relevant in pDC biology. In addition, we have revealed the common and differential pathways activated in primary pDC and cell lines upon CpG stimulation. Availability and implementation: R code and data are available in the supplementary material. Contact: pedro.carmona@genyo.es or concepcion.maranon@genyo.es. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | - Nieves Varela
- Genomic Medicine Department, GENYO, Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, PTS 18016, Granada, Spain
| | - María José Luque
- Genomic Medicine Department, GENYO, Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, PTS 18016, Granada, Spain
| | - Daniel Toro-Domínguez
- Bioinformatics Unit; Genomic Medicine Department, GENYO, Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, PTS 18016, Granada, Spain
| | | | - Marta E Alarcón-Riquelme
- Genomic Medicine Department, GENYO, Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, PTS 18016, Granada, Spain; Institute for Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
| | - Concepción Marañón
- Genomic Medicine Department, GENYO, Centre for Genomics and Oncological Research, Pfizer/University of Granada/Andalusian Regional Government, PTS 18016, Granada, Spain
| |
|
29
|
Liu JX, Wang D, Gao YL, Zheng CH, Xu Y, Yu J. Regularized Non-Negative Matrix Factorization for Identifying Differentially Expressed Genes and Clustering Samples: A Survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:974-987. [PMID: 28186906 DOI: 10.1109/tcbb.2017.2665557] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 06/06/2023]
Abstract
Non-negative Matrix Factorization (NMF), a classical method for dimensionality reduction, has been applied in many fields. It is based on the idea that negative numbers are physically meaningless in various data-processing tasks. Apart from its contribution to conventional data analysis, the recent overwhelming interest in NMF is due to its newly discovered ability to solve challenging data mining and machine learning problems, especially in relation to gene expression data. This survey paper mainly focuses on research examining the application of NMF to identify differentially expressed genes and to cluster samples, and the main NMF models, properties, principles, and algorithms with its various generalizations, extensions, and modifications are summarized. The experimental results demonstrate the performance of the various NMF algorithms in identifying differentially expressed genes and clustering samples.
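The two uses of NMF named in this abstract, clustering samples and picking out characteristic genes, are commonly read directly off the factor matrices. The sketch below shows one simple convention, assuming a factorization V ≈ WH with genes as rows of `W` and samples as columns of `H`. The matrices, gene names, and function names are hypothetical illustration values, not results from the survey.

```python
def assign_clusters(H):
    """Assign each sample (a column of H) to the factor with the largest coefficient."""
    n_factors, n_samples = len(H), len(H[0])
    return [max(range(n_factors), key=lambda k: H[k][j]) for j in range(n_samples)]

def top_genes(W, gene_names, factor, n_top=2):
    """Rank genes (rows of W) by their loading on one factor and keep the top few."""
    order = sorted(range(len(W)), key=lambda i: W[i][factor], reverse=True)
    return [gene_names[i] for i in order[:n_top]]

# Hypothetical factors: 4 genes x 2 factors, and 2 factors x 5 samples.
W = [[0.9, 0.0],
     [0.8, 0.1],
     [0.1, 0.7],
     [0.0, 1.2]]
H = [[2.0, 1.5, 0.1, 0.2, 0.0],
     [0.1, 0.3, 1.8, 2.2, 1.1]]
genes = ["geneA", "geneB", "geneC", "geneD"]

labels = assign_clusters(H)
print(labels)                    # [0, 0, 1, 1, 1]
print(top_genes(W, genes, 0))    # ['geneA', 'geneB']
print(top_genes(W, genes, 1))    # ['geneD', 'geneC']
```

Taking the arg-max over factors is only one of several assignment rules. Regularized variants mainly change how `W` and `H` are estimated, not how they are read afterwards.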
|
30
|
|
31
|
Gu Q, Veselkov K. Bi-clustering of metabolic data using matrix factorization tools. Methods 2018; 151:12-20. [PMID: 29438828 PMCID: PMC6297113 DOI: 10.1016/j.ymeth.2018.02.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/17/2018] [Revised: 02/04/2018] [Accepted: 02/06/2018] [Indexed: 01/08/2023] Open
Abstract
We propose a positive matrix factorization bi-clustering strategy for metabolic data. The approach automatically determines the number and composition of bi-clusters. We demonstrate its superior performance compared to other techniques.
Metabolic phenotyping technologies based on Nuclear Magnetic Resonance (NMR) spectroscopy and Mass Spectrometry (MS) generate vast amounts of unrefined data from biological samples. Clustering strategies are frequently employed to provide insight into patterns of relationships between samples and metabolites. Here, we propose the use of a non-negative matrix factorization driven bi-clustering strategy for metabolic phenotyping data in order to discover subsets of interrelated metabolites that exhibit similar behaviour across subsets of samples. The proposed strategy incorporates bi-cross validation and statistical segmentation techniques to automatically determine the number and structure of bi-clusters. This alternative approach is in contrast to the widely used conventional clustering approaches that incorporate all molecular peaks for clustering in metabolic studies and require a priori specification of the number of clusters. We perform a comparative analysis of the proposed strategy with other bi-clustering approaches, which were developed in the context of genomics and transcriptomics research. We demonstrate the superior performance of the proposed bi-clustering strategy on both simulated (NMR) and real (MS) bacterial metabolic data.
Affiliation(s)
- Quan Gu
- MRC-University of Glasgow Centre for Virus Research, University of Glasgow, Garscube Estate, Glasgow G61 1QH, UK
| | - Kirill Veselkov
- Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, Sir Alexander Fleming Building, Exhibition Road, South Kensington, London SW7 2AZ, UK.
| |
|
32
|
Abd Elaziz ME. Simultaneous feature extraction and selection of microarray data using fuzzy-rough based multiobjective nonnegative matrix factorization. Journal of Intelligent & Fuzzy Systems 2017. [DOI: 10.3233/jifs-17954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|
33
|
Scheltens NME, Tijms BM, Koene T, Barkhof F, Teunissen CE, Wolfsgruber S, Wagner M, Kornhuber J, Peters O, Cohn-Sheehy BI, Rabinovici GD, Miller BL, Kramer JH, Scheltens P, van der Flier WM. Cognitive subtypes of probable Alzheimer's disease robustly identified in four cohorts. Alzheimers Dement 2017; 13:1226-1236. [PMID: 28427934 PMCID: PMC5857387 DOI: 10.1016/j.jalz.2017.03.002] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 11/02/2016] [Revised: 03/09/2017] [Accepted: 03/09/2017] [Indexed: 01/25/2023]
Abstract
INTRODUCTION: Patients with Alzheimer's disease (AD) show heterogeneity in profile of cognitive impairment. We aimed to identify cognitive subtypes in four large AD cohorts using a data-driven clustering approach. METHODS: We included probable AD dementia patients from the Amsterdam Dementia Cohort (n = 496), Alzheimer's Disease Neuroimaging Initiative (n = 376), German Dementia Competence Network (n = 521), and University of California, San Francisco (n = 589). Neuropsychological data were clustered using nonnegative matrix factorization. We explored clinical and neurobiological characteristics of identified clusters. RESULTS: In each cohort, a two-cluster solution best fitted the data (cophenetic correlation >0.9): one cluster was memory-impaired and the other relatively memory-spared. Pooled analyses showed that the memory-spared clusters (29%-52% of patients) were younger, more often apolipoprotein E (APOE) ɛ4 negative, and had more severe posterior atrophy compared with the memory-impaired clusters (all P < .05). CONCLUSIONS: We could identify two robust cognitive clusters in four independent large cohorts with distinct clinical characteristics.
Affiliation(s)
- Nienke M. E. Scheltens
- Department of Neurology, Alzheimer Center, Amsterdam Neuroscience, VU University Medical Center, Amsterdam, The Netherlands
| | - Betty M. Tijms
- Department of Neurology, Alzheimer Center, Amsterdam Neuroscience, VU University Medical Center, Amsterdam, The Netherlands
| | - Teddy Koene
- Department of Medical Psychology, VU University Medical Center, Amsterdam, The Netherlands
| | - Frederik Barkhof
- Department of Radiology and Nuclear Medicine, Amsterdam Neuroscience, VU University Medical Center, Amsterdam, The Netherlands
- Institute of Neurology, University College London, London, UK
- Institute of Healthcare Engineering, University College London, London, UK
| | - Charlotte E. Teunissen
- Neurochemistry Laboratory and Biobank, Department of Clinical Chemistry, Amsterdam Neuroscience, VU University Medical Centre, Amsterdam, The Netherlands
| | - Steffen Wolfsgruber
- Department of Psychiatry, University of Bonn, Bonn, Germany
- German Center for Neurodegenerative Diseases, Bonn, Germany
| | - Michael Wagner
- Department of Psychiatry, University of Bonn, Bonn, Germany
- German Center for Neurodegenerative Diseases, Bonn, Germany
| | - Johannes Kornhuber
- Department of Psychiatry, Friedrich-Alexander-University Erlangen, Erlangen, Germany
| | - Oliver Peters
- Department of Psychiatry, Charité Berlin, Campus Benjamin Franklin, Berlin, Germany
| | - Brendan I. Cohn-Sheehy
- Memory and Aging Center, Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - Gil D. Rabinovici
- Memory and Aging Center, Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - Bruce L. Miller
- Memory and Aging Center, Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - Joel H. Kramer
- Memory and Aging Center, Department of Neurology, University of California San Francisco, San Francisco, CA, USA
| | - Philip Scheltens
- Department of Neurology, Alzheimer Center, Amsterdam Neuroscience, VU University Medical Center, Amsterdam, The Netherlands
| | - Wiesje M. van der Flier
- Department of Neurology, Alzheimer Center, Amsterdam Neuroscience, VU University Medical Center, Amsterdam, The Netherlands
- Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands
| | | | | |
|
34
|
Li X, Ma S, Wong KC. Evolving Spatial Clusters of Genomic Regions From High-Throughput Chromatin Conformation Capture Data. IEEE Trans Nanobioscience 2017; 16:400-407. [DOI: 10.1109/tnb.2017.2725991] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/08/2022]
|
35
|
Ray B, Liu W, Fenyö D. Adaptive Multiview Nonnegative Matrix Factorization Algorithm for Integration of Multimodal Biomedical Data. Cancer Inform 2017; 16:1176935117725727. [PMID: 28835735 PMCID: PMC5564898 DOI: 10.1177/1176935117725727] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 03/23/2017] [Accepted: 07/08/2017] [Indexed: 11/16/2022] Open
Abstract
The amounts and types of available multimodal tumor data are rapidly increasing, and their integration is critical for fully understanding the underlying cancer biology and personalizing treatment. However, the development of methods for effectively integrating multimodal data in a principled manner is lagging behind our ability to generate the data. In this article, we introduce an extension to a multiview nonnegative matrix factorization algorithm (NNMF) for dimensionality reduction and integration of heterogeneous data types and compare the predictive modeling performance of the method on unimodal and multimodal data. We also present a comparative evaluation of our novel multiview approach and current data integration methods. Our work provides an efficient method to extend an existing dimensionality reduction method. We report rigorous evaluation of the method on large-scale quantitative protein and phosphoprotein tumor data from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) acquired using state-of-the-art liquid chromatography mass spectrometry. Exome sequencing and RNA-Seq data were also available from The Cancer Genome Atlas for the same tumors. For unimodal data, in the case of breast cancer, transcript levels were most predictive of estrogen and progesterone receptor status and copy number variation of human epidermal growth factor receptor 2 status. For ovarian and colon cancers, phosphoprotein and protein levels were most predictive of tumor grade and stage and residual tumor, respectively. When multiview NNMF was applied to multimodal data to predict outcomes, the improvement in performance was not statistically significant overall beyond unimodal data, suggesting that proteomics data may contain more predictive information regarding tumor phenotypes than transcript levels, probably because proteins are the functional gene products and therefore a more direct measurement of the functional state of the tumor. Here, we have applied our proposed approach to multimodal molecular data for tumors, but it is generally applicable to dimensionality reduction and joint analysis of any type of multimodal data.
Affiliation(s)
- Bisakha Ray
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
| | - Wenke Liu
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
| | - David Fenyö
- Institute for Systems Genetics and Department of Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA
| |
|
36
|
Yu G, Yu X, Wang J. Network-aided Bi-Clustering for discovering cancer subtypes. Sci Rep 2017; 7:1046. [PMID: 28432308 PMCID: PMC5430742 DOI: 10.1038/s41598-017-01064-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 10/27/2016] [Accepted: 03/28/2017] [Indexed: 12/18/2022] Open
Abstract
Bi-clustering is a widely used data mining technique for analyzing gene expression data. It simultaneously groups genes and samples of an input gene expression data matrix to discover bi-clusters in which relevant samples exhibit similar gene expression profiles over a subset of genes. The discovered bi-clusters bring insights for categorization of cancer subtypes, gene treatments and others. Most existing bi-clustering approaches can only enumerate bi-clusters with constant values. Gene interaction networks can help to understand the pattern of cancer subtypes, but they are rarely integrated with gene expression data for exploring cancer subtypes. In this paper, we propose a novel method called Network-aided Bi-Clustering (NetBC). NetBC assigns weights to genes based on the structure of the gene interaction network, and it iteratively optimizes the sum-squared residue to obtain the row and column indicative matrices of bi-clusters by matrix factorization. NetBC can efficiently discover not only bi-clusters with constant values, but also bi-clusters with coherent trends. An empirical study on large-scale cancer gene expression datasets demonstrates that NetBC can more accurately discover cancer subtypes than other related algorithms.
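To make the "sum-squared residue" criterion concrete, the sketch below computes the standard residue of a candidate bi-cluster: each entry minus its row mean and column mean within the bi-cluster, plus the overall mean, squared and summed. This is the textbook definition rather than NetBC's network-weighted objective, which the abstract does not spell out. The matrix `expr` and the function name are made up for the example.

```python
def sum_squared_residue(matrix, rows, cols):
    """Sum over the bi-cluster of (a_ij - rowmean_i - colmean_j + overallmean)^2."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_r, n_c = len(rows), len(cols)
    row_means = [sum(r) / n_c for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_r)) / n_r for j in range(n_c)]
    overall = sum(row_means) / n_r
    return sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
               for i in range(n_r) for j in range(n_c))

# Toy expression matrix (genes x samples). The top-left 3x3 block follows an
# additive "coherent trend": each row is the first row plus a constant.
expr = [[1, 2, 3, 9],
        [2, 3, 4, 0],
        [5, 6, 7, 4],
        [8, 1, 0, 3]]

coherent = sum_squared_residue(expr, rows=[0, 1, 2], cols=[0, 1, 2])
whole = sum_squared_residue(expr, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(coherent, whole)
```

The coherent block scores essentially zero even though its values are not constant, which is why a residue-based objective can recover bi-clusters with coherent trends and not only constant ones.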
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Xianxue Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China.
| |
|
37
|
Shao C, Höfer T. Robust classification of single-cell transcriptome data by nonnegative matrix factorization. Bioinformatics 2016; 33:235-242. [DOI: 10.1093/bioinformatics/btw607] [Citation(s) in RCA: 76] [Impact Index Per Article: 9.5] [Received: 03/16/2016] [Revised: 09/15/2016] [Accepted: 09/16/2016] [Indexed: 11/14/2022] Open
|
38
|
Stražar M, Žitnik M, Zupan B, Ule J, Curk T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics 2016; 32:1527-35. [PMID: 26787667 PMCID: PMC4894278 DOI: 10.1093/bioinformatics/btw003] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Received: 07/02/2015] [Accepted: 01/01/2016] [Indexed: 12/15/2022] Open
Abstract
Motivation: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein–RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein–RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. Results: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3’UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein–RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. Availability and implementation: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. Contact: tomaz.curk@fri.uni-lj.si Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin Stražar
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| | - Marinka Žitnik
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| | - Blaž Zupan
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jernej Ule
- Department of Molecular Neuroscience, UCL Institute of Neurology, Queen Square, London WC1N 3BG, UK
| | - Tomaž Curk
- University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, SI 1000, Slovenia
| |
|
39
|
Pontes B, Giráldez R, Aguilar-Ruiz JS. Biclustering on expression data: A review. J Biomed Inform 2015; 57:163-80. [PMID: 26160444 DOI: 10.1016/j.jbi.2015.06.028] [Citation(s) in RCA: 165] [Impact Index Per Article: 18.3] [Received: 01/22/2015] [Revised: 06/22/2015] [Accepted: 06/30/2015] [Indexed: 11/28/2022]
Abstract
Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts which guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics on which they are based.
Affiliation(s)
- Beatriz Pontes
- Department of Languages and Computer Systems, University of Seville, Seville, Spain.
| | - Raúl Giráldez
- School of Engineering, Pablo de Olavide University, Seville, Spain.
| | | |
|
40
|
Mejía-Roa E, Tabas-Madrid D, Setoain J, García C, Tirado F, Pascual-Montano A. NMF-mGPU: non-negative matrix factorization on multi-GPU systems. BMC Bioinformatics 2015; 16:43. [PMID: 25887585 PMCID: PMC4339678 DOI: 10.1186/s12859-015-0485-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 09/16/2014] [Accepted: 01/30/2015] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units (GPUs). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. RESULTS NMF-mGPU is based on CUDA (Compute Unified Device Architecture), NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are transferred blockwise from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI (Message Passing Interface). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). CONCLUSIONS Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians who are trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming on a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu.
Affiliation(s)
- Edgardo Mejía-Roa
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain.
- Daniel Tabas-Madrid
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain.
- Javier Setoain
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain.
- Carlos García
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain.
- Francisco Tirado
  - ArTeCS Group, Department of Computer Architecture, Complutense University of Madrid (UCM), Madrid, 28040, Spain.
- Alberto Pascual-Montano
  - Functional Bioinformatics Group, Biocomputing Unit, National Center for Biotechnology-CSIC, UAM, Madrid, 28049, Spain.
|
41
|
|
42
|
|
43
|
Abstract
BACKGROUND Identifying modules from time series biological data helps us understand biological functionalities of a group of proteins/genes interacting together and how responses of these proteins/genes dynamically change with respect to time. With rapid acquisition of time series biological data from different laboratories or databases, new challenges are posed for the identification task, and powerful methods that are able to detect modules with integrative analysis are urgently called for. To accomplish such integrative analysis, we assemble multiple time series biological data into a higher-order form, e.g., a gene × condition × time tensor. It is interesting and useful to develop methods to identify modules from this tensor. RESULTS In this paper, we present MultiFacTV, a new method to find modules from higher-order time series biological data. This method employs a tensor factorization objective function where a time-related total variation regularization term is incorporated. According to factorization results, MultiFacTV extracts modules that are composed of some genes, conditions and time-points. We have performed MultiFacTV on synthetic datasets and the results have shown that MultiFacTV outperforms existing methods EDISA and Metafac. Moreover, we have applied MultiFacTV to an Arabidopsis thaliana root (shoot) tissue dataset represented as a gene × condition × time tensor of size 2395 × 9 × 6 (3454 × 8 × 6), and to a Yeast dataset and a Homo sapiens dataset represented as tensors of sizes 4425 × 6 × 6 and 2920 × 14 × 9 respectively. The results have shown that MultiFacTV indeed identifies some interesting modules in these datasets, which have been validated and explained by Gene Ontology analysis with DAVID or other analysis. CONCLUSION Experimental results on both synthetic datasets and real datasets show that the proposed MultiFacTV is effective in identifying modules for higher-order time series biological data. It provides, compared to traditional non-integrative analysis methods, a more comprehensive and better view of biological processes, since modules composed of more than two types of biological variables can be identified and analyzed.
Affiliation(s)
- Xutao Li
  - Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
  - Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
- Yunming Ye
  - Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
  - Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
- Michael Ng
  - Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
- Qingyao Wu
  - Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, China
  - Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, 518055, China
|
44
|
Liao R, Zhang Y, Guan J, Zhou S. CloudNMF: a MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets. Genomics Proteomics Bioinformatics 2013; 12:48-51. [PMID: 23933456 PMCID: PMC4411332 DOI: 10.1016/j.gpb.2013.06.001] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 05/23/2013] [Revised: 06/21/2013] [Accepted: 06/26/2013] [Indexed: 12/03/2022]
Abstract
In the past decades, advances in high-throughput technologies have led to the generation of huge amounts of biological data that require analysis and interpretation. Recently, nonnegative matrix factorization (NMF) has been introduced as an efficient way to reduce the complexity of data as well as to interpret them, and has been applied to various fields of biological research. In this paper, we present CloudNMF, a distributed open-source implementation of NMF on a MapReduce framework. Experimental evaluation demonstrated that CloudNMF is scalable and can be used to deal with huge amounts of data, which may enable various kinds of high-throughput biological data analysis in the cloud. CloudNMF is freely accessible at http://admis.fudan.edu.cn/projects/CloudNMF.html.
Affiliation(s)
- Ruiqi Liao
  - School of Computer Science, Fudan University, Shanghai 200433, China
- Yifan Zhang
  - School of Computer Science, Fudan University, Shanghai 200433, China
- Jihong Guan
  - Department of Computer Science and Technology, Tongji University, Shanghai 200092, China
- Shuigeng Zhou
  - School of Computer Science, Fudan University, Shanghai 200433, China; Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China.
|
45
|
Chen HC, Zou W, Tien YJ, Chen JJ. Identification of bicluster regions in a binary matrix and its applications. PLoS One 2013; 8:e71680. [PMID: 23940779 PMCID: PMC3733970 DOI: 10.1371/journal.pone.0071680] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/09/2012] [Accepted: 07/09/2013] [Indexed: 11/18/2022] Open
Abstract
Biclustering has emerged as an important approach to the analysis of large-scale datasets. A biclustering technique identifies a subset of rows that exhibit similar patterns on a subset of columns in a data matrix. Many biclustering methods have been proposed, and most, if not all, algorithms are developed to detect regions of "coherence" patterns. These methods perform unsatisfactorily if the purpose is to identify biclusters of a constant level. This paper presents a two-step biclustering method to identify constant level biclusters for binary or quantitative data. This algorithm identifies the maximal dimensional submatrix such that the proportion of non-signals is less than a pre-specified tolerance δ. In the analysis of two synthetic datasets, the proposed method has much higher sensitivity and slightly lower specificity than several prominent biclustering methods. It was further compared with the Bimax method for two real datasets. The proposed method was shown to be the most robust in terms of sensitivity, number of biclusters and number of serotype-specific biclusters identified. However, dichotomization using different signal level thresholds usually leads to different sets of biclusters; this also occurs in the present analysis.
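The tolerance criterion described in this abstract can be illustrated with a minimal sketch. It only checks whether a given candidate submatrix of a binary matrix satisfies the condition that its proportion of non-signals (zeros) is below δ; it is not the authors' two-step search algorithm. The function names and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def non_signal_proportion(B, rows, cols):
    """Fraction of zero entries in the submatrix of binary matrix B
    selected by the given row and column indices."""
    sub = B[np.ix_(rows, cols)]
    return 1.0 - float(sub.mean())

def satisfies_tolerance(B, rows, cols, delta):
    """True if the candidate submatrix has a non-signal proportion
    strictly below the tolerance delta (hypothetical helper)."""
    return non_signal_proportion(B, rows, cols) < delta

# Toy binary matrix (illustrative data, not from the paper).
B = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
])
rows, cols = [0, 1, 2], [0, 1, 2]

# One zero among the nine selected cells.
prop = non_signal_proportion(B, rows, cols)
accepted_loose = satisfies_tolerance(B, rows, cols, delta=0.2)
accepted_strict = satisfies_tolerance(B, rows, cols, delta=0.1)
```

With this toy matrix the candidate passes a tolerance of 0.2 but fails a tolerance of 0.1, which shows how the choice of δ controls how many non-signals a constant-level bicluster may contain.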
Affiliation(s)
- Hung-Chia Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
  - Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
- Wen Zou
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
- Yin-Jing Tien
  - Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- James J. Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, United States of America
  - Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
|
46
|
Lai Y, Hayashida M, Akutsu T. Survival analysis by penalized regression and matrix factorization. ScientificWorldJournal 2013; 2013:632030. [PMID: 23737722 PMCID: PMC3655687 DOI: 10.1155/2013/632030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/17/2013] [Accepted: 04/03/2013] [Indexed: 11/18/2022] Open
Abstract
Because every disease has its unique survival pattern, it is necessary to find a suitable model to simulate follow-ups. DNA microarray is a useful technique to measure the expression of thousands of genes at one time and is usually employed to classify different types of cancer. We propose combination methods of penalized regression models and nonnegative matrix factorization (NMF) for predicting survival. We tried L1- (lasso), L2- (ridge), and L1-L2 combined (elastic net) penalized regression for diffuse large B-cell lymphoma (DLBCL) patients' microarray data and found that the L1-L2 combined method predicts survival best, with the smallest logrank P value. Furthermore, 80% of selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be clearly divided into 4 groups, which implies that DLBCL may have 4 subtypes with slightly different survival patterns. Next we excluded some patients whom NMF indicated as hard to classify and executed the three penalized regression models again. We found that the performance of survival prediction improved, with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
Affiliation(s)
- Yeuntyng Lai
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Morihiro Hayashida
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
|
47
|
Li Y, Ngom A. The non-negative matrix factorization toolbox for biological data mining. Source Code Biol Med 2013; 8:10. [PMID: 23591137 PMCID: PMC3736608 DOI: 10.1186/1751-0473-8-10] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Received: 11/30/2012] [Accepted: 04/10/2013] [Indexed: 01/06/2023]
Abstract
Background Non-negative matrix factorization (NMF) has been introduced as an important method for mining biological data. Though packages implemented in R and other programming languages currently exist, they either provide only a few optimization algorithms or focus on a specific application field. There does not exist a complete NMF package for the bioinformatics community to perform various data mining tasks on biological data. Results We provide a convenient MATLAB toolbox containing both the implementations of various NMF techniques and a variety of NMF-based data mining approaches for analyzing biological data. Data mining approaches implemented within the toolbox include data clustering and bi-clustering, feature extraction and selection, sample classification, missing values imputation, data visualization, and statistical comparison. Conclusions A series of analyses such as molecular pattern discovery, biological process identification, dimension reduction, disease prediction, visualization, and statistical comparison can be performed using this toolbox.
Affiliation(s)
- Yifeng Li
  - School of Computer Science, University of Windsor, Windsor, Ontario, Canada.
|
48
|
Wang JJY, Wang X, Gao X. Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics 2013; 14:107. [PMID: 23522344 PMCID: PMC3659102 DOI: 10.1186/1471-2105-14-107] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.1] [Received: 02/16/2012] [Accepted: 03/08/2013] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND Non-negative matrix factorization (NMF) has been shown to be a powerful tool for clustering gene expression data, which are widely used to classify cancers. NMF aims to find two non-negative matrices whose product closely approximates the original matrix. Traditional NMF methods minimize either the l2 norm or the Kullback-Leibler distance between the product of the two matrices and the original matrix. Correntropy was recently shown to be an effective similarity measurement due to its stability to outliers or noise. RESULTS We propose a maximum correntropy criterion (MCC)-based NMF method (NMF-MCC) for gene expression data-based cancer clustering. Instead of minimizing the l2 norm or the Kullback-Leibler distance, NMF-MCC maximizes the correntropy between the product of the two matrices and the original matrix. The optimization problem can be solved by an expectation conditional maximization algorithm. CONCLUSIONS Extensive experiments on six cancer benchmark sets demonstrate that the proposed method is significantly more accurate than the state-of-the-art methods in cancer clustering.
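The traditional formulation summarized in this abstract (two non-negative matrices whose product approximates the original matrix under the l2 norm) can be sketched as follows. This is a minimal illustration using the well-known Lee-Seung multiplicative updates, not the NMF-MCC method proposed in the paper. The function names, the random toy matrix, and the rule of assigning each sample to its largest-coefficient component are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix V (features x samples) as W @ H
    using Lee-Seung multiplicative updates for the squared l2 loss."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def cluster_samples(H):
    """Assign each sample (column) to the component with the largest
    coefficient; a common heuristic, assumed here for illustration."""
    return np.argmax(H, axis=0)

# Toy non-negative "expression" matrix: 50 features x 12 samples.
rng = np.random.default_rng(1)
V = rng.random((50, 12))

W, H = nmf_multiplicative(V, rank=3)
labels = cluster_samples(H)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Replacing the l2 objective with the Kullback-Leibler divergence or with correntropy changes the update rules but not the overall structure of alternating updates on W and H.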
Affiliation(s)
- Jim Jing-Yan Wang
  - Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
|
49
|
Wang YK, Print CG, Crampin EJ. Biclustering reveals breast cancer tumour subgroups with common clinical features and improves prediction of disease recurrence. BMC Genomics 2013; 14:102. [PMID: 23405961 PMCID: PMC3598775 DOI: 10.1186/1471-2164-14-102] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 06/06/2012] [Accepted: 02/05/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many studies have revealed correlations between breast tumour phenotypes, variations in gene expression, and patient survival outcomes. The molecular heterogeneity between breast tumours revealed by these studies has allowed prediction of prognosis and has underpinned stratified therapy, where groups of patients with particular tumour types receive specific treatments. The molecular tests used to predict prognosis and stratify treatment usually utilise fixed sets of genomic biomarkers, with the same biomarker sets being used to test all patients. In this paper we suggest that instead of fixed sets of genomic biomarkers, it may be more effective to use a stratified biomarker approach, where optimal biomarker sets are automatically chosen for particular patient groups, analogous to the choice of optimal treatments for groups of similar patients in stratified therapy. We illustrate the effectiveness of a biclustering approach to select optimal gene sets for determining the prognosis of specific strata of patients, based on potentially overlapping, non-discrete molecular characteristics of tumours. RESULTS Biclustering identified tightly co-expressed gene sets in the tumours of restricted subgroups of breast cancer patients. The co-expressed genes in these biclusters were significantly enriched for particular biological annotations and gene regulatory modules associated with breast cancer biology. Tumours identified within the same bicluster were more likely to present with similar clinical features. Bicluster membership combined with clinical information could predict patient prognosis in conditional inference tree and ridge regression class prediction models. CONCLUSIONS The increasing clinical use of genomic profiling demands identification of more effective methods to segregate patients into prognostic and treatment groups. We have shown that biclustering can be used to select optimal gene sets for determining the prognosis of specific strata of patients.
Affiliation(s)
- Yi Kan Wang
  - Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
- Cristin G Print
  - Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
  - New Zealand Bioinformatics Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
- Edmund J Crampin
  - Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
  - Department of Engineering Science, University of Auckland, Auckland, New Zealand
  - Melbourne School of Engineering, University of Melbourne, Victoria, Australia
|
50
|
Chen HC, Tsong Y, Chen JJ. Data Mining for Signal Detection of Adverse Event Safety Data. J Biopharm Stat 2013; 23:146-60. [DOI: 10.1080/10543406.2013.735780] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
Affiliation(s)
- Hung-Chia Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, USA
  - Graduate Institute of Biostatistics and Biostatistics Center, China Medical University, Taichung, Taiwan
- Yi Tsong
  - Office of Biostatistics, DB6, Center for Drug Evaluation Research, U.S. Food and Drug Administration, Silver Spring, Maryland, USA
- James J. Chen
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas, USA
|