1
Qin Z, Yang L, Gao F, Hu Q, Shen C. Uncertainty-Aware Aggregation for Federated Open Set Domain Adaptation. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:7548-7562. [PMID: 36306293 DOI: 10.1109/tnnls.2022.3214930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Open set domain adaptation (OSDA) methods have been proposed to leverage the difference between the source and target domains, as well as to recognize the known and unknown classes in the target domain. Such methods typically require the entire source and target data simultaneously to train the target model. However, in real scenarios, data are distributed and stored across various clients and cannot be exchanged among them because of privacy protection. Federated learning (FL) is a decentralized approach for training an effective global model with the training data distributed among the clients. Despite its potential in addressing the privacy concerns of data sharing, FL methods for OSDA that can handle unknown classes are not yet available. To tackle this problem, we have developed a novel federated OSDA (FOSDA) algorithm. More specifically, FOSDA adopts an uncertainty-aware mechanism to generate a global model from all client models. It reduces the uncertainty of the federated aggregation by focusing on the contribution of source clients with high uncertainty while retaining those with high consistency. Moreover, a federated class-based weighted strategy is also implemented in FOSDA to maintain the category information of the source clients. We have conducted comprehensive experiments on three benchmark datasets to evaluate the performance of the proposed method, and the results demonstrate the effectiveness of FOSDA.
2
Meng J, Jiang A, Lu X, Gu D, Ge Q, Bai S, Zhou Y, Zhou J, Hao Z, Yan F, Wang L, Wang H, Du J, Liang C. Multiomics characterization and verification of clear cell renal cell carcinoma molecular subtypes to guide precise chemotherapy and immunotherapy. iMeta 2023; 2:e147. [PMID: 38868222 PMCID: PMC10989995 DOI: 10.1002/imt2.147] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/05/2023] [Accepted: 10/21/2023] [Indexed: 06/14/2024]
Abstract
Clear cell renal cell carcinoma (ccRCC) is a heterogeneous tumor with different genetic and molecular alterations. Multiomics-based classification schemes for ccRCC are urgently needed to promote further biological insights. Two hundred and fifty-five ccRCC patients with paired data of clinical information, transcriptome expression profiles, copy number alterations, DNA methylation, and somatic mutations were collected for identification. Bioinformatic analyses were performed based on our team's recently developed R package "MOVICS." With 10 state-of-the-art algorithms, we identified the multiomics subtypes (MoSs) for ccRCC patients. MoS1 is an immune-exhausted subtype that presented the poorest prognosis, which might be caused by an exhausted immune microenvironment and activated hypoxia features, but it can benefit from PI3K/AKT inhibitors. MoS2 is an immune "cold" subtype, which showed more mutations of VHL and PBRM1 and a favorable prognosis, and is more suitable for sunitinib therapy. MoS3 is the immune "hot" subtype and can benefit from anti-PD-1 immunotherapy. We successfully verified the different molecular features of the three MoSs in external cohorts GSE22541, GSE40435, and GSE53573. Data from patients who received nivolumab therapy confirmed that MoS3 is suitable for anti-PD-1 therapy. The E-MTAB-3267 cohort also supported the finding that MoS2 patients respond better to sunitinib treatment. We also confirmed that SETD2 is a tumor suppressor in ccRCC: the SETD2 protein level is decreased in advanced tumor stages, and knock-down of SETD2 promotes cell proliferation, migration, and invasion. In summary, we provide novel insights into ccRCC molecular subtypes based on robust clustering algorithms via multiomics data, and encourage future precise treatment of ccRCC patients.
Affiliation(s)
- Jialin Meng
- Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
- Aimin Jiang
- Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
- Xiaofan Lu
- Department of Cancer and Functional Genomics, Institute of Genetics and Molecular and Cellular Biology, CNRS/INSERM/UNISTRA, Illkirch, France
- Di Gu
- Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
- Qintao Ge
- Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
- Suwen Bai
- The Second Affiliated Hospital, School of Medicine, The Chinese University of Hong Kong, Shenzhen & Longgang District People's Hospital of Shenzhen, Shenzhen, China
- Yundong Zhou
- Department of Surgery, Ningbo Medical Center Lihuili Hospital, Ningbo University, Ningbo, Zhejiang, China
- Jun Zhou
- Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
- Zongyao Hao
- Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
- Fangrong Yan
- Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing, China
- Linhui Wang
- Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
- Haitao Wang
- Cancer Center, Faculty of Health Sciences, University of Macau, Macau SAR, China
- Present address: Center for Cancer Research, Bethesda, Maryland, USA
- Juan Du
- The Second Affiliated Hospital, School of Medicine, The Chinese University of Hong Kong, Shenzhen & Longgang District People's Hospital of Shenzhen, Shenzhen, China
- Chaozhao Liang
- Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
3
Using a national level cross-sectional study to develop a Hospital Preparedness Index (HOSPI) for Covid-19 management: A case study from India. PLoS One 2022; 17:e0269842. [PMID: 35895724 PMCID: PMC9328545 DOI: 10.1371/journal.pone.0269842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2021] [Accepted: 05/29/2022] [Indexed: 11/26/2022]
Abstract
Background: We developed a composite index, the hospital preparedness index (HOSPI), to gauge the preparedness of hospitals in India to deal with the COVID-19 pandemic. Methods: We developed and validated a comprehensive survey questionnaire containing 63 questions, out of which 16 critical items were identified and classified under 5 domains: staff preparedness, effects of COVID-19, protective gear, infrastructure, and future planning. Hospitals empaneled under Ayushman Bharat Yojana (ABY) were invited to the survey. The responses were analyzed using weighted negative log likelihood scores for the options. The preparedness of hospitals was ranked after averaging the scores state-wise and district-wise in select states. HOSPI scores for states were classified using K-means clustering. Findings: Out of 20,202 hospitals empaneled in ABY included in the study, a total of 954 hospitals responded to the questionnaire by July 2020. Domains 1, 2, and 4 contributed the most to the index. The overall preparedness was identified as the best in Goa, and 12 states/UTs had scores above the national average score. Among the states which experienced high COVID-19 cases during the first pandemic wave, we identified a cluster of states with high HOSPI scores indicating better preparedness (Maharashtra, Tamil Nadu, Karnataka, Uttar Pradesh and Andhra Pradesh), and a cluster with low HOSPI scores indicating poor preparedness (Chhattisgarh, Delhi, Uttarakhand). Interpretation: Using this index, it is possible to identify areas for targeted improvement of hospital and staff preparedness to deal with the COVID-19 crisis.
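The abstract states that state-level HOSPI scores were grouped with K-means clustering. A minimal one-dimensional K-means sketch is given here for orientation only; the function name `kmeans_1d`, the initialisation rule, the toy scores, and the choice of two clusters are all illustrative assumptions, not values or code from the study.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalar scores; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed deterministic initialisation: spread the starting centroids
    # evenly over the sorted scores (requires k <= n).
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each score joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Hypothetical preparedness scores for six regions (not data from the paper).
toy_scores = [0.82, 0.79, 0.85, 0.41, 0.38, 0.44]
toy_centroids, toy_labels = kmeans_1d(toy_scores, k=2)
```

With clearly separated toy scores such as these, the two groups correspond to a "high" and a "low" preparedness cluster; real scores would need a justified choice of the number of clusters.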
4
Petrillo UF, Palini F, Cattaneo G, Giancarlo R. Alignment-free Genomic Analysis via a Big Data Spark Platform. Bioinformatics 2021; 37:1658-1665. [PMID: 33471066 DOI: 10.1093/bioinformatics/btab014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/22/2020] [Revised: 12/28/2020] [Accepted: 01/06/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort in distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
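For readers unfamiliar with AF functions, the sketch below shows two generic k-mer based dissimilarities on a single machine: the Jaccard distance between k-mer sets (the quantity that MinHash-based tools such as Mash estimate) and the Euclidean distance between k-mer frequency profiles. It is not FADE's code or API; the function names are hypothetical, and both sequences are assumed to be at least k characters long.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_distance(a, b, k):
    """1 - |A & B| / |A | B| over the k-mer sets of the two sequences."""
    set_a, set_b = set(kmer_counts(a, k)), set(kmer_counts(b, k))
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def euclidean_distance(a, b, k):
    """Euclidean distance between relative k-mer frequency vectors."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    kmers = set(counts_a) | set(counts_b)
    # Counter returns 0 for k-mers absent from one of the sequences.
    return sqrt(sum((counts_a[w] / total_a - counts_b[w] / total_b) ** 2
                    for w in kmers))
```

Neither function aligns the sequences; both only compare k-mer content, which is why such measures can be computed independently per sequence and lend themselves to distributed execution.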
Affiliation(s)
- Francesco Palini
- Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, 00185, Italy
- Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano (SA), 84084, Italy
- Raffaele Giancarlo
- Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
5
Guttula PK, Gupta MK. Examining the co-expression, transcriptome clustering and variation using fuzzy cluster network of testicular stem cells and pluripotent stem cells compared with other cell types. Comput Biol Chem 2020; 85:107227. [PMID: 32044562 DOI: 10.1016/j.compbiolchem.2020.107227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2019] [Revised: 10/10/2019] [Accepted: 01/31/2020] [Indexed: 10/25/2022]
Abstract
Stem cells are crucial in the fields of tissue regeneration and developmental biology. Embryonic stem cells (ESCs), which are pluripotent in nature, are derived from the inner cell mass of the blastocyst. The gene expression profiles of ESCs and induced pluripotent stem cells (iPSCs) were compared to identify the differences. Spermatogonial stem cells (SSCs), also known as germ-line stem cells (GSCs), are present in the testis and are capable of producing sperm throughout life. They can therefore be reprogrammed into pluripotent cells called male germline pluripotent cells (gPSCs). It is very difficult to interpret the large genomic data sets available in public databases without high computational facilities. To identify similar groups, we studied the co-expression, clustering, and variation of the transcriptome of GSCs, gPSCs, ESCs and other cell types using fuzzy clustering with AutoSOME. The series matrix file with GSE ID GSE11274 was retrieved and subjected to various normalization methods, and the corresponding rows and columns were clustered using p values, ensemble runs, and different running modes. Transcriptome analysis using the proposed approach intuitively and consistently characterized the cell-to-cell variation. Collectively, our results suggest that the GSCs and the ESCs displayed differential gene expression profiles, and the GSCs possessed the potential to acquire pluripotency based on the high expression of epigenetic factors and transcription factors. These data may provide novel insights into the reprogramming mechanism of GSCs.
Affiliation(s)
- Praveen Kumar Guttula
- Gene Manipulation Laboratory, Department of Biotechnology and Medical Engineering, National Institute of Technology, Rourkela, 769008, India
- Mukesh Kumar Gupta
- Gene Manipulation Laboratory, Department of Biotechnology and Medical Engineering, National Institute of Technology, Rourkela, 769008, India
6
Li X, LeBlanc J, Elashoff D, McHardy I, Tong M, Roth B, Ippoliti A, Barron G, McGovern D, McDonald K, Newberry R, Graeber T, Horvath S, Goodglick L, Braun J. Microgeographic Proteomic Networks of the Human Colonic Mucosa and Their Association With Inflammatory Bowel Disease. Cell Mol Gastroenterol Hepatol 2016; 2:567-583. [PMID: 28174738 PMCID: PMC5042708 DOI: 10.1016/j.jcmgh.2016.05.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/20/2016] [Accepted: 05/06/2016] [Indexed: 12/28/2022]
Abstract
BACKGROUND & AIMS: Interactions between mucosal cell types, environmental stressors, and intestinal microbiota contribute to pathogenesis in inflammatory bowel disease (IBD). Here, we applied metaproteomics of the mucosal-luminal interface to study the disease-related biology of the human colonic mucosa. METHODS: We recruited a discovery cohort of 51 IBD and non-IBD subjects endoscopically sampled by mucosal lavage at 6 colonic regions, and a validation cohort of 38 non-IBD subjects. Metaproteome data sets were produced for each sample and analyzed for association with colonic site and disease state using a suite of bioinformatic approaches. Localization of select proteins was determined by immunoblot analysis and immunohistochemistry of human endoscopic biopsy samples. RESULTS: Co-occurrence analysis of the discovery cohort metaproteome showed that proteins at the mucosal surface clustered into modules with evidence of differential functional specialization (eg, iron regulation, microbial defense) and cellular origin (eg, epithelial or hemopoietic). These modules, validated in an independent cohort, were differentially associated spatially along the gastrointestinal tract, and 7 modules were associated selectively with non-IBD, ulcerative colitis, and/or Crohn's disease states. In addition, the detailed composition of certain modules was altered in disease vs healthy states. We confirmed the predicted spatial and disease-associated localization of 28 proteins representing 4 different disease-related modules by immunoblot and immunohistochemistry visualization, with evidence for their distribution as a millimeter-scale microgeographic mosaic. CONCLUSIONS: These findings suggest that the mucosal surface is a microgeographic mosaic of functional networks reflecting the local mucosal ecology, whose compositional differences in disease and healthy samples may provide a unique readout of physiologic and pathologic mucosal states.
Key Words
- ANOVA, analysis of variance
- CD, Crohn’s disease
- Ecology
- HBD, human β-defensin
- HD5, human alpha defensin 5
- HNP, human neutrophil peptide
- HPLC, high-performance liquid chromatography
- IBD, inflammatory bowel disease
- IHC, immunohistochemistry
- Inflammatory Bowel Disease
- MALDI, matrix-assisted laser desorption/ionization
- MFN, mucosal functional network
- MLI, mucosal–luminal interface
- MS/MS, tandem mass spectrometry
- Metaproteomics
- Mucosal
- NLME, nonlinear mixed-effect model
- Networks
- PVCA, principal variance component analysis
- TOF, time of flight
- UC, ulcerative colitis
- WGCNA, weighted correlation network analysis
Affiliation(s)
- Xiaoxiao Li
- Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
- James LeBlanc
- Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- David Elashoff
- Department of Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Ian McHardy
- Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Maomeng Tong
- Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Bennett Roth
- Department of Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Andrew Ippoliti
- Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Gildardo Barron
- Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Dermot McGovern
- Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Keely McDonald
- Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri
- Rodney Newberry
- Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri
- Thomas Graeber
- Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Steve Horvath
- Department of Human Genetics and Biostatistics, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Lee Goodglick
- Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
- Jonathan Braun
- Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California. Correspondence: Jonathan Braun, MD, PhD, Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California 90095. Fax: (310) 267-4486.
7
Bai L, Wang F, Zhang DS, Li C, Jin Y, Wang DS, Chen DL, Qiu MZ, Luo HY, Wang ZQ, Li YH, Wang FH, Xu RH. A plasma cytokine and angiogenic factor (CAF) analysis for selection of bevacizumab therapy in patients with metastatic colorectal cancer. Sci Rep 2015; 5:17717. [PMID: 26620439 PMCID: PMC4664961 DOI: 10.1038/srep17717] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 04/27/2015] [Accepted: 11/04/2015] [Indexed: 01/09/2023]
Abstract
This study intends to identify biomarkers that could refine the selection of patients with metastatic colorectal cancer (mCRC) for bevacizumab treatment. Pretreatment levels of 36 plasma cytokines and angiogenic factors (CAFs) were first measured by protein microarray analysis in patients who received first-line bevacizumab-containing therapies (discovery cohort, n = 64), and further evaluated by enzyme-linked immunosorbent assay in patients treated on regimens with or without bevacizumab (validation cohort, n = 186). Factor levels were correlated with clinical outcomes, and predictive values were assessed using a treatment-by-marker interaction term in the Cox model. Patients with lower pretreatment levels of hepatocyte growth factor (HGF) or VEGF-A121 gained much more benefit from bevacizumab treatment as measured by progression-free survival (PFS) and overall survival (OS), while angiopoietin-like 4 (ANGPTL4) levels negatively correlated with PFS and response rate following bevacizumab (all adjusted interaction P < 0.05). A baseline CAF signature combining these three markers had greater predictive ability than individual markers. Signature-negative patients showed impaired survival following bevacizumab treatment (PFS, 7.3 vs 7.0 months; hazard ratio [HR] 1.03; OS, 29.9 vs 21.1 months, HR 1.33) compared with signature-positive patients (PFS, 6.5 vs 11.9 months, HR 0.52; OS, 28.0 vs 55.3 months, HR 0.67). These promising results warrant further prospective studies.
Affiliation(s)
- Long Bai
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Feng Wang
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Dong-Sheng Zhang
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Cong Li
- Department of Medical Oncology, Zhejiang Cancer Hospital, Hangzhou 310022, P. R. China
- Ying Jin
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- De-Shen Wang
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Dong-Liang Chen
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Miao-Zhen Qiu
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Hui-Yan Luo
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Zhi-Qiang Wang
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Yu-Hong Li
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Feng-Hua Wang
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
- Rui-Hua Xu
- Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
8
Hu CW, Kornblau SM, Slater JH, Qutub AA. Progeny Clustering: A Method to Identify Biological Phenotypes. Sci Rep 2015; 5:12894. [PMID: 26267476 PMCID: PMC4533525 DOI: 10.1038/srep12894] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 02/04/2015] [Accepted: 07/15/2015] [Indexed: 01/24/2023]
Abstract
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown successful and robust when applied to two synthetic datasets (datasets of two-dimensions and ten-dimensions containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset.
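The co-occurrence probability matrix mentioned in the abstract can be illustrated generically: given several clusterings of the same items, entry (i, j) is the fraction of clusterings in which items i and j share a cluster. The sketch below shows only this matrix; it does not implement the paper's Progeny Sampling step or its stability score, and the function name and toy label vectors are hypothetical.

```python
def cooccurrence_matrix(runs):
    """runs: list of label vectors, one per clustering of the same n items.

    Returns an n x n matrix whose (i, j) entry is the fraction of runs in
    which items i and j received the same cluster label.
    """
    n = len(runs[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    return [[count / len(runs) for count in row] for row in matrix]


# Three hypothetical clusterings of four items; label names are arbitrary,
# only co-membership matters.
toy_runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
toy_matrix = cooccurrence_matrix(toy_runs)
```

A stable clustering yields entries close to 1 for pairs in the same group and close to 0 for pairs in different groups; intermediate values indicate pairs whose assignment changes between runs.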
Affiliation(s)
- Steven M Kornblau
- Departments of Leukemia and Stem Cell Transplant, University of Texas MD Anderson Cancer Center
- John H Slater
- Department of Biomedical Engineering, University of Delaware
9
Giancarlo R, Scaturro D, Utro F. ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis. Computer Methods and Programs in Biomedicine 2015; 118:207-217. [PMID: 25582071 DOI: 10.1016/j.cmpb.2014.12.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/20/2014] [Revised: 10/07/2014] [Accepted: 12/16/2014] [Indexed: 06/04/2023]
Abstract
The prediction of the number of clusters in a dataset, in particular microarrays, is a fundamental task in biological data analysis, usually performed via validation measures. Unfortunately, it has received very little attention and in fact there is a growing need for software tools/libraries dedicated to it. Here we present ValWorkBench, a software library consisting of eleven well known validation measures, together with novel heuristic approximations for some of them. The main objective of this paper is to provide the interested researcher with the full software documentation of an open source cluster validation platform having the main features of being easily extendible in a homogeneous way and of offering software components that can be readily re-used. Consequently, the focus of the presentation is on the architecture of the library, since it provides an essential map that can be used to access the full software documentation, which is available at the supplementary material website [1]. The mentioned main features of ValWorkBench are also discussed and exemplified, with emphasis on software abstraction design and re-usability. A comparison with existing cluster validation software libraries, mainly in terms of the mentioned features, is also offered. It suggests that ValWorkBench is a much needed contribution to the microarray software development/algorithm engineering community. For completeness, it is important to mention that previous accurate algorithmic experimental analysis of the relative merits of each of the implemented measures [19,23,25], carried out specifically on microarray data, gives useful insights on the effectiveness of ValWorkBench for cluster validation to researchers in the microarray community interested in its use for the mentioned task.
Affiliation(s)
- R Giancarlo
- Dipartimento di Matematica ed Informatica, University of Palermo, Italy
- D Scaturro
- Dipartimento di Matematica ed Informatica, University of Palermo, Italy
- F Utro
- Computational Biology Center, IBM T.J. Watson Research, Yorktown Heights, NY 10598, USA
10
Fa R, Nandi AK. Noise Resistant Generalized Parametric Validity Index of Clustering for Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:741-752. [PMID: 26356344 DOI: 10.1109/tcbb.2014.2312006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
Validity indices have been investigated for decades. However, since there is no study of noise-resistance performance of these indices in the literature, there is no guideline for determining the best clustering in noisy data sets, especially microarray data sets. In this paper, we propose a generalized parametric validity (GPV) index which employs two tunable parameters α and β to control the proportions of objects being considered to calculate the dissimilarities. The greatest advantage of the proposed GPV index is its noise-resistance ability, which results from the flexibility of tuning the parameters. Several rules are set to guide the selection of parameter values. To illustrate the noise-resistance performance of the proposed index, we evaluate the GPV index for assessing five clustering algorithms in two gene expression data simulation models with different noise levels and compare the ability of determining the number of clusters with eight existing indices. We also test the GPV in three groups of real gene expression data sets. The experimental results suggest that the proposed GPV index has superior noise-resistance ability and provides fairly accurate judgements.
|
11
|
Mitchell L, Sloan TM, Mewissen M, Ghazal P, Forster T, Piotrowski M, Trew A. Parallel classification and feature selection in microarray data using SPRINT. CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE 2014; 26:854-865. [PMID: 24883047 PMCID: PMC4038771 DOI: 10.1002/cpe.2928] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/03/2023]
Abstract
The statistical language R is favoured by many biostatisticians for processing microarray data. In recent times, the quantity of data that can be obtained in experiments has risen significantly, making previously fast analyses time consuming or even not possible at all with the existing software infrastructure. High performance computing (HPC) systems offer a solution to these problems but at the expense of increased complexity for the end user. The Simple Parallel R Interface is a library for R that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop-in parallelised replacements of existing R functions. In this paper we describe parallel implementations of two popular techniques: exploratory clustering analyses using the random forest classifier and feature selection through identification of differentially expressed genes using the rank product method.
Affiliation(s)
- Lawrence Mitchell
- EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
| | - Terence M Sloan
- EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
| | - Muriel Mewissen
- Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
| | - Peter Ghazal
- Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
| | - Thorsten Forster
- Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
| | - Michal Piotrowski
- EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
| | - Arthur Trew
- EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
|
12
|
Tong M, Li X, Wegener Parfrey L, Roth B, Ippoliti A, Wei B, Borneman J, McGovern DPB, Frank DN, Li E, Horvath S, Knight R, Braun J. A modular organization of the human intestinal mucosal microbiota and its association with inflammatory bowel disease. PLoS One 2013; 8:e80702. [PMID: 24260458 PMCID: PMC3834335 DOI: 10.1371/journal.pone.0080702] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Received: 08/21/2013] [Accepted: 10/07/2013] [Indexed: 02/08/2023] Open
Abstract
Abnormalities of the intestinal microbiota are implicated in the pathogenesis of Crohn's disease (CD) and ulcerative colitis (UC), two spectra of inflammatory bowel disease (IBD). However, the high complexity and low inter-individual overlap of intestinal microbial composition are formidable barriers to identifying microbial taxa representing this dysbiosis. These difficulties might be overcome by an ecologic analytic strategy to identify modules of interacting bacteria (rather than individual bacteria) as quantitative reproducible features of microbial composition in normal and IBD mucosa. We sequenced 16S ribosomal RNA genes from 179 endoscopic lavage samples from different intestinal regions in 64 subjects (32 controls, 16 CD and 16 UC patients in clinical remission). CD and UC patients showed a reduction in phylogenetic diversity and shifts in microbial composition, comparable to previous studies using conventional mucosal biopsies. Analysis of the weighted co-occurrence network revealed 5 microbial modules. These modules were unprecedented, as they were detectable in all individuals, and their composition and abundance were recapitulated in an independent, biopsy-based mucosal dataset. Two modules were associated with healthy, CD, or UC disease states. Imputed metagenome analysis indicated that these modules displayed distinct metabolic functionality, specifically the enrichment of oxidative response and glycan metabolism pathways relevant to host-pathogen interaction in the disease-associated modules. The highly preserved microbial modules accurately classified IBD status of individual patients during disease quiescence, suggesting that microbial dysbiosis in IBD may be an underlying disorder independent of disease activity. Microbial modules thus provide an integrative view of microbial ecology relevant to IBD.
Affiliation(s)
- Maomeng Tong
- Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
| | - Xiaoxiao Li
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
| | - Laura Wegener Parfrey
- Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America
| | - Bennett Roth
- Department of Medicine, Division of Digestive Disease, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
| | - Andrew Ippoliti
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
| | - Bo Wei
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
| | - James Borneman
- Department of Plant Pathology and Microbiology, University of California Riverside, Riverside, California, United States of America
| | - Dermot P. B. McGovern
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
| | - Daniel N. Frank
- Division of Infectious Diseases, University of Colorado, School of Medicine, Aurora, Colorado, United States of America
- Union Council, Denver Microbiome Research Consortium (MiRC), University of Colorado, School of Medicine, Aurora, Colorado, United States of America
| | - Ellen Li
- Department of Medicine, Stony Brook University, Stony Brook, New York, United States of America
| | - Steve Horvath
- Department of Human Genetics and Biostatistics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
| | - Rob Knight
- Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America
- Howard Hughes Medical Institute, University of Colorado, Boulder, Colorado, United States of America
| | - Jonathan Braun
- Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
|
13
|
Budinska E, Popovici V, Tejpar S, D'Ario G, Lapique N, Sikora KO, Di Narzo AF, Yan P, Hodgson JG, Weinrich S, Bosman F, Roth A, Delorenzi M. Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer. J Pathol 2013; 231:63-76. [PMID: 23836465 PMCID: PMC3840702 DOI: 10.1002/path.4212] [Citation(s) in RCA: 294] [Impact Index Per Article: 26.7] [Received: 02/03/2013] [Revised: 05/10/2013] [Accepted: 05/14/2013] [Indexed: 02/06/2023]
Abstract
The recognition that colorectal cancer (CRC) is a heterogeneous disease in terms of clinical behaviour and response to therapy translates into an urgent need for robust molecular disease subclassifiers that can explain this heterogeneity beyond current parameters (MSI, KRAS, BRAF). Attempts to fill this gap are emerging. The Cancer Genome Atlas (TCGA) reported two main CRC groups, based on the incidence and spectrum of mutated genes, and another paper reported a subgroup defined by an EMT expression signature. We performed a prior-free analysis of CRC heterogeneity on 1113 CRC gene expression profiles and compared our findings with established molecular determinants and clinical, histopathological and survival data. Unsupervised clustering based on gene modules allowed us to distinguish at least five different gene expression CRC subtypes, which we call surface crypt-like, lower crypt-like, CIMP-H-like, mesenchymal and mixed. A gene set enrichment analysis combined with literature search of gene module members identified distinct biological motifs in different subtypes. The subtypes, which were not derived based on outcome, nonetheless showed differences in prognosis. Known gene copy number variations and mutations in key cancer-associated genes differed between subtypes, but the subtypes provided molecular information beyond that contained in these variables. Morphological features significantly differed between subtypes. The objective existence of the subtypes and their clinical and molecular characteristics were validated in an independent set of 720 CRC expression profiles. Our subtypes provide a novel perspective on the heterogeneity of CRC. The proposed subtypes should be further explored retrospectively on existing clinical trial datasets and, when sufficiently robust, be prospectively assessed for clinical relevance in terms of prognosis and treatment response predictive capacity.
Original microarray data were uploaded to the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) under Accession Nos E-MTAB-990 and E-MTAB-1026.
Affiliation(s)
- Eva Budinska
- Bioinformatics Core Facility, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
|
14
|
Mavridis L, Nath N, Mitchell JBO. PFClust: a novel parameter free clustering algorithm. BMC Bioinformatics 2013; 14:213. [PMID: 23819480 PMCID: PMC3747858 DOI: 10.1186/1471-2105-14-213] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 01/29/2013] [Accepted: 07/01/2013] [Indexed: 12/02/2022] Open
Abstract
Background We present the algorithm PFClust (Parameter Free Clustering), which is able automatically to cluster data and identify a suitable number of clusters to group them into without requiring any parameters to be specified by the user. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and there are many different hierarchical and partitional, agglomerative and divisive, clustering methodologies available that can be used to do this. Nonetheless, automatically determining the number of clusters present in a dataset constitutes a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters to group the objects into involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. Thus, we test PFClust on datasets for which an external gold standard of ‘correct’ cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings. Results We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies – even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH. 
Conclusions We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.
Affiliation(s)
- Lazaros Mavridis
- Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, KY16 9ST, Scotland, UK.
|
15
|
Giancarlo R, Lo Bosco G, Pinello L, Utro F. A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinformatics 2013; 14 Suppl 1:S6. [PMID: 23369037 PMCID: PMC3548704 DOI: 10.1186/1471-2105-14-s1-s6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research. RESULTS A procedure is proposed for the assessment of the discriminative ability of a distance function. That is, the evaluation of the ability of a distance function to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows to map distances and clustering solutions as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which shows similar performances with respect to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. 
In fact, it quantifies not only the merit of each clustering solution but also the agglomerative or divisive errors due to the algorithm. CONCLUSIONS The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation, since a dataset must have a substantial number of features for the estimate to be reliable. Nevertheless, taking into account such a fact, together with results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, the results show that K-means, Average Link, and Complete Link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the format of a "feature matrix". In particular it can be used to cluster ChIP-seq data.
Affiliation(s)
- Raffaele Giancarlo
- Dipartimento di Matematica ed Informatica, Universitá di Palermo, Via Archirafi 34, 90123 Palermo, Italy
|
16
|
Abstract
Background A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparison of these methods to evaluate their relative effectiveness provides guidance to algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices that have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae. Methods For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the top five scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method. Results Clusters produced by each method were evaluated based upon the positive match to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods.
Conclusions Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.
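The scoring described in the Methods of this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the bin boundaries, and the choice to report one average per size bin for a single parameter setting are all assumptions made for the example.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_annotation_score(cluster, annotation_sets):
    """Highest Jaccard score of one cluster against every annotation set."""
    return max((jaccard(cluster, ann) for ann in annotation_sets), default=0.0)


def top5_average_by_bin(clusters, annotation_sets, size_bins, top_n=5):
    """Average the top_n cluster scores inside each size bin.

    size_bins maps a bin name to an inclusive (min_size, max_size) pair;
    the boundaries are hypothetical and not taken from the paper.
    """
    averages = {}
    for name, (lo, hi) in size_bins.items():
        scores = sorted(
            (best_annotation_score(c, annotation_sets)
             for c in clusters if lo <= len(c) <= hi),
            reverse=True,
        )
        top = scores[:top_n]
        averages[name] = mean(top) if top else 0.0
    return averages


# Toy data: three clusters scored against two annotation sets.
clusters = [{"a", "b"}, {"a", "b", "c"}, {"x", "y", "z", "w"}]
annotations = [{"a", "b"}, {"x", "y"}]
bins = {"small": (1, 3), "large": (4, 10)}
result = top5_average_by_bin(clusters, annotations, bins)
```

Under this reading, the "best" in BAT5 would be the maximum of these averages over the parameter settings tried for a method; that step is left out here because the abstract does not spell it out.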
|
17
|
Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics 2011; 12:436. [PMID: 22070249 PMCID: PMC3262844 DOI: 10.1186/1471-2105-12-436] [Citation(s) in RCA: 414] [Impact Index Per Article: 31.8] [Received: 08/08/2011] [Accepted: 11/09/2011] [Indexed: 12/02/2022] Open
Abstract
Background In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL. Results Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section. 
Conclusions The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.
Affiliation(s)
- John H Morris
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, USA.
|
18
|
Ozcaglar C, Shabbeer A, Vandenberg S, Yener B, Bennett KP. Sublineage structure analysis of Mycobacterium tuberculosis complex strains using multiple-biomarker tensors. BMC Genomics 2011; 12 Suppl 2:S1. [PMID: 21988942 PMCID: PMC3194230 DOI: 10.1186/1471-2164-12-s2-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Strains of Mycobacterium tuberculosis complex (MTBC) can be classified into major lineages based on their genotype. Further subdivision of major lineages into sublineages requires multiple biomarkers along with methods to combine and analyze multiple sources of information in one unsupervised learning model. Typically, spacer oligonucleotide type (spoligotype) and mycobacterial interspersed repetitive units (MIRU) are used for TB genotyping and surveillance. Here, we examine the sublineage structure of MTBC strains with multiple biomarkers simultaneously, by employing a tensor clustering framework (TCF) on multiple-biomarker tensors. RESULTS Simultaneous analysis of the spoligotype and MIRU type of strains using TCF on multiple-biomarker tensors leads to coherent sublineages of major lineages with clear and distinctive spoligotype and MIRU signatures. Comparison of tensor sublineages with SpolDB4 families either supports tensor sublineages, or suggests subdivision or merging of SpolDB4 families. High prediction accuracy of major lineage classification with supervised tensor learning on multiple-biomarker tensors validates our unsupervised analysis of sublineages on multiple-biomarker tensors. CONCLUSIONS TCF on multiple-biomarker tensors achieves simultaneous analysis of multiple biomarkers and suggests a new putative sublineage structure for each major lineage. Analysis of multiple-biomarker tensors gives insight into the sublineage structure of MTBC at the genomic level.
Affiliation(s)
- Cagri Ozcaglar
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA
| | - Amina Shabbeer
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA
| | - Scott Vandenberg
- Computer Science Department, Siena College, Loudonville, NY, USA
| | - Bülent Yener
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA
| | - Kristin P Bennett
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY, USA
- Mathematical Sciences Department, Rensselaer Polytechnic Institute, Troy, NY, USA
|
19
|
Naegle KM, Welsch RE, Yaffe MB, White FM, Lauffenburger DA. MCAM: multiple clustering analysis methodology for deriving hypotheses and insights from high-throughput proteomic datasets. PLoS Comput Biol 2011; 7:e1002119. [PMID: 21799663 PMCID: PMC3140961 DOI: 10.1371/journal.pcbi.1002119] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 02/15/2011] [Accepted: 05/25/2011] [Indexed: 01/22/2023] Open
Abstract
Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology (‘MCAM’) employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression.
Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems. Proteomic measurements, especially modification measurements, are greatly expanding the current knowledge of the state of proteins under various conditions. Harnessing these measurements to understand how these modifications are enzymatically regulated and their subsequent function in cellular signaling and physiology is a challenging new problem. Clustering has been very useful in reducing the dimensionality of many types of high-throughput biological data, as well as inferring function of poorly understood molecular species. However, its implementation requires a great deal of technical expertise since there are a large number of parameters one must decide on in clustering, including data transforms, distance metrics, and algorithms. Previous knowledge of useful parameters does not exist for measurements of a new type. In this work we address two issues. First, we develop a framework that incorporates any number of possible parameters of clustering to produce a suite of clustering solutions. These solutions are then judged on their ability to infer biological information through statistical enrichment of existing biological annotations. Second, we apply this framework to dynamic phosphorylation measurements of the ERBB network, constructing the first extensive analysis of clustering of phosphoproteomic data and generating insight into novel components and novel functions of known components of the ERBB network.
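The combinatorial step described in this abstract, where every data transformation is crossed with every distance metric, set size, and clustering algorithm, can be illustrated with a short sketch. The option names and the helper `clustering_configurations` are hypothetical placeholders chosen for the example; the abstract does not list the actual options used by MCAM, and the enrichment-based evaluation of each resulting clustering is not shown.

```python
from itertools import product


def clustering_configurations(transforms, metrics, set_sizes, algorithms):
    """Enumerate every combination of the four parameter families."""
    return [
        {"transform": t, "metric": m, "set_size": k, "algorithm": a}
        for t, m, k, a in product(transforms, metrics, set_sizes, algorithms)
    ]


# Hypothetical option lists, for illustration only.
configs = clustering_configurations(
    transforms=["none", "log2", "zscore"],
    metrics=["euclidean", "correlation"],
    set_sizes=[5, 10, 20],
    algorithms=["kmeans", "hierarchical"],
)
```

Each dictionary in `configs` would then drive one clustering run, so the size of the suite grows as the product of the option counts (here 3 × 2 × 3 × 2 = 36).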
Affiliation(s)
- Kristen M Naegle
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
20
|
Albaum SP, Hahne H, Otto A, Haußmann U, Becher D, Poetsch A, Goesmann A, Nattkemper TW. A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study. Proteome Sci 2011; 9:30. [PMID: 21663690 PMCID: PMC3142201 DOI: 10.1186/1477-5956-9-30] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 02/09/2011] [Accepted: 06/11/2011] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Mass spectrometry-based proteomics has reached a stage where it is possible to comprehensively analyze the whole proteome of a cell in one experiment. Here, the employment of stable isotopes has become a standard technique to yield relative abundance values of proteins. In recent times, more and more experiments are conducted that depict not only a static image of the up- or down-regulated proteins at a distinct time point but instead compare developmental stages of an organism or varying experimental conditions. RESULTS Although the scientific questions behind these experiments are of course manifold, there are, nevertheless, two questions that commonly arise: 1) which proteins are differentially regulated regarding the selected experimental conditions, and 2) are there groups of proteins that show similar abundance ratios, indicating that they have a similar turnover? We give advice on how these two questions can be answered and comprehensively compare a variety of commonly applied computational methods and their outcomes. CONCLUSIONS This work provides guidance through the jungle of computational methods to analyze mass spectrometry-based isotope-labeled datasets and recommends an effective and easy-to-use evaluation strategy. We demonstrate our approach with three recently published datasets on Bacillus subtilis [1,2] and Corynebacterium glutamicum [3]. Special focus is placed on the application and validation of cluster analysis methods. All applied methods were implemented within the rich internet application QuPE [4]. Results can be found at http://qupe.cebitec.uni-bielefeld.de.
Affiliation(s)
- Stefan P Albaum
- Computational Genomics, Center for Biotechnology (CeBiTec), Bielefeld University, Germany
- Biodata Mining Group, Faculty of Technology, Bielefeld University, Germany
| | - Hannes Hahne
- Chair for Proteomics and Bioanalytics, Center of Life and Food Sciences Weihenstephan, Technische Universität München, Germany
- Institute of Microbiology, Ernst-Moritz-Arndt-University Greifswald, Germany
- Current Address: Chair for Proteomics and Bioanalytics, Center of Life and Food Sciences Weihenstephan, Technische Universität München, Germany
| | - Andreas Otto
- Institute of Microbiology, Ernst-Moritz-Arndt-University Greifswald, Germany
| | - Ute Haußmann
- Plant Biochemistry, Ruhr University Bochum, Germany
| | - Dörte Becher
- Institute of Microbiology, Ernst-Moritz-Arndt-University Greifswald, Germany
| | | | - Alexander Goesmann
- Computational Genomics, Center for Biotechnology (CeBiTec), Bielefeld University, Germany
- Bioinformatics Resource Facility, CeBiTec, Bielefeld University, Germany
| | - Tim W Nattkemper
- Biodata Mining Group, Faculty of Technology, Bielefeld University, Germany
|
21
|
Echenique-Rivera H, Muzzi A, Del Tordello E, Seib KL, Francois P, Rappuoli R, Pizza M, Serruto D. Transcriptome analysis of Neisseria meningitidis in human whole blood and mutagenesis studies identify virulence factors involved in blood survival. PLoS Pathog 2011; 7:e1002027. [PMID: 21589640 PMCID: PMC3088726 DOI: 10.1371/journal.ppat.1002027] [Citation(s) in RCA: 117] [Impact Index Per Article: 9.0] [Received: 10/15/2010] [Accepted: 02/26/2011] [Indexed: 12/14/2022]
Abstract
During infection Neisseria meningitidis (Nm) encounters multiple environments within the host, which makes rapid adaptation a crucial factor for meningococcal survival. Despite the importance of invasion into the bloodstream in the meningococcal disease process, little is known about how Nm adapts to permit survival and growth in blood. To address this, we performed a time-course transcriptome analysis using an ex vivo model of human whole blood infection. We observed that Nm alters the expression of ≈30% of ORFs of the genome and major dynamic changes were observed in the expression of transcriptional regulators, transport and binding proteins, energy metabolism, and surface-exposed virulence factors. In particular, we found that the gene encoding the regulator Fur, as well as all genes encoding iron uptake systems, were significantly up-regulated. Analysis of regulated genes encoding for surface-exposed proteins involved in Nm pathogenesis allowed us to better understand mechanisms used to circumvent host defenses. During blood infection, Nm activates genes encoding for the factor H binding proteins, fHbp and NspA, genes encoding for detoxifying enzymes such as SodC, Kat and AniA, as well as several less characterized surface-exposed proteins that might have a role in blood survival. Through mutagenesis studies of a subset of up-regulated genes we were able to identify new proteins important for survival in human blood and also to identify additional roles of previously known virulence factors in aiding survival in blood. Nm mutant strains lacking the genes encoding the hypothetical protein NMB1483 and the surface-exposed proteins NalP, Mip and NspA, the Fur regulator, the transferrin binding protein TbpB, and the L-lactate permease LctP were sensitive to killing by human blood. 
This increased knowledge of how Nm responds to adaptation in blood could also be helpful to develop diagnostic and therapeutic strategies to control the devastating disease caused by this microorganism.
MESH Headings
- Adaptation, Physiological
- Adult
- Antigens, Bacterial/genetics
- Bacteremia/blood
- Bacteremia/microbiology
- Bacterial Proteins/genetics
- Cluster Analysis
- Down-Regulation/genetics
- Female
- Gene Expression Regulation, Bacterial/genetics
- Genes, Bacterial/genetics
- Genome, Bacterial/genetics
- Host-Pathogen Interactions/genetics
- Humans
- Male
- Meningococcal Infections/blood
- Meningococcal Infections/microbiology
- Models, Biological
- Neisseria meningitidis, Serogroup B/genetics
- Neisseria meningitidis, Serogroup B/growth & development
- Neisseria meningitidis, Serogroup B/pathogenicity
- Neisseria meningitidis, Serogroup B/physiology
- RNA, Bacterial/genetics
- Sequence Deletion
- Transcriptome
- Up-Regulation/genetics
- Virulence Factors/genetics
Affiliation(s)
| | | | | | | | - Patrice Francois
- Genomic Research Laboratory, University of Geneva Hospitals (HUG), Geneva, Switzerland
| | | | | | - Davide Serruto
- Novartis Vaccines and Diagnostics, Siena, Italy
|
22
|
Giancarlo R, Utro F. Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms Mol Biol 2011; 6:1. [PMID: 21235792 PMCID: PMC3035181 DOI: 10.1186/1748-7188-6-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 09/29/2010] [Accepted: 01/14/2011] [Indexed: 11/10/2022]
Abstract
Background The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, the closely related algorithm FC (Fast Consensus) that would have the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. 
Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations. Conclusions In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.
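The consensus matrix mentioned in this abstract can be illustrated with a small sketch. This is not the authors' Consensus or FC implementation: it assumes the common resampling formulation, in which entry (i, j) is the fraction of subsamples containing both items where the two items fall in the same cluster. The helper names (`kmeans`, `consensus_matrix`), the choice of k-means as the base clusterer, and the parameters `n_resamples` and `frac` are all hypothetical.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of coordinate tuples; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Entry (i, j): share of subsamples containing both i and j in which they co-cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        labels = kmeans([points[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Two well-separated toy groups; items in the same group should co-cluster more often.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_consensus = consensus_matrix(toy_points, k=2)
```

Under this formulation, one minus each entry can serve as a dissimilarity for a downstream clustering algorithm, and repeating the computation over a range of k is what makes the procedure expensive enough to motivate a speed-up.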
|
23
|
Freyhult E, Landfors M, Önskog J, Hvidsten TR, Rydén P. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics 2010; 11:503. [PMID: 20937082 PMCID: PMC3098084 DOI: 10.1186/1471-2105-11-503] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 04/12/2010] [Accepted: 10/11/2010] [Indexed: 08/30/2023]
Abstract
BACKGROUND Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can have an effect on the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization. RESULTS We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performances of the different analyses vary between the data sets and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. 
Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further in order to draw any general conclusions. CONCLUSIONS The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.
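The adjusted Rand index used as the evaluation criterion in this abstract can be computed from the contingency table of known classes versus cluster labels. The following is a minimal sketch of the standard Hubert-Arabie form, not code from the paper; the function name `adjusted_rand_index` is chosen here for illustration, and at least two samples are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between known classes and a clustering (needs >= 2 samples)."""
    n = len(labels_true)
    # Pairs of samples that share both the class and the cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share the class, and pairs that share the cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single class and a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Cluster labels are arbitrary names, so a relabelled perfect clustering scores 1.0.
perfect = adjusted_rand_index(["a", "a", "b", "b"], [1, 1, 0, 0])
```

A value of 1 means the clustering reproduces the known classes exactly, values near 0 are what random labelling would give, and negative values indicate agreement worse than chance.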
Affiliation(s)
- Eva Freyhult
- Department of Clinical Microbiology, Division of Clinical Bacteriology, Umeå University, Umeå, Sweden.
|
24
|
Olex AL, Hiltbold EM, Leng X, Fetrow JS. Dynamics of dendritic cell maturation are identified through a novel filtering strategy applied to biological time-course microarray replicates. BMC Immunol 2010; 11:41. [PMID: 20682054 PMCID: PMC2928180 DOI: 10.1186/1471-2172-11-41] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 09/27/2009] [Accepted: 08/03/2010] [Indexed: 01/04/2023]
Abstract
Background Dendritic cells (DC) play a central role in primary immune responses and become potent stimulators of the adaptive immune response after undergoing the critical process of maturation. Understanding the dynamics of DC maturation would provide key insights into this important process. Time course microarray experiments can provide unique insights into DC maturation dynamics. Replicate experiments are necessary to address the issues of experimental and biological variability. Statistical methods and averaging are often used to identify significant signals. Here a novel strategy for filtering of replicate time course microarray data, which identifies consistent signals between the replicates, is presented and applied to a DC time course microarray experiment. Results The temporal dynamics of DC maturation were studied by stimulating DC with poly(I:C) and following gene expression at 5 time points from 1 to 24 hours. The novel filtering strategy uses standard statistical and fold change techniques, along with the consistency of replicate temporal profiles, to identify those differentially expressed genes that were consistent in two biological replicate experiments. To address the issue of cluster reproducibility a consensus clustering method, which identifies clusters of genes whose expression varies consistently between replicates, was also developed and applied. Analysis of the resulting clusters revealed many known and novel characteristics of DC maturation, such as the up-regulation of specific immune response pathways. Intriguingly, more genes were down-regulated than up-regulated. Results identify a more comprehensive program of down-regulation, including many genes involved in protein synthesis, metabolism, and housekeeping needed for maintenance of cellular integrity and metabolism. 
Conclusions The new filtering strategy emphasizes the importance of consistent and reproducible results when analyzing microarray data and utilizes consistency between replicate experiments as a criterion in both feature selection and clustering, without averaging or otherwise combining replicate data. Observation of a significant down-regulation program during DC maturation indicates that DC are preparing for cell death and provides a path to better understand the process. This new filtering strategy can be adapted for use in analyzing other large-scale time course data sets with replicates.
Affiliation(s)
- Amy L Olex
- Department of Computer Science, Wake Forest University, Winston-Salem, NC 27109, USA
|
25
|
Cabanski CR, Qi Y, Yin X, Bair E, Hayward MC, Fan C, Li J, Wilkerson MD, Marron JS, Perou CM, Hayes DN. SWISS MADE: Standardized WithIn Class Sum of Squares to evaluate methodologies and dataset elements. PLoS One 2010; 5:e9905. [PMID: 20360852 PMCID: PMC2845619 DOI: 10.1371/journal.pone.0009905] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/24/2009] [Accepted: 02/26/2010] [Indexed: 11/19/2022]
Abstract
Contemporary high dimensional biological assays, such as mRNA expression microarrays, regularly involve multiple data processing steps, such as experimental processing, computational processing, sample selection, or feature selection (i.e. gene selection), prior to deriving any biological conclusions. These steps can dramatically change the interpretation of an experiment. Evaluation of processing steps has received limited attention in the literature. It is not straightforward to evaluate different processing methods and investigators are often unsure of the best method. We present a simple statistical tool, Standardized WithIn class Sum of Squares (SWISS), that allows investigators to compare alternate data processing methods, such as different experimental methods, normalizations, or technologies, on a dataset in terms of how well they cluster a priori biological classes. SWISS uses Euclidean distance to determine which method does a better job of clustering the data elements based on a priori classifications. We apply SWISS to three different gene expression applications. The first application uses four different datasets to compare different experimental methods, normalizations, and gene sets. The second application, using data from the MicroArray Quality Control (MAQC) project, compares different microarray platforms. The third application compares different technologies: a single Agilent two-color microarray versus one lane of RNA-Seq. These applications give an indication of the variety of problems that SWISS can be helpful in solving. The SWISS analysis of one-color versus two-color microarrays provides investigators who use two-color arrays the opportunity to review their results in light of a single-channel analysis, with all of the associated benefits offered by this design. Analysis of the MAQC data shows differential intersite reproducibility by array platform. 
SWISS also shows that one lane of RNA-Seq clusters data by biological phenotypes as well as a single Agilent two-color microarray.
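The abstract does not spell out the SWISS formula. The sketch below assumes the reading suggested by the name: the within-class sum of squared Euclidean distances to each class centroid, divided by the total sum of squares about the overall centroid, so that lower scores mean tighter a priori classes. The function name `swiss_score` and the toy data are illustrative only.

```python
def swiss_score(samples, classes):
    """Within-class sum of squares divided by total sum of squares (lower = tighter classes).

    Assumed definition: the source abstract names the statistic but gives no formula.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_of_squares(rows, center):
        return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

    total = sum_of_squares(samples, centroid(samples))
    within = 0.0
    for label in set(classes):
        members = [s for s, c in zip(samples, classes) if c == label]
        within += sum_of_squares(members, centroid(members))
    return within / total


# The same four samples scored against two candidate class assignments.
toy_samples = [(0, 0), (0, 2), (10, 0), (10, 2)]
tight = swiss_score(toy_samples, ["a", "a", "b", "b"])   # classes follow the large gap
loose = swiss_score(toy_samples, ["a", "b", "a", "b"])   # classes cut across the gap
```

Because the score is standardized by the total sum of squares, it can be compared across processing methods that put the data on different scales, which fits the comparisons of normalizations and platforms described in the abstract.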
Affiliation(s)
- Christopher R. Cabanski
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Yuan Qi
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Xiaoying Yin
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Department of Otolaryngology/Head and Neck Surgery, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Eric Bair
- School of Dentistry, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Michele C. Hayward
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Cheng Fan
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Jianying Li
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Matthew D. Wilkerson
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - J. S. Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - Charles M. Perou
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Department of Pathology and Laboratory Medicine, University of North Carolina, Chapel Hill, North Carolina, United States of America
| | - D. Neil Hayes
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Division of Medical Oncology, Department of Internal Medicine, University of North Carolina, Chapel Hill, North Carolina, United States of America
|
26
|
Newman AM, Cooper JB. AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics 2010; 11:117. [PMID: 20202218 PMCID: PMC2846907 DOI: 10.1186/1471-2105-11-117] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.9] [Received: 11/27/2009] [Accepted: 03/04/2010] [Indexed: 12/25/2022]
Abstract
Background Clustering the information content of large high-dimensional gene expression datasets has widespread application in "omics" biology. Unfortunately, the underlying structure of these natural datasets is often fuzzy, and the computational identification of data clusters generally requires knowledge about cluster number and geometry. Results We integrated strategies from machine learning, cartography, and graph theory into a new informatics method for automatically clustering self-organizing map ensembles of high-dimensional data. Our new method, called AutoSOME, readily identifies discrete and fuzzy data clusters without prior knowledge of cluster number or structure in diverse datasets including whole genome microarray data. Visualization of AutoSOME output using network diagrams and differential heat maps reveals unexpected variation among well-characterized cancer cell lines. Co-expression analysis of data from human embryonic and induced pluripotent stem cells using AutoSOME identifies >3400 up-regulated genes associated with pluripotency, and indicates that a recently identified protein-protein interaction network characterizing pluripotency was underestimated by a factor of four. Conclusions By effectively extracting important information from high-dimensional microarray data without prior knowledge or the need for data filtration, AutoSOME can yield systems-level insights from whole genome microarray expression studies. Due to its generality, this new method should also have practical utility for a variety of data-intensive applications, including the results of deep sequencing experiments. AutoSOME is available for download at http://jimcooperlab.mcdb.ucsb.edu/autosome.
Affiliation(s)
- Aaron M Newman
- Biomolecular Science and Engineering Program, University of California, Santa Barbara, CA 93106, USA
|
27
|
Klie S, Nikoloski Z, Selbig J. Biological cluster evaluation for gene function prediction. J Comput Biol 2010; 21:428-45. [PMID: 20059365 DOI: 10.1089/cmb.2009.0129] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
Recent advances in high-throughput omics techniques render it possible to decode the function of genes by using the "guilt-by-association" principle on biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottleneck issues: (1) the choice of the number of clusters, and (2) the external measures, which do not take into consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is shown to be valid on synthetic data and available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering of optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
Affiliation(s)
- Sebastian Klie
- Max-Planck Institute for Molecular Plant Physiology, Potsdam, Brandenburg, Germany
|
28
|
Distance Functions, Clustering Algorithms and Microarray Data Analysis. LECTURE NOTES IN COMPUTER SCIENCE 2010. [DOI: 10.1007/978-3-642-13800-3_10] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 12/24/2022]
|
29
|
Zhang M, Zhang W, Sicotte H, Yang P. A new validity measure for a correlation-based fuzzy c-means clustering algorithm. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2009; 2009:3865-8. [PMID: 19963601 DOI: 10.1109/iembs.2009.5332582] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
One of the major challenges in unsupervised clustering is the lack of consistent means for assessing the quality of clusters. In this paper, we evaluate several validity measures in fuzzy clustering and develop a new measure for a fuzzy c-means algorithm which uses a Pearson correlation in its distance metric. The measure is based on the within-cluster sum of squares and makes use of fuzzy memberships. Compared with the existing fuzzy partition coefficient and a fuzzy validity index, this new measure performs consistently across six microarray datasets. The newly developed measure could be used to assess the validity of fuzzy clusters produced by a correlation-based fuzzy c-means clustering algorithm.
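Two ingredients named in this abstract can be sketched briefly: a Pearson-correlation distance and the existing fuzzy partition coefficient that the new measure is compared against. The new validity measure itself is not specified in the abstract and is not reproduced here. The function names are illustrative, the distance is assumed to be one minus the correlation, and the partition coefficient is the usual mean of squared memberships.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)  # undefined for constant profiles (zero norm)


def partition_coefficient(memberships):
    """Mean of squared fuzzy memberships; each row holds one item's memberships, summing to 1."""
    return sum(u ** 2 for row in memberships for u in row) / len(memberships)


# Crisp memberships give 1.0; maximally fuzzy memberships over c clusters give 1/c.
crisp = partition_coefficient([[1.0, 0.0], [0.0, 1.0]])
fuzzy = partition_coefficient([[0.5, 0.5], [0.5, 0.5]])
```

A correlation-based distance groups expression profiles by shape rather than magnitude, which is presumably why a validity measure built for Euclidean fuzzy c-means may not transfer directly to the correlation-based variant.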
Affiliation(s)
- Mingrui Zhang
- Computer Science Department, Winona State University, MN 55987, USA.
|