Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Giancarlo R, Scaturro D, Utro F. Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics 2008;9:462. [PMID: 18959783 PMCID: PMC2657801 DOI: 10.1186/1471-2105-9-462] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Received: 09/23/2008] [Accepted: 10/29/2008] [Indexed: 12/04/2022] Open

Cited by Other Article(s)

Qin Z, Yang L, Gao F, Hu Q, Shen C. Uncertainty-Aware Aggregation for Federated Open Set Domain Adaptation. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2024;35:7548-7562. [PMID: 36306293 DOI: 10.1109/tnnls.2022.3214930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]

Meng J, Jiang A, Lu X, Gu D, Ge Q, Bai S, Zhou Y, Zhou J, Hao Z, Yan F, Wang L, Wang H, Du J, Liang C. Multiomics characterization and verification of clear cell renal cell carcinoma molecular subtypes to guide precise chemotherapy and immunotherapy. IMETA 2023;2:e147. [PMID: 38868222 PMCID: PMC10989995 DOI: 10.1002/imt2.147] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/05/2023] [Accepted: 10/21/2023] [Indexed: 06/14/2024]

Abstract

Clear cell renal cell carcinoma (ccRCC) is a heterogeneous tumor with different genetic and molecular alterations. A multiomics-based classification scheme for ccRCC is urgently needed to promote further biological insights. Two hundred and fifty-five ccRCC patients with paired clinical information, transcriptome expression profiles, copy number alterations, DNA methylation, and somatic mutations were collected for identification. Bioinformatic analyses were performed with our team's recently developed R package "MOVICS." With 10 state-of-the-art algorithms, we identified the multiomics subtypes (MoSs) for ccRCC patients. MoS1 is an immune-exhausted subtype that presented the poorest prognosis, which might be caused by an exhausted immune microenvironment and activated hypoxia features, but it can benefit from PI3K/AKT inhibitors. MoS2 is an immune "cold" subtype, which presented more mutations of VHL and PBRM1 and a favorable prognosis, and is more suitable for sunitinib therapy. MoS3 is the immune "hot" subtype and can benefit from anti-PD-1 immunotherapy. We successfully verified the different molecular features of the three MoSs in the external cohorts GSE22541, GSE40435, and GSE53573. Data from patients who received nivolumab therapy helped us to confirm that MoS3 is suitable for anti-PD-1 therapy. The E-MTAB-3267 cohort also supported the finding that MoS2 patients respond better to sunitinib treatment. We also confirm that SETD2 is a tumor suppressor in ccRCC: the SETD2 protein level is decreased in advanced tumor stages, and knock-down of SETD2 promotes cell proliferation, migration, and invasion. In summary, we provide novel insights into ccRCC molecular subtypes based on robust clustering algorithms applied to multiomics data, and encourage future precise treatment of ccRCC patients.

Affiliation(s)

Jialin Meng Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
Aimin Jiang Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
Xiaofan Lu Department of Cancer and Functional Genomics, Institute of Genetics and Molecular and Cellular Biology, CNRS/INSERM/UNISTRA, Illkirch, France
Di Gu Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
Qintao Ge Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
Suwen Bai The Second Affiliated Hospital, School of Medicine, The Chinese University of Hong Kong, Shenzhen & Longgang District People's Hospital of Shenzhen, Shenzhen, China
Yundong Zhou Department of Surgery, Ningbo Medical Center Lihuili Hospital, Ningbo University, Ningbo, Zhejiang, China
Jun Zhou Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
Zongyao Hao Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China
Fangrong Yan Research Center of Biostatistics and Computational Pharmacy, China Pharmaceutical University, Nanjing, China
Linhui Wang Department of Urology, Changhai Hospital, Naval Medical University (Second Military Medical University), Shanghai, China
Haitao Wang Cancer Center, Faculty of Health Sciences, University of Macau, Macau SAR, China. Present address: Center for Cancer Research, Bethesda, Maryland, USA
Juan Du The Second Affiliated Hospital, School of Medicine, The Chinese University of Hong Kong, Shenzhen & Longgang District People's Hospital of Shenzhen, Shenzhen, China
Chaozhao Liang Department of Urology, The First Affiliated Hospital of Anhui Medical University, Institute of Urology, Anhui Medical University, Anhui Province Key Laboratory of Genitourinary Diseases, Anhui Medical University, Hefei, China

Using a national level cross-sectional study to develop a Hospital Preparedness Index (HOSPI) for Covid-19 management: A case study from India. PLoS One 2022;17:e0269842. [PMID: 35895724 PMCID: PMC9328545 DOI: 10.1371/journal.pone.0269842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2021] [Accepted: 05/29/2022] [Indexed: 11/26/2022] Open

Petrillo UF, Palini F, Cattaneo G, Giancarlo R. Alignment-free Genomic Analysis via a Big Data Spark Platform. Bioinformatics 2021;37:1658-1665. [PMID: 33471066 DOI: 10.1093/bioinformatics/btab014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/22/2020] [Revised: 12/28/2020] [Accepted: 01/06/2021] [Indexed: 11/12/2022] Open

Abstract

MOTIVATION

Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity.
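
As a rough illustration of what an alignment-free distance function looks like, the Python sketch below compares two sequences through their k-mer sets and returns a Jaccard distance. It is a generic textbook example, not one of the eighteen AF functions in FADE and not FADE's API; the function names `kmers` and `jaccard_distance` and the default `k=3` are assumptions made for this example.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Alignment-free distance: 1 - |A & B| / |A | B| over the two k-mer sets.

    Hypothetical helper for illustration only; no alignment is computed.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:  # both sequences are shorter than k
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


if __name__ == "__main__":
    print(jaccard_distance("ACGT", "ACGA", k=2))
```

Because only k-mer sets are compared, the cost depends on the number of k-mers rather than on a pairwise alignment, which is one reason such functions are attractive for large collections of sequences.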

RESULTS

We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort on distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.

AVAILABILITY

The software and the datasets are available at https://github.com/fpalini/fade.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Guttula PK, Gupta MK. Examining the co-expression, transcriptome clustering and variation using fuzzy cluster network of testicular stem cells and pluripotent stem cells compared with other cell types. Comput Biol Chem 2020;85:107227. [PMID: 32044562 DOI: 10.1016/j.compbiolchem.2020.107227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2019] [Revised: 10/10/2019] [Accepted: 01/31/2020] [Indexed: 10/25/2022]

Li X, LeBlanc J, Elashoff D, McHardy I, Tong M, Roth B, Ippoliti A, Barron G, McGovern D, McDonald K, Newberry R, Graeber T, Horvath S, Goodglick L, Braun J. Microgeographic Proteomic Networks of the Human Colonic Mucosa and Their Association With Inflammatory Bowel Disease. Cell Mol Gastroenterol Hepatol 2016;2:567-583. [PMID: 28174738 PMCID: PMC5042708 DOI: 10.1016/j.jcmgh.2016.05.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 04/20/2016] [Accepted: 05/06/2016] [Indexed: 12/28/2022]

Abstract

BACKGROUND & AIMS

Interactions between mucosal cell types, environmental stressors, and intestinal microbiota contribute to pathogenesis in inflammatory bowel disease (IBD). Here, we applied metaproteomics of the mucosal-luminal interface to study the disease-related biology of the human colonic mucosa.

METHODS

We recruited a discovery cohort of 51 IBD and non-IBD subjects endoscopically sampled by mucosal lavage at 6 colonic regions, and a validation cohort of 38 non-IBD subjects. Metaproteome data sets were produced for each sample and analyzed for association with colonic site and disease state using a suite of bioinformatic approaches. Localization of select proteins was determined by immunoblot analysis and immunohistochemistry of human endoscopic biopsy samples.

RESULTS

Co-occurrence analysis of the discovery cohort metaproteome showed that proteins at the mucosal surface clustered into modules with evidence of differential functional specialization (eg, iron regulation, microbial defense) and cellular origin (eg, epithelial or hemopoietic). These modules, validated in an independent cohort, were differentially associated spatially along the gastrointestinal tract, and 7 modules were associated selectively with non-IBD, ulcerative colitis, and/or Crohn's disease states. In addition, the detailed composition of certain modules was altered in disease vs healthy states. We confirmed the predicted spatial and disease-associated localization of 28 proteins representing 4 different disease-related modules by immunoblot and immunohistochemistry visualization, with evidence for their distribution as millimeter-scale microgeographic mosaic.

CONCLUSIONS

These findings suggest that the mucosal surface is a microgeographic mosaic of functional networks reflecting the local mucosal ecology, whose compositional differences in disease and healthy samples may provide a unique readout of physiologic and pathologic mucosal states.

Affiliation(s)

Xiaoxiao Li Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
James LeBlanc Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
David Elashoff Department of Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Ian McHardy Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Maomeng Tong Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Bennett Roth Department of Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Andrew Ippoliti Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
Gildardo Barron Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
Dermot McGovern Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
Keely McDonald Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri
Rodney Newberry Department of Internal Medicine, Washington University School of Medicine, St. Louis, Missouri
Thomas Graeber Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Steve Horvath Department of Human Genetics and Biostatistics, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Lee Goodglick Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California
Jonathan Braun Department of Molecular and Medical Pharmacology, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California; Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California. Address correspondence to: Jonathan Braun, MD, PhD, Department of Pathology and Laboratory Medicine, University of California Los Angeles David Geffen School of Medicine, Los Angeles, California 90095. fax: (310) 267-4486.

Bai L, Wang F, Zhang DS, Li C, Jin Y, Wang DS, Chen DL, Qiu MZ, Luo HY, Wang ZQ, Li YH, Wang FH, Xu RH. A plasma cytokine and angiogenic factor (CAF) analysis for selection of bevacizumab therapy in patients with metastatic colorectal cancer. Sci Rep 2015;5:17717. [PMID: 26620439 PMCID: PMC4664961 DOI: 10.1038/srep17717] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 04/27/2015] [Accepted: 11/04/2015] [Indexed: 01/09/2023] Open

Affiliation(s)

Long Bai Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Feng Wang Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Dong-Sheng Zhang Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Cong Li Department of Medical Oncology, Zhejiang Cancer Hospital, Hangzhou 310022, P. R. China
Ying Jin Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
De-Shen Wang Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Dong-Liang Chen Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Miao-Zhen Qiu Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Hui-Yan Luo Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Zhi-Qiang Wang Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Yu-Hong Li Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Feng-Hua Wang Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China
Rui-Hua Xu Department of Medical Oncology, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, P. R. China; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, Guangzhou, Guangdong 510060, P. R. China

Hu CW, Kornblau SM, Slater JH, Qutub AA. Progeny Clustering: A Method to Identify Biological Phenotypes. Sci Rep 2015;5:12894. [PMID: 26267476 PMCID: PMC4533525 DOI: 10.1038/srep12894] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 02/04/2015] [Accepted: 07/15/2015] [Indexed: 01/24/2023] Open

Giancarlo R, Scaturro D, Utro F. ValWorkBench: an open source Java library for cluster validation, with applications to microarray data analysis. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2015;118:207-217. [PMID: 25582071 DOI: 10.1016/j.cmpb.2014.12.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/20/2014] [Revised: 10/07/2014] [Accepted: 12/16/2014] [Indexed: 06/04/2023]

Fa R, Nandi AK. Noise Resistant Generalized Parametric Validity Index of Clustering for Gene Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014;11:741-752. [PMID: 26356344 DOI: 10.1109/tcbb.2014.2312006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]

Mitchell L, Sloan TM, Mewissen M, Ghazal P, Forster T, Piotrowski M, Trew A. Parallel classification and feature selection in microarray data using SPRINT. CONCURRENCY AND COMPUTATION : PRACTICE & EXPERIENCE 2014;26:854-865. [PMID: 24883047 PMCID: PMC4038771 DOI: 10.1002/cpe.2928] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/03/2023]

Tong M, Li X, Wegener Parfrey L, Roth B, Ippoliti A, Wei B, Borneman J, McGovern DPB, Frank DN, Li E, Horvath S, Knight R, Braun J. A modular organization of the human intestinal mucosal microbiota and its association with inflammatory bowel disease. PLoS One 2013;8:e80702. [PMID: 24260458 PMCID: PMC3834335 DOI: 10.1371/journal.pone.0080702] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Received: 08/21/2013] [Accepted: 10/07/2013] [Indexed: 02/08/2023] Open

Abstract

Abnormalities of the intestinal microbiota are implicated in the pathogenesis of Crohn's disease (CD) and ulcerative colitis (UC), two spectra of inflammatory bowel disease (IBD). However, the high complexity and low inter-individual overlap of intestinal microbial composition are formidable barriers to identifying microbial taxa representing this dysbiosis. These difficulties might be overcome by an ecologic analytic strategy to identify modules of interacting bacteria (rather than individual bacteria) as quantitative reproducible features of microbial composition in normal and IBD mucosa. We sequenced 16S ribosomal RNA genes from 179 endoscopic lavage samples from different intestinal regions in 64 subjects (32 controls, 16 CD and 16 UC patients in clinical remission). CD and UC patients showed a reduction in phylogenetic diversity and shifts in microbial composition, comparable to previous studies using conventional mucosal biopsies. Analysis of the weighted co-occurrence network revealed 5 microbial modules. These modules were unprecedented, as they were detectable in all individuals, and their composition and abundance were recapitulated in an independent, biopsy-based mucosal dataset. Two modules were associated with healthy, CD, or UC disease states. Imputed metagenome analysis indicated that these modules displayed distinct metabolic functionality, specifically the enrichment of oxidative response and glycan metabolism pathways relevant to host-pathogen interaction in the disease-associated modules. The highly preserved microbial modules accurately classified IBD status of individual patients during disease quiescence, suggesting that microbial dysbiosis in IBD may be an underlying disorder independent of disease activity. Microbial modules thus provide an integrative view of microbial ecology relevant to IBD.

Affiliation(s)

Maomeng Tong Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
Xiaoxiao Li Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
Laura Wegener Parfrey Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America
Bennett Roth Department of Medicine, Division of Digestive Disease, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
Andrew Ippoliti Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
Bo Wei Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
James Borneman Department of Plant Pathology and Microbiology, University of California Riverside, Riverside, California, United States of America
Dermot P. B. McGovern Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
Daniel N. Frank Division of Infectious Diseases, University of Colorado, School of Medicine, Aurora, Colorado, United States of America; Union Council, Denver Microbiome Research Consortium (MiRC), University of Colorado, School of Medicine, Aurora, Colorado, United States of America
Ellen Li Department of Medicine, Stony Brook University, Stony Brook, New York, United States of America
Steve Horvath Department of Human Genetics and Biostatistics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
Rob Knight Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America; Howard Hughes Medical Institute, University of Colorado, Boulder, Colorado, United States of America
Jonathan Braun Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America; Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America

Budinska E, Popovici V, Tejpar S, D'Ario G, Lapique N, Sikora KO, Di Narzo AF, Yan P, Hodgson JG, Weinrich S, Bosman F, Roth A, Delorenzi M. Gene expression patterns unveil a new level of molecular heterogeneity in colorectal cancer. J Pathol 2013;231:63-76. [PMID: 23836465 PMCID: PMC3840702 DOI: 10.1002/path.4212] [Citation(s) in RCA: 294] [Impact Index Per Article: 26.7] [Received: 02/03/2013] [Revised: 05/10/2013] [Accepted: 05/14/2013] [Indexed: 02/06/2023]

Abstract

The recognition that colorectal cancer (CRC) is a heterogeneous disease in terms of clinical behaviour and response to therapy translates into an urgent need for robust molecular disease subclassifiers that can explain this heterogeneity beyond current parameters (MSI, KRAS, BRAF). Attempts to fill this gap are emerging. The Cancer Genome Atlas (TCGA) reported two main CRC groups, based on the incidence and spectrum of mutated genes, and another paper reported a subgroup defined by an EMT expression signature. We performed a prior-free analysis of CRC heterogeneity on 1113 CRC gene expression profiles and compared our findings with established molecular determinants and clinical, histopathological and survival data. Unsupervised clustering based on gene modules allowed us to distinguish at least five different gene expression CRC subtypes, which we call surface crypt-like, lower crypt-like, CIMP-H-like, mesenchymal and mixed. A gene set enrichment analysis combined with literature search of gene module members identified distinct biological motifs in different subtypes. The subtypes, which were not derived based on outcome, nonetheless showed differences in prognosis. Known gene copy number variations and mutations in key cancer-associated genes differed between subtypes, but the subtypes provided molecular information beyond that contained in these variables. Morphological features significantly differed between subtypes. The objective existence of the subtypes and their clinical and molecular characteristics were validated in an independent set of 720 CRC expression profiles. Our subtypes provide a novel perspective on the heterogeneity of CRC. The proposed subtypes should be further explored retrospectively on existing clinical trial datasets and, when sufficiently robust, be prospectively assessed for clinical relevance in terms of prognosis and treatment response predictive capacity.
Original microarray data were uploaded to the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress/) under Accession Nos E-MTAB-990 and E-MTAB-1026.

Mavridis L, Nath N, Mitchell JBO. PFClust: a novel parameter free clustering algorithm. BMC Bioinformatics 2013;14:213. [PMID: 23819480 PMCID: PMC3747858 DOI: 10.1186/1471-2105-14-213] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 01/29/2013] [Accepted: 07/01/2013] [Indexed: 12/02/2022] Open

Abstract

Background

We present the algorithm PFClust (Parameter Free Clustering), which is able automatically to cluster data and identify a suitable number of clusters to group them into without requiring any parameters to be specified by the user. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and there are many different hierarchical and partitional, agglomerative and divisive, clustering methodologies available that can be used to do this. Nonetheless, automatically determining the number of clusters present in a dataset constitutes a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters to group the objects into involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. Thus, we test PFClust on datasets for which an external gold standard of ‘correct’ cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.

Results

We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies – even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.

Conclusions

We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.
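
The Background of this abstract notes that identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. The Python sketch below illustrates that generic scan only; it is not the PFClust algorithm. The choice of average-linkage merging, of the mean silhouette width as the score, and the names `average_linkage`, `silhouette` and `pick_k` are all assumptions made for this example.

```python
from math import dist
from statistics import mean


def average_linkage(points, k):
    """Greedily merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist(points[i], points[j])
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters


def silhouette(points, clusters):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = mean(dist(points[i], points[j]) for j in c if j != i)
            b = min(mean(dist(points[i], points[j]) for j in other)
                    for other in clusters if other is not c)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


def pick_k(points, k_max):
    """Score every k in 2..k_max and return the k with the best silhouette."""
    scored = {k: silhouette(points, average_linkage(points, k))
              for k in range(2, k_max + 1)}
    return max(scored, key=scored.get)


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
           (20, 0), (20, 1), (21, 0)]
    print(pick_k(pts, 5))
```

As the abstract stresses, there is no agreed definition of optimum here: a different score or linkage rule could favour a different number of clusters on the same data.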

Giancarlo R, Lo Bosco G, Pinello L, Utro F. A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis. BMC Bioinformatics 2013;14 Suppl 1:S6. [PMID: 23369037 PMCID: PMC3548704 DOI: 10.1186/1471-2105-14-s1-s6] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention, if inferences made from cluster analysis have to be of some relevance to biomedical research.

RESULTS

A procedure is proposed for assessing the discriminative ability of a distance function, that is, its ability to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first one allows distances and clustering solutions to be mapped as graphical objects on a plane, and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index which performs similarly to the state of the art, but with more flexibility, allowing for a broader spectrum of applications. In fact, it not only quantifies the merit of each clustering solution but also quantifies the agglomerative or divisive errors due to the algorithm.

CONCLUSIONS

The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation, since a dataset must have a substantial number of features for the estimate to be reliable. Nevertheless, taking that fact into account, together with results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, it turns out that K-means, Average Link, and Complete Link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the format of a "feature matrix". In particular it can be used to cluster ChIP-seq data.

A systematic comparison of genome-scale clustering algorithms. BMC Bioinformatics 2012;13 Suppl 10:S7. [PMID: 22759431 PMCID: PMC3382433 DOI: 10.1186/1471-2105-13-s10-s7] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Indexed: 11/22/2022] Open

Abstract

Background

A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparison of these methods to evaluate their relative effectiveness provides guidance to algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices that have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae.

Methods

For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the top five scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method.
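
The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the bin boundaries `small_max` and `medium_max` are placeholders (the abstract does not give the thresholds), and the sketch scores a single parameter setting rather than selecting the best one across settings.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of gene IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_jaccard(cluster, annotation_sets):
    """Highest Jaccard score of one cluster against every annotation set."""
    return max((jaccard(cluster, s) for s in annotation_sets), default=0.0)


def bat5(clusters, annotation_sets, small_max=10, medium_max=100, top=5):
    """Average the top-scoring clusters within each size bin.

    small_max and medium_max are hypothetical bin boundaries; the source
    does not state the values that were actually used.
    """
    bins = {"small": [], "medium": [], "large": []}
    for cluster in clusters:
        size = len(set(cluster))
        if size <= small_max:
            name = "small"
        elif size <= medium_max:
            name = "medium"
        else:
            name = "large"
        bins[name].append(best_jaccard(cluster, annotation_sets))

    result = {}
    for name, scores in bins.items():
        best = sorted(scores, reverse=True)[:top]
        # A bin with no clusters has no score.
        result[name] = sum(best) / len(best) if best else None
    return result
```

Calling `bat5(clusters, annotation_sets)` returns one averaged score per size bin, which keeps very small and very large clusters from being compared directly.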

Results

Clusters produced by each method were evaluated based upon the positive match to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods.

Conclusions

Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.

Morris JH, Apeltsin L, Newman AM, Baumbach J, Wittkop T, Su G, Bader GD, Ferrin TE. clusterMaker: a multi-algorithm clustering plugin for Cytoscape. BMC Bioinformatics 2011;12:436. [PMID: 22070249 PMCID: PMC3262844 DOI: 10.1186/1471-2105-12-436] [Citation(s) in RCA: 414] [Impact Index Per Article: 31.8] [Received: 08/08/2011] [Accepted: 11/09/2011] [Indexed: 12/02/2022] Open

Abstract

Background

In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL.

Results

Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section.

Conclusions

The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.

Ozcaglar C, Shabbeer A, Vandenberg S, Yener B, Bennett KP. Sublineage structure analysis of Mycobacterium tuberculosis complex strains using multiple-biomarker tensors. BMC Genomics 2011;12 Suppl 2:S1. [PMID: 21988942 PMCID: PMC3194230 DOI: 10.1186/1471-2164-12-s2-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022] Open

Naegle KM, Welsch RE, Yaffe MB, White FM, Lauffenburger DA. MCAM: multiple clustering analysis methodology for deriving hypotheses and insights from high-throughput proteomic datasets. PLoS Comput Biol 2011;7:e1002119. [PMID: 21799663 PMCID: PMC3140961 DOI: 10.1371/journal.pcbi.1002119] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 02/15/2011] [Accepted: 05/25/2011] [Indexed: 01/22/2023] Open

Abstract

Advances in proteomic technologies continue to substantially accelerate capability for generating experimental data on protein levels, states, and activities in biological samples. For example, studies on receptor tyrosine kinase signaling networks can now capture the phosphorylation state of hundreds to thousands of proteins across multiple conditions. However, little is known about the function of many of these protein modifications, or the enzymes responsible for modifying them. To address this challenge, we have developed an approach that enhances the power of clustering techniques to infer functional and regulatory meaning of protein states in cell signaling networks. We have created a new computational framework for applying clustering to biological data in order to overcome the typical dependence on specific a priori assumptions and expert knowledge concerning the technical aspects of clustering. Multiple clustering analysis methodology (‘MCAM’) employs an array of diverse data transformations, distance metrics, set sizes, and clustering algorithms, in a combinatorial fashion, to create a suite of clustering sets. These sets are then evaluated based on their ability to produce biological insights through statistical enrichment of metadata relating to knowledge concerning protein functions, kinase substrates, and sequence motifs. We applied MCAM to a set of dynamic phosphorylation measurements of the ERBB network to explore the relationships between algorithmic parameters and the biological meaning that could be inferred and report on interesting biological predictions. Further, we applied MCAM to multiple phosphoproteomic datasets for the ERBB network, which allowed us to compare independent and incomplete overlapping measurements of phosphorylation sites in the network. We report specific and global differences of the ERBB network stimulated with different ligands and with changes in HER2 expression. Overall, we offer MCAM as a broadly-applicable approach for analysis of proteomic data which may help increase the current understanding of molecular networks in a variety of biological problems.

Proteomic measurements, especially modification measurements, are greatly expanding the current knowledge of the state of proteins under various conditions. Harnessing these measurements to understand how these modifications are enzymatically regulated and their subsequent function in cellular signaling and physiology is a challenging new problem. Clustering has been very useful in reducing the dimensionality of many types of high-throughput biological data, as well as in inferring the function of poorly understood molecular species. However, its implementation requires a great deal of technical expertise since there are a large number of parameters one must decide on in clustering, including data transforms, distance metrics, and algorithms. Previous knowledge of useful parameters does not exist for measurements of a new type. In this work we address two issues. First, we develop a framework that incorporates any number of possible parameters of clustering to produce a suite of clustering solutions. These solutions are then judged on their ability to infer biological information through statistical enrichment of existing biological annotations. Second, we apply this framework to dynamic phosphorylation measurements of the ERBB network, constructing the first extensive analysis of clustering of phosphoproteomic data and generating insight into novel components and novel functions of known components of the ERBB network.

Albaum SP, Hahne H, Otto A, Haußmann U, Becher D, Poetsch A, Goesmann A, Nattkemper TW. A guide through the computational analysis of isotope-labeled mass spectrometry-based quantitative proteomics data: an application study. Proteome Sci 2011;9:30. [PMID: 21663690 PMCID: PMC3142201 DOI: 10.1186/1477-5956-9-30] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 02/09/2011] [Accepted: 06/11/2011] [Indexed: 01/03/2023] Open

Echenique-Rivera H, Muzzi A, Del Tordello E, Seib KL, Francois P, Rappuoli R, Pizza M, Serruto D. Transcriptome analysis of Neisseria meningitidis in human whole blood and mutagenesis studies identify virulence factors involved in blood survival. PLoS Pathog 2011;7:e1002027. [PMID: 21589640 PMCID: PMC3088726 DOI: 10.1371/journal.ppat.1002027] [Citation(s) in RCA: 117] [Impact Index Per Article: 9.0] [Received: 10/15/2010] [Accepted: 02/26/2011] [Indexed: 12/14/2022] Open

Abstract

During infection Neisseria meningitidis (Nm) encounters multiple environments within the host, which makes rapid adaptation a crucial factor for meningococcal survival. Despite the importance of invasion into the bloodstream in the meningococcal disease process, little is known about how Nm adapts to permit survival and growth in blood. To address this, we performed a time-course transcriptome analysis using an ex vivo model of human whole blood infection. We observed that Nm alters the expression of ≈30% of ORFs of the genome and major dynamic changes were observed in the expression of transcriptional regulators, transport and binding proteins, energy metabolism, and surface-exposed virulence factors. In particular, we found that the gene encoding the regulator Fur, as well as all genes encoding iron uptake systems, were significantly up-regulated. Analysis of regulated genes encoding for surface-exposed proteins involved in Nm pathogenesis allowed us to better understand mechanisms used to circumvent host defenses. During blood infection, Nm activates genes encoding for the factor H binding proteins, fHbp and NspA, genes encoding for detoxifying enzymes such as SodC, Kat and AniA, as well as several less characterized surface-exposed proteins that might have a role in blood survival. Through mutagenesis studies of a subset of up-regulated genes we were able to identify new proteins important for survival in human blood and also to identify additional roles of previously known virulence factors in aiding survival in blood. Nm mutant strains lacking the genes encoding the hypothetical protein NMB1483 and the surface-exposed proteins NalP, Mip and NspA, the Fur regulator, the transferrin binding protein TbpB, and the L-lactate permease LctP were sensitive to killing by human blood. 
This increased knowledge of how Nm adapts to blood could also help in developing diagnostic and therapeutic strategies to control the devastating disease caused by this microorganism.

Giancarlo R, Utro F. Speeding up the Consensus Clustering methodology for microarray data analysis. Algorithms Mol Biol 2011;6:1. [PMID: 21235792 PMCID: PMC3035181 DOI: 10.1186/1748-7188-6-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 09/29/2010] [Accepted: 01/14/2011] [Indexed: 11/10/2022] Open

Abstract

Background

The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up.

Results

Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, the closely related algorithm FC (Fast Consensus) that would have the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that summarize key features of microarray applications, such as cancer studies, gene expression with up and down patterns, and a full spectrum of dimensionality up to over a thousand. Based on their outcome, compared with previous benchmarking results available in the literature, FC turns out to be among the fastest internal validation methods, while retaining the same outstanding precision of Consensus. Moreover, it also provides a consensus matrix that can be used as a dissimilarity matrix, guaranteeing the same performance as the corresponding matrix produced by Consensus. We have also experimented with the use of Consensus and FC in conjunction with NMF (Nonnegative Matrix Factorization), in order to identify the correct number of clusters in a dataset. Although NMF is an increasingly popular technique for biological data mining, our results are somewhat disappointing and complement quite well the state of the art about NMF, shedding further light on its merits and limitations.
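
To make the notion of a consensus matrix concrete, the following Python sketch builds one from the cluster labels of several resampling runs. It is an illustration only, not the Consensus or FC algorithm from the paper: the function name and input layout are hypothetical, and the clustering runs themselves are assumed to have been computed elsewhere. Each entry is the fraction of runs in which two items were clustered together, out of the runs in which both were sampled.

```python
from itertools import combinations


def consensus_matrix(items, runs):
    """Build a co-clustering frequency matrix.

    items: ordered list of item identifiers.
    runs:  list of dicts, one per resampling run, mapping each sampled
           item to the cluster label it received in that run.
    """
    index = {item: k for k, item in enumerate(items)}
    n = len(items)
    together = [[0] * n for _ in range(n)]  # runs where i and j share a cluster
    sampled = [[0] * n for _ in range(n)]   # runs where i and j were both sampled

    for labels in runs:
        for a, b in combinations(labels, 2):
            i, j = index[a], index[b]
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[a] == labels[b]:
                together[i][j] += 1
                together[j][i] += 1

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0
        for j in range(n):
            if i != j and sampled[i][j]:
                matrix[i][j] = together[i][j] / sampled[i][j]
    return matrix
```

Entries near 0 or 1 indicate stable assignments across runs. A common convention, assumed here rather than taken from the abstract, is to use one minus each entry when a dissimilarity is needed.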

Conclusions

In summary, FC with a parameter setting that makes it robust with respect to small and medium-sized datasets, i.e., number of items to cluster in the hundreds and number of conditions up to a thousand, seems to be the internal validation measure of choice. Moreover, the technique we have developed here can be used in other contexts, in particular for the speed-up of stability-based validation measures.

Freyhult E, Landfors M, Önskog J, Hvidsten TR, Rydén P. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bioinformatics 2010;11:503. [PMID: 20937082 PMCID: PMC3098084 DOI: 10.1186/1471-2105-11-503] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.4] [Received: 04/12/2010] [Accepted: 10/11/2010] [Indexed: 08/30/2023] Open

Abstract

BACKGROUND

Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can have an effect on the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization.

RESULTS

We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performances of the different analyses vary between the data sets and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further in order to draw any general conclusions.
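
For readers unfamiliar with the adjusted Rand index used above to compare a clustering with known classes, here is a minimal Python sketch of the standard pair-counting formula. The function name is hypothetical, and this is a generic textbook formulation, not the code used in the study.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same items."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")

    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Contingency table cells, row sums and column sums.
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (sum_cells - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the labels are named, close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.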

CONCLUSIONS

The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.

Olex AL, Hiltbold EM, Leng X, Fetrow JS. Dynamics of dendritic cell maturation are identified through a novel filtering strategy applied to biological time-course microarray replicates. BMC Immunol 2010;11:41. [PMID: 20682054 PMCID: PMC2928180 DOI: 10.1186/1471-2172-11-41] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 09/27/2009] [Accepted: 08/03/2010] [Indexed: 01/04/2023] Open

Abstract

Background

Dendritic cells (DC) play a central role in primary immune responses and become potent stimulators of the adaptive immune response after undergoing the critical process of maturation. Understanding the dynamics of DC maturation would provide key insights into this important process. Time course microarray experiments can provide unique insights into DC maturation dynamics. Replicate experiments are necessary to address the issues of experimental and biological variability. Statistical methods and averaging are often used to identify significant signals. Here a novel strategy for filtering of replicate time course microarray data, which identifies consistent signals between the replicates, is presented and applied to a DC time course microarray experiment.

Results

The temporal dynamics of DC maturation were studied by stimulating DC with poly(I:C) and following gene expression at 5 time points from 1 to 24 hours. The novel filtering strategy uses standard statistical and fold change techniques, along with the consistency of replicate temporal profiles, to identify those differentially expressed genes that were consistent in two biological replicate experiments. To address the issue of cluster reproducibility a consensus clustering method, which identifies clusters of genes whose expression varies consistently between replicates, was also developed and applied. Analysis of the resulting clusters revealed many known and novel characteristics of DC maturation, such as the up-regulation of specific immune response pathways. Intriguingly, more genes were down-regulated than up-regulated. Results identify a more comprehensive program of down-regulation, including many genes involved in protein synthesis, metabolism, and housekeeping needed for maintenance of cellular integrity and metabolism.

Conclusions

The new filtering strategy emphasizes the importance of consistent and reproducible results when analyzing microarray data and utilizes consistency between replicate experiments as a criterion in both feature selection and clustering, without averaging or otherwise combining replicate data. Observation of a significant down-regulation program during DC maturation indicates that DC are preparing for cell death and provides a path to better understand the process. This new filtering strategy can be adapted for use in analyzing other large-scale time course data sets with replicates.

Cabanski CR, Qi Y, Yin X, Bair E, Hayward MC, Fan C, Li J, Wilkerson MD, Marron JS, Perou CM, Hayes DN. SWISS MADE: Standardized WithIn Class Sum of Squares to evaluate methodologies and dataset elements. PLoS One 2010;5:e9905. [PMID: 20360852 PMCID: PMC2845619 DOI: 10.1371/journal.pone.0009905] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/24/2009] [Accepted: 02/26/2010] [Indexed: 11/19/2022] Open

Abstract

Contemporary high dimensional biological assays, such as mRNA expression microarrays, regularly involve multiple data processing steps, such as experimental processing, computational processing, sample selection, or feature selection (i.e. gene selection), prior to deriving any biological conclusions. These steps can dramatically change the interpretation of an experiment. Evaluation of processing steps has received limited attention in the literature. It is not straightforward to evaluate different processing methods and investigators are often unsure of the best method. We present a simple statistical tool, Standardized WithIn class Sum of Squares (SWISS), that allows investigators to compare alternate data processing methods, such as different experimental methods, normalizations, or technologies, on a dataset in terms of how well they cluster a priori biological classes. SWISS uses Euclidean distance to determine which method does a better job of clustering the data elements based on a priori classifications. We apply SWISS to three different gene expression applications. The first application uses four different datasets to compare different experimental methods, normalizations, and gene sets. The second application, using data from the MicroArray Quality Control (MAQC) project, compares different microarray platforms. The third application compares different technologies: a single Agilent two-color microarray versus one lane of RNA-Seq. These applications give an indication of the variety of problems that SWISS can be helpful in solving. The SWISS analysis of one-color versus two-color microarrays provides investigators who use two-color arrays the opportunity to review their results in light of a single-channel analysis, with all of the associated benefits offered by this design. Analysis of the MAQC data shows differential intersite reproducibility by array platform. SWISS also shows that one lane of RNA-Seq clusters data by biological phenotypes as well as a single Agilent two-color microarray.
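
As a rough illustration of the idea behind a standardized within-class sum of squares, the Python sketch below divides the within-class sum of squared Euclidean distances by the total sum of squares around the overall mean. This ratio is an assumption inferred from the name of the statistic; the abstract does not give the formula, and the function name and input layout are hypothetical.

```python
def swiss_score(samples, classes):
    """Within-class sum of squares divided by total sum of squares.

    samples: list of equal-length numeric vectors (one per sample).
    classes: parallel list of a priori class labels.
    Assumed definition; lower values mean tighter classes.
    """
    def mean(rows):
        # Column-wise mean of a list of vectors.
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq(rows, center):
        # Sum of squared Euclidean distances from each row to the center.
        return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

    total = sum_sq(samples, mean(samples))
    if total == 0:
        raise ValueError("all samples are identical; the score is undefined")

    groups = {}
    for row, label in zip(samples, classes):
        groups.setdefault(label, []).append(row)

    within = sum(sum_sq(rows, mean(rows)) for rows in groups.values())
    return within / total
```

Under this definition the score lies between 0 and 1, so two processing methods applied to the same samples can be compared by checking which yields the smaller value for the same class labels.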

Affiliation(s)

Christopher R. Cabanski Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America
Yuan Qi Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
Xiaoying Yin Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America Department of Otolaryngology/Head and Neck Surgery, University of North Carolina, Chapel Hill, North Carolina, United States of America
Eric Bair School of Dentistry, University of North Carolina, Chapel Hill, North Carolina, United States of America Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
Michele C. Hayward Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
Cheng Fan Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
Jianying Li Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
Matthew D. Wilkerson Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
J. S. Marron Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, North Carolina, United States of America Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America
Charles M. Perou Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America Department of Pathology and Laboratory Medicine, University of North Carolina, Chapel Hill, North Carolina, United States of America
D. Neil Hayes Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina, United States of America Division of Medical Oncology, Department of Internal Medicine, University of North Carolina, Chapel Hill, North Carolina, United States of America

Newman AM, Cooper JB. AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics 2010;11:117. [PMID: 20202218 PMCID: PMC2846907 DOI: 10.1186/1471-2105-11-117] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.9] [Received: 11/27/2009] [Accepted: 03/04/2010] [Indexed: 12/25/2022] Open

Klie S, Nikoloski Z, Selbig J. Biological cluster evaluation for gene function prediction. J Comput Biol 2010;21:428-45. [PMID: 20059365 DOI: 10.1089/cmb.2009.0129] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022] Open

Distance Functions, Clustering Algorithms and Microarray Data Analysis. LECTURE NOTES IN COMPUTER SCIENCE 2010. [DOI: 10.1007/978-3-642-13800-3_10] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 12/24/2022]

Zhang M, Zhang W, Sicotte H, Yang P. A new validity measure for a correlation-based fuzzy c-means clustering algorithm. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2009;2009:3865-8. [PMID: 19963601 DOI: 10.1109/iembs.2009.5332582] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]