1
|
Gao Y, Wang Z, Xie J, Pan J. A new robust fuzzy c-means clustering method based on adaptive elastic distance. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107769] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 10/19/2022]
|
2
|
Jiang Y, Gu X, Wu D, Hang W, Xue J, Qiu S, Lin CT. A Novel Negative-Transfer-Resistant Fuzzy Clustering Model With a Shared Cross-Domain Transfer Latent Space and its Application to Brain CT Image Segmentation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:40-52. [PMID: 31905144 DOI: 10.1109/tcbb.2019.2963873] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Indexed: 06/10/2023]
Abstract
Traditional clustering algorithms for medical image segmentation can only achieve satisfactory clustering performance under relatively ideal conditions, in which there is adequate data from the same distribution, and the data is rarely disturbed by noise or outliers. However, a sufficient number of medical images with representative manual labels is often not available, because medical images are frequently acquired with different scanners (or different scan protocols) or polluted by various kinds of noise. Transfer learning improves learning in the target domain by leveraging knowledge from related domains. Given some target data, the performance of transfer learning is determined by the degree of relevance between the source and target domains. To achieve positive transfer and avoid negative transfer, a negative-transfer-resistant mechanism is proposed by computing the weight of transferred knowledge. A negative-transfer-resistant fuzzy clustering model with a shared cross-domain transfer latent space (called NTR-FC-SCT) is then proposed by integrating the negative-transfer-resistant mechanism and maximum mean discrepancy (MMD) into the framework of fuzzy c-means clustering. Experimental results show that the proposed NTR-FC-SCT model outperformed several traditional non-transfer and related transfer clustering algorithms.
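The maximum mean discrepancy (MMD) named in this abstract measures how far apart two samples lie in a kernel feature space. The following is a minimal Python sketch of the biased squared-MMD estimate, assuming a Gaussian kernel. The function names, the `gamma` bandwidth, and the toy data are illustrative assumptions; none of them are taken from the NTR-FC-SCT paper, and the sketch does not reproduce that model.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Hypothetical toy domains: one close to the source, one far from it.
source = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
near_target = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
far_target = [[3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]

print("MMD^2 to near domain:", mmd_squared(source, near_target))
print("MMD^2 to far domain: ", mmd_squared(source, far_target))
```

A small value suggests the two domains are similarly distributed, which is the kind of relevance signal a transfer method could use when weighting source knowledge.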
|
3
|
Liu C, Li Y, Zhao Q, Liu C. Reference vector-based multi-objective clustering for high-dimensional data. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.02.043] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/17/2022]
|
4
|
Qian P, Jiang Y, Wang S, Su KH, Wang J, Hu L, Muzic RF. Affinity and Penalty Jointly Constrained Spectral Clustering With All-Compatibility, Flexibility, and Robustness. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:1123-1138. [PMID: 26915134 PMCID: PMC4990515 DOI: 10.1109/tnnls.2015.2511179] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 05/13/2023]
Abstract
The existing, semisupervised, spectral clustering approaches have two major drawbacks, i.e., either they cannot cope with multiple categories of supervision or they sometimes exhibit unstable effectiveness. To address these issues, two normalized affinity and penalty jointly constrained spectral clustering frameworks as well as their corresponding algorithms, referred to as type-I affinity and penalty jointly constrained spectral clustering (TI-APJCSC) and type-II affinity and penalty jointly constrained spectral clustering (TII-APJCSC), respectively, are proposed in this paper. TI refers to type-I and TII to type-II. The significance of this paper is fourfold. First, benefiting from the distinctive affinity and penalty jointly constrained strategies, both TI-APJCSC and TII-APJCSC are substantially more effective than the existing methods. Second, both TI-APJCSC and TII-APJCSC are fully compatible with the three well-known categories of supervision, i.e., class labels, pairwise constraints, and grouping information. Third, owing to the delicate framework normalization, both TI-APJCSC and TII-APJCSC are quite flexible. With a simple tradeoff factor varying in the small fixed interval (0, 1], they can self-adapt to any semisupervised scenario. Finally, both TI-APJCSC and TII-APJCSC demonstrate strong robustness, not only to the number of pairwise constraints but also to the parameter for affinity measurement. As such, the novel TI-APJCSC and TII-APJCSC algorithms are very practical for medium- and small-scale semisupervised data sets. The experimental studies thoroughly evaluated and demonstrated these advantages on both synthetic and real-life semisupervised data sets.
Affiliation(s)
- Pengjiang Qian
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Yizhang Jiang
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Shitong Wang
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Kuan-Hao Su
- Case Center for Imaging Research, Department of Radiology, University Hospitals, Case Western Reserve University, Cleveland, OH 44106 USA
| | - Jun Wang
- School of Mechanical Engineering, Jiangnan University, Wuxi 214122, China
| | - Lingzhi Hu
- Philips Electronics North America, Highland Heights, OH 44143 USA
| | - Raymond F. Muzic
- Case Center for Imaging Research, Department of Radiology, University Hospitals, Case Western Reserve University, Cleveland, OH 44106 USA
| |
|
5
|
Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression. PLoS Comput Biol 2016; 12:e1004871. [PMID: 27177143 PMCID: PMC4866742 DOI: 10.1371/journal.pcbi.1004871] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 11/26/2015] [Accepted: 03/14/2016] [Indexed: 11/22/2022]
Abstract
By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.
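The idea of grouping consecutive observations that likely share a copy number can be illustrated with Haar detail coefficients: where all coefficients inside a dyadic interval are small, the signal is nearly constant there and the interval can be treated as one block. The Python sketch below is a simplified illustration under stated assumptions (signal length is a power of two, a fixed `threshold`, naive recomputation of sums). The function name `haar_blocks` is hypothetical, and this is not the HaMMLET implementation.

```python
import math


def haar_blocks(signal, threshold):
    """Partition a signal (length a power of two) into dyadic blocks.

    A dyadic interval becomes one block when every orthonormal Haar
    detail coefficient supported inside it has magnitude <= threshold.
    """
    def max_abs_detail(lo, hi):
        if hi - lo == 1:
            return 0.0  # a single observation has no detail coefficient
        mid = (lo + hi) // 2
        # Orthonormal Haar coefficient for the interval [lo, hi).
        detail = (sum(signal[lo:mid]) - sum(signal[mid:hi])) / math.sqrt(hi - lo)
        return max(abs(detail), max_abs_detail(lo, mid), max_abs_detail(mid, hi))

    def split(lo, hi):
        if max_abs_detail(lo, hi) <= threshold:
            return [(lo, hi)]
        mid = (lo + hi) // 2
        return split(lo, mid) + split(mid, hi)

    return split(0, len(signal))


# Hypothetical log-ratio-like data with one jump in the middle.
signal = [0.0, 0.1, 0.0, 0.1, 2.0, 2.1, 2.0, 2.1]
blocks = haar_blocks(signal, threshold=0.2)
print(blocks)

# Each block can then be summarized by sufficient statistics instead of
# being processed observation by observation.
summaries = [(hi - lo, sum(signal[lo:hi])) for lo, hi in blocks]
print(summaries)
```

Lowering the threshold yields finer blocks, so a sampler working on block summaries would spend more effort in regions where the signal changes.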
|
6
|
Qian P, Sun S, Jiang Y, Su KH, Ni T, Wang S, Muzic RF. Cross-domain, soft-partition clustering with diversity measure and knowledge reference. PATTERN RECOGNITION 2016; 50:155-177. [PMID: 27275022 PMCID: PMC4892128 DOI: 10.1016/j.patcog.2015.08.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 05/13/2023]
Abstract
Conventional soft-partition clustering approaches, such as fuzzy c-means (FCM), maximum entropy clustering (MEC) and fuzzy clustering by quadratic regularization (FC-QR), usually perform poorly in situations where the data are insufficient or heavily polluted by underlying noise or outliers. To address this challenge, the quadratic weights and Gini-Simpson diversity based fuzzy clustering model (QWGSD-FC) is first proposed as a basis of our work. Based on QWGSD-FC and inspired by transfer learning, two types of cross-domain, soft-partition clustering frameworks and their corresponding algorithms, referred to as type-I/type-II knowledge-transfer-oriented c-means (TI-KT-CM and TII-KT-CM), are subsequently presented. The primary contributions of our work are four-fold: (1) The delicate QWGSD-FC model inherits most of the merits of FCM, MEC and FC-QR. With the weight factors in the form of quadratic memberships, similar to FCM, it can calculate the total intra-cluster deviation more effectively than the linear form used in MEC and FC-QR. Meanwhile, via the Gini-Simpson diversity index, which plays a role like Shannon entropy in MEC and is equivalent to the quadratic regularization in FC-QR, QWGSD-FC tends to achieve unbiased probability assignments; (2) owing to the reference knowledge from the source domain, both TI-KT-CM and TII-KT-CM demonstrate high clustering effectiveness as well as strong parameter robustness in the target domain; (3) TI-KT-CM refers merely to the historical cluster centroids, whereas TII-KT-CM simultaneously uses the historical cluster centroids and their associated fuzzy memberships as the reference.
This indicates that TII-KT-CM features a more comprehensive knowledge learning capability than TI-KT-CM, and TII-KT-CM consequently exhibits better cross-domain clustering performance; and (4) neither the historical cluster centroids nor the historical cluster centroid based fuzzy memberships involved in TI-KT-CM or TII-KT-CM can be inversely mapped into the raw data. This means that both TI-KT-CM and TII-KT-CM can work without disclosing the original data in the source domain, i.e. they offer good privacy protection for the source domain. In addition, convergence analyses of both TI-KT-CM and TII-KT-CM are conducted in our research. The experimental studies thoroughly evaluated and demonstrated our contributions on both synthetic and real-life data scenarios.
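The "quadratic memberships" this abstract attributes to FCM can be seen in plain fuzzy c-means with fuzzifier m = 2, where each cluster center is a mean of the points weighted by squared memberships. The Python sketch below shows only that baseline, under the assumption of squared Euclidean distance; the function names and toy data are illustrative, and it does not implement QWGSD-FC, TI-KT-CM, or TII-KT-CM.

```python
def memberships_for(points, centers, eps=1e-12):
    """FCM memberships for m = 2: proportional to 1 / squared distance."""
    rows = []
    for x in points:
        inv = [1.0 / (sum((a - b) ** 2 for a, b in zip(x, c)) + eps)
               for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, centers, iterations=50):
    """Alternate membership and center updates for a fixed number of steps."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers)
        new_centers = []
        for j in range(len(centers)):
            # Quadratic weights: each point contributes with u_ij ** 2.
            weights = [row[j] ** 2 for row in u]
            total = sum(weights)
            new_centers.append([
                sum(w * x[d] for w, x in zip(weights, points)) / total
                for d in range(dims)
            ])
        centers = new_centers
    return centers, memberships_for(points, centers)


# Hypothetical one-dimensional data with two well-separated groups.
points = [[0.0], [0.2], [0.4], [9.6], [9.8], [10.0]]
centers, memberships = fuzzy_c_means(points, [[1.0], [9.0]])
print(centers)
print(memberships[0])
```

MEC and FC-QR, by contrast, weight points linearly by their memberships in the deviation term, which is the difference the abstract highlights.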
Affiliation(s)
- Pengjiang Qian
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
- Case Center for Imaging Research, Case Western Reserve University, Cleveland, OH 44106, USA
- Department of Radiology, University Hospitals Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
- Corresponding author at: School of Digital Media, Jiangnan University, Wuxi, Jiangsu, China. Tel.: +86 137 71510961. (P. Qian)
| | - Shouwei Sun
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Yizhang Jiang
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Kuan-Hao Su
- Case Center for Imaging Research, Case Western Reserve University, Cleveland, OH 44106, USA
- Department of Radiology, University Hospitals Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
| | - Tongguang Ni
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
- School of Information Science and Engineering, Changzhou University, Changzhou, Jiangsu 213164, China
| | - Shitong Wang
- School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
| | - Raymond F. Muzic
- Case Center for Imaging Research, Case Western Reserve University, Cleveland, OH 44106, USA
- Department of Radiology, University Hospitals Case Medical Center, Case Western Reserve University, Cleveland, OH 44106, USA
| |
|
7
|
Qian P, Jiang Y, Deng Z, Hu L, Sun S, Wang S, Muzic RF. Cluster Prototypes and Fuzzy Memberships Jointly Leveraged Cross-Domain Maximum Entropy Clustering. IEEE TRANSACTIONS ON CYBERNETICS 2016; 46:181-93. [PMID: 26684257 PMCID: PMC4882931 DOI: 10.1109/tcyb.2015.2399351] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 05/13/2023]
Abstract
The classical maximum entropy clustering (MEC) algorithm usually cannot achieve satisfactory results in situations where the data are insufficient, incomplete, or distorted. To address this problem, inspired by transfer learning, the specific cluster prototypes and fuzzy memberships jointly leveraged (CPM-JL) framework for cross-domain MEC (CDMEC) is first devised in this paper, and then the corresponding algorithm, referred to as CPM-JL-CDMEC, and the dedicated validity index, named fuzzy memberships-based cross-domain difference measurement (FM-CDDM), are concurrently proposed. In general, the contributions of this paper are fourfold: 1) benefiting from the delicate CPM-JL framework, CPM-JL-CDMEC features high clustering effectiveness and robustness even in some complex data situations; 2) the reliability of FM-CDDM has been demonstrated to be close to well-established external criteria, e.g., normalized mutual information and Rand index, and it does not require additional label information. Hence, using FM-CDDM as a dedicated validity index significantly enhances the applicability of CPM-JL-CDMEC under realistic scenarios; 3) the performance of CPM-JL-CDMEC is generally better than, or at least equal to, that of MEC because CPM-JL-CDMEC can degenerate into the standard MEC algorithm after adopting the proper parameters, which avoids the issue of negative transfer; and 4) in order to maximize privacy protection, CPM-JL-CDMEC employs the known cluster prototypes and their associated fuzzy memberships rather than the raw data in the source domain as prior knowledge. The experimental studies thoroughly evaluated and demonstrated these advantages on both synthetic and real-life transfer datasets.
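For orientation, the classical MEC baseline assigns memberships through a softmax over negative squared distances, with an entropy weight controlling how fuzzy the assignment is. The Python sketch below shows only that membership rule; the name `mec_memberships`, the symbol `gamma` for the entropy weight, and the toy values are assumptions for illustration, and nothing here reproduces CPM-JL-CDMEC or FM-CDDM.

```python
import math


def mec_memberships(point, centers, gamma):
    """MEC memberships: softmax of -squared_distance / gamma over clusters."""
    sq = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    smallest = min(sq)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - smallest) / gamma) for d in sq]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical point lying closer to the first of two centers.
point = [1.0]
centers = [[0.0], [4.0]]

crisp = mec_memberships(point, centers, gamma=0.5)    # low entropy weight
fuzzy = mec_memberships(point, centers, gamma=100.0)  # high entropy weight
print(crisp)
print(fuzzy)
```

A small `gamma` gives nearly hard assignments, while a large `gamma` pushes the memberships toward uniform.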
Affiliation(s)
- Pengjiang Qian
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Yizhang Jiang
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Zhaohong Deng
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Lingzhi Hu
- Philips Electronics North America, Cleveland, OH 44143 USA
| | - Shouwei Sun
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Shitong Wang
- School of Digital Media, Jiangnan University, Wuxi 214122, China
| | - Raymond F. Muzic
- Department of Radiology and Case Center for Imaging Research, University Hospitals, Case Western Reserve University, Cleveland, OH 44106 USA
| |
|
8
|
Yu Z, Chen H, You J, Liu J, Wong HS, Han G, Li L. Adaptive Fuzzy Consensus Clustering Framework for Clustering Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:887-901. [PMID: 26357330 DOI: 10.1109/tcbb.2014.2359433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 06/05/2023]
Abstract
Performing clustering analysis is one of the important research topics in cancer discovery using gene expression profiles, which is crucial in facilitating the successful diagnosis and treatment of cancer. While there are quite a number of research works which perform tumor clustering, few of them consider how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we also introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE and improve the performance of RDCFCE further by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE work well on these data sets, and outperform most of the state-of-the-art tumor clustering algorithms.
|
9
|
Jiang Y, Chung FL, Wang S, Deng Z, Wang J, Qian P. Collaborative fuzzy clustering from multiple weighted views. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:688-701. [PMID: 25069132 DOI: 10.1109/tcyb.2014.2334595] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Indexed: 06/03/2023]
Abstract
Clustering with multiview data is becoming a hot topic in data mining, pattern recognition, and machine learning. In order to realize an effective multiview clustering, two issues must be addressed, namely, how to combine the clustering result from each view and how to identify the importance of each view. In this paper, based on a newly proposed objective function which explicitly incorporates two penalty terms, a basic multiview fuzzy clustering algorithm, called collaborative fuzzy c-means (Co-FCM), is firstly proposed. It is then extended into its weighted view version, called weighted view collaborative fuzzy c-means (WV-Co-FCM), by identifying the importance of each view. The WV-Co-FCM algorithm indeed tackles the above two issues simultaneously. Its relationship with the latest multiview fuzzy clustering algorithm Collaborative Fuzzy K-Means (Co-FKM) is also revealed. Extensive experimental results on various multiview datasets indicate that the proposed WV-Co-FCM algorithm outperforms or is at least comparable to the existing state-of-the-art multitask and multiview clustering algorithms and the importance of different views of the datasets can be effectively identified.
|
10
|
Brito I, Hupé P, Neuvial P, Barillot E. Stability-based comparison of class discovery methods for DNA copy number profiles. PLoS One 2013; 8:e81458. [PMID: 24339933 PMCID: PMC3855312 DOI: 10.1371/journal.pone.0081458] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/15/2010] [Accepted: 10/22/2013] [Indexed: 11/19/2022]
Abstract
MOTIVATION: Array-CGH can be used to determine DNA copy number, imbalances in which are a fundamental factor in the genesis and progression of tumors. The discovery of classes with similar patterns of array-CGH profiles therefore adds to our understanding of cancer and the treatment of patients. Various input data representations for array-CGH, dissimilarity measures between tumor samples and clustering algorithms may be used for this purpose. The choice between procedures is often difficult. An evaluation procedure is therefore required to select the best class discovery method (combination of one input data representation, one dissimilarity measure and one clustering algorithm) for array-CGH. Robustness of the resulting classes is a common requirement, but no stability-based comparison of class discovery methods for array-CGH profiles has ever been reported. RESULTS: We applied several class discovery methods and evaluated the stability of their solutions, with a modified version of Bertoni's [Formula: see text]-based test [1]. Our version relaxes the assumption of independence required by Bertoni's original [Formula: see text]-based test. We conclude that Minimal Regions of alteration (a concept introduced by [2]) for input data representation, sim [3] or agree [4] for dissimilarity measure and the use of average group distance in the clustering algorithm produce the most robust classes of array-CGH profiles. AVAILABILITY: The software is available from http://bioinfo.curie.fr/projects/cgh-clustering. It has also been partly integrated into "Visualization and analysis of array-CGH" (VAMP) [5]. The data sets used are publicly available from ACTuDB [6].
Affiliation(s)
- Isabel Brito
- Institut Curie, Paris, France
- INSERM, U900, Paris, France
- Mines ParisTech, Fontainebleau, France
| | - Philippe Hupé
- Institut Curie, Paris, France
- INSERM, U900, Paris, France
- Mines ParisTech, Fontainebleau, France
- CNRS UMR144, Paris, France
| | - Pierre Neuvial
- Laboratoire Statistique & Génome, Université d′Évry Val d′Essonne, UMR CNRS 8071-USC INRA, Évry, France
| | - Emmanuel Barillot
- Institut Curie, Paris, France
- INSERM, U900, Paris, France
- Mines ParisTech, Fontainebleau, France
| |
|
11
|
Functional performance of aCGH design for clinical cytogenetics. Comput Biol Med 2013; 43:775-85. [PMID: 23668354 DOI: 10.1016/j.compbiomed.2013.02.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/17/2011] [Revised: 02/03/2013] [Accepted: 02/05/2013] [Indexed: 12/30/2022]
Abstract
Array-comparative genomic hybridization (aCGH) technology enables rapid, high-resolution analysis of genomic rearrangements. With it, genome copy number changes and rearrangement breakpoints can be detected and analyzed at resolutions down to a few kilobases. A recently proposed exon array CGH approach accurately measures copy-number changes of individual exons in the human genome. The crucial and highly non-trivial starting task is the design of an array, i.e. the choice of an appropriate (multi)set of oligos. The success of the whole high-level analysis depends on the quality of the design. Also, the comparison of several alternative array CGH designs constitutes an important step in the development of a new diagnostic chip. In this paper, we deal with these two often neglected issues. We propose a new approach to measure the quality of array CGH designs. Our measures reflect the robustness of rearrangement detection to noise (mostly experimental measurement error). The method is parametrized by the segmentation algorithm used to identify aberrations. We implemented an efficient Monte Carlo method for testing noise robustness within the DNAcopy procedure. The developed framework has been applied to the evaluation of the functional quality of several optimized array designs.
|
12
|
Yu Z, You J, Li L, Wong HS, Han G. Representative Distance: A New Similarity Measure for Class Discovery From Gene Expression Data. IEEE Trans Nanobioscience 2012; 11:341-51. [DOI: 10.1109/tnb.2012.2208198] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
|
13
|
Cheng YK, Beroukhim R, Levine RL, Mellinghoff IK, Holland EC, Michor F. A mathematical methodology for determining the temporal order of pathway alterations arising during gliomagenesis. PLoS Comput Biol 2012; 8:e1002337. [PMID: 22241976 PMCID: PMC3252265 DOI: 10.1371/journal.pcbi.1002337] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Received: 09/28/2011] [Accepted: 11/17/2011] [Indexed: 12/31/2022]
Abstract
Human cancer is caused by the accumulation of genetic alterations in cells. Of special importance are changes that occur early during malignant transformation because they may result in oncogene addiction and thus represent promising targets for therapeutic intervention. We have previously described a computational approach, called Retracing the Evolutionary Steps in Cancer (RESIC), to determine the temporal sequence of genetic alterations during tumorigenesis from cross-sectional genomic data of tumors at their fully transformed stage. Since alterations within a set of genes belonging to a particular signaling pathway may have similar or equivalent effects, we applied a pathway-based systems biology approach to the RESIC methodology. This method was used to determine whether alterations of specific pathways develop early or late during malignant transformation. When applied to primary glioblastoma (GBM) copy number data from The Cancer Genome Atlas (TCGA) project, RESIC identified a temporal order of pathway alterations consistent with the order of events in secondary GBMs. We then further subdivided the samples into the four main GBM subtypes and determined the relative contributions of each subtype to the overall results: we found that the overall ordering applied for the proneural subtype but differed for mesenchymal samples. The temporal sequence of events could not be identified for neural and classical subtypes, possibly due to a limited number of samples. Moreover, for samples of the proneural subtype, we detected two distinct temporal sequences of events: (i) RAS pathway activation was followed by TP53 inactivation and finally PI3K2 activation, and (ii) RAS activation preceded only AKT activation. This extension of the RESIC methodology provides an evolutionary mathematical approach to identify the temporal sequence of pathway changes driving tumorigenesis and may be useful in guiding the understanding of signaling rearrangements in cancer development. 
Cancer is a deadly disease that develops through the accumulation of genetic changes over time. Many biological models do not incorporate this temporal aspect of tumor formation and progression, in part due to the difficulty of determining the sequence of events through biological experimentation for most cancer types. We previously developed a computational algorithm with which we can quickly and cost-effectively determine the order in which mutations arise in the tumor even when large numbers of mutations are considered. In this paper, we extended our method to incorporate biological knowledge of the common pathways by which cancer progresses. We applied these techniques to primary glioblastoma, the most common form of brain cancer. We found that when all samples are taken into account, a temporal sequence of pathway events emerges; however, different subtypes of glioblastoma vary in their temporal sequence of events. This algorithm can also be easily applied to other cancer types as clinical data becomes available, showing the benefit of computational and mathematical tools in cancer research. Using temporal information, cancer biologists will be able to develop more accurate animal models of tumor formation and learn more about how mutations interact in time, thus leading to better treatments for cancer.
Affiliation(s)
- Yu-Kang Cheng
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Cancer Biology and Genetics Program, Brain Tumor Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Tri-Institutional Training Program in Computational Biology and Medicine, Weill Cornell Medical College, New York, New York, United States of America
| | - Rameen Beroukhim
- Departments of Cancer Biology and Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America, Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States of America, and Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
| | - Ross L. Levine
- Human Oncology and Pathogenesis Program, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
| | - Ingo K. Mellinghoff
- Human Oncology and Pathogenesis Program, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
| | - Eric C. Holland
- Cancer Biology and Genetics Program, Brain Tumor Center, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
| | - Franziska Michor
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, and Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
| |
|
14
|
Deng Z, Choi KS, Chung FL, Wang S. EEW-SC: Enhanced Entropy-Weighting Subspace Clustering for high dimensional gene expression data clustering analysis. Appl Soft Comput 2011. [DOI: 10.1016/j.asoc.2011.07.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
|
15
|
Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S. Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics 2011; 12:413-28. [PMID: 21209153 DOI: 10.1093/biostatistics/kxq076] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Indexed: 01/03/2023]
Abstract
The statistical analysis of array comparative genomic hybridization (CGH) data has now shifted to the joint assessment of copy number variations at the cohort level. Considering multiple profiles gives the opportunity to correct for systematic biases observed on single profiles, such as probe GC content or the so-called "wave effect." In this article, we extend the segmentation model developed in the univariate case to the joint analysis of multiple CGH profiles. Our contributions are several: we propose an integrated model to perform joint segmentation, normalization, and calling for multiple array CGH profiles. This model shows great flexibility, especially in the modeling of the wave effect, and gives a likelihood framework to approaches proposed by others. We propose a new dynamic programming algorithm for break point positioning, as well as a model selection criterion based on a modified Bayesian information criterion proposed in the univariate case. The performance of our method is assessed using simulated and real data sets. Our method is implemented in the R package cghseg.
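As background for the break point positioning mentioned in this abstract, the classical single-profile problem can be solved exactly by dynamic programming: choose the break points that minimize the total squared error of a piecewise-constant fit with a fixed number of segments. The Python sketch below shows that textbook univariate version only. The function name `segment` and the toy profile are assumptions for illustration; it is not the new multi-profile algorithm of the paper, nor the cghseg implementation.

```python
def segment(signal, n_segments):
    """Return the break points of the best piecewise-constant fit.

    The fit uses exactly `n_segments` contiguous segments and minimizes
    the total squared error; break points are segment start indices.
    """
    n = len(signal)
    # Prefix sums give the squared error of any interval in O(1).
    s1, s2 = [0.0], [0.0]
    for v in signal:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        """Squared error of signal[i:j] around its own mean (j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting signal[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Walk the back-pointers from the end to recover segment starts.
    starts = []
    j = n
    for k in range(n_segments, 0, -1):
        j = back[k][j]
        starts.append(j)
    # Drop the start of the first segment (index 0) and sort ascending.
    return list(reversed(starts[:-1]))


# Hypothetical profile with a gained region in the middle.
profile = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.1]
breaks = segment(profile, 3)
print(breaks)
```

This version costs O(K n^2) time for K segments and n probes; choosing K is a separate model selection step, which the abstract addresses with a modified Bayesian information criterion.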
Affiliation(s)
- Franck Picard
- Laboratoire de Biometrie et Biologie Evolutive, UMR CNRS 5558 - Univ. Lyon 1, F-69622, Villeurbanne, France.
| |
|
16
|
Tolliver D, Tsourakakis C, Subramanian A, Shackney S, Schwartz R. Robust unmixing of tumor states in array comparative genomic hybridization data. Bioinformatics 2010; 26:i106-14. [PMID: 20529894 PMCID: PMC2881397 DOI: 10.1093/bioinformatics/btq213] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Tumorigenesis is an evolutionary process by which tumor cells acquire sequences of mutations leading to increased growth, invasiveness and eventually metastasis. It is hoped that by identifying the common patterns of mutations underlying major cancer sub-types, we can better understand the molecular basis of tumor development and identify new diagnostics and therapeutic targets. This goal has motivated several attempts to apply evolutionary tree reconstruction methods to assays of tumor state. Inference of tumor evolution is in principle aided by the fact that tumors are heterogeneous, retaining remnant populations of different stages along their development along with contaminating healthy cell populations. In practice, though, this heterogeneity complicates interpretation of tumor data because distinct cell types are conflated by common methods for assaying the tumor state. We previously proposed a method to computationally infer cell populations from measures of tumor-wide gene expression through a geometric interpretation of mixture type separation, but this approach deals poorly with noisy and outlier data. RESULTS: In the present work, we propose a new method to perform tumor mixture separation efficiently and robustly with respect to experimental error. The method builds on the prior geometric approach but uses a novel objective function allowing for robust fits that greatly reduce the sensitivity to noise and outliers. We further develop an efficient gradient optimization method to optimize this 'soft geometric unmixing' objective for measurements of tumor DNA copy numbers assessed by array comparative genomic hybridization (aCGH) data. We show, on a combination of semi-synthetic and real data, that the method yields fast and accurate separation of tumor states.
CONCLUSIONS: We have shown a novel objective function and optimization method for the robust separation of tumor sub-types from aCGH data and have shown that the method provides fast, accurate reconstruction of tumor states from mixed samples. Better solutions to this problem can be expected to improve our ability to accurately identify genetic abnormalities in primary tumor samples and to infer patterns of tumor evolution. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Tolliver
- Computer Science Department, Carnegie Mellon University, Pittsburgh PA 15213, USA.
|
17
|
Kim KY, Kim J, Kim HJ, Nam W, Cha IH. A method for detecting significant genomic regions associated with oral squamous cell carcinoma using aCGH. Med Biol Eng Comput 2010; 48:459-68. [DOI: 10.1007/s11517-010-0595-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/17/2009] [Accepted: 02/26/2010] [Indexed: 12/14/2022]
|
18
|
van de Wiel MA, Picard F, van Wieringen WN, Ylstra B. Preprocessing and downstream analysis of microarray DNA copy number profiles. Brief Bioinform 2010; 12:10-21. [PMID: 20172948 DOI: 10.1093/bib/bbq004] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Indexed: 11/14/2022] Open
Abstract
Analysis of DNA copy number profiles requires methods tailored to the specific nature of these data. The number of available data analysis methods has grown enormously in the last 5 years. We discuss the typical characteristics of DNA copy number data, as measured by microarray technology and review the extensive literature on preprocessing methods such as segmentation and calling. Subsequently, the focus narrows to applications of DNA copy number in cancer, in particular, several downstream analyses of multi-sample data sets such as testing, clustering and classification. Finally, we look ahead: what should we prepare for and which methodology-related topics may deserve attention in the near future?
Affiliation(s)
- Mark A van de Wiel
- Department of Epidemiology & Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.
|
19
|
Tang J, Le S, Sun L, Yan X, Zhang M, Macleod J, Leroy B, Northrup N, Ellis A, Yeatman TJ, Liang Y, Zwick ME, Zhao S. Copy number abnormalities in sporadic canine colorectal cancers. Genome Res 2010; 20:341-50. [PMID: 20086242 DOI: 10.1101/gr.092726.109] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Indexed: 12/11/2022]
Abstract
Human colorectal cancer (CRC) is one of the better-understood systems for studying the genetics of cancer initiation and progression. To develop a cross-species comparison strategy for identifying CRC causative gene or genomic alterations, we performed array comparative genomic hybridization (aCGH) to investigate copy number abnormalities (CNAs), one of the most prominent lesion types reported for human CRCs, in 10 spontaneously occurring canine CRCs. The results revealed for the first time a strong degree of genetic homology between sporadic canine and human CRCs. First, we saw that between 5% and 22% of the canine genome was amplified/deleted in these tumors, and that, reminiscent of human CRCs, the total altered sequences directly correlated to the tumor's progression stage, origin, and likely microsatellite instability status. Second, when mapping the identified CNAs onto syntenic regions of the human genome, we noted that the canine orthologs of genes participating in known human CRC pathways were recurrently disrupted, indicating that these pathways might be altered in the canine CRCs as well. Last, we observed a significant overlapping of CNAs between human and canine tumors, and tumors from the two species were clustered according to the tumor subtypes but not the species. Significantly, compared with the shared CNAs, we found that species-specific (especially human-specific) CNAs localize to evolutionarily unstable regions that harbor more segmental duplications and interspecies genomic rearrangement breakpoints. These findings indicate that CNAs recurrent between human and dog CRCs may have a higher probability of being cancer-causative, compared with CNAs found in one species only.
Affiliation(s)
- Jie Tang
- Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, Georgia 30602, USA
|
20
|
Kim KY, Lee GY, Kim J, Jeung HC, Chung HC, Rha SY. Identification of significant regional genetic variations using continuous CNV values in aCGH data. Genomics 2009; 94:317-23. [DOI: 10.1016/j.ygeno.2009.08.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/26/2009] [Revised: 07/20/2009] [Accepted: 08/11/2009] [Indexed: 11/26/2022]
|
21
|
van Houte BPP, Heringa J. Accurate confidence aware clustering of array CGH tumor profiles. Bioinformatics 2009; 26:6-14. [PMID: 19846437 DOI: 10.1093/bioinformatics/btp603] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Chromosomal aberrations tend to be characteristic for given (sub)types of cancer. Such aberrations can be detected with array comparative genomic hybridization (aCGH). Clustering aCGH tumor profiles aids in identifying chromosomal regions of interest and provides useful diagnostic information on the cancer type. An important issue here is to what extent individual aCGH tumor profiles can be reliably assigned to clusters associated with a given cancer type. RESULTS We introduce a novel evolutionary fuzzy clustering (EFC) algorithm, which is able to deal with overlapping clusterings. Our method assesses these overlaps by using cluster membership degrees, which we use here as a confidence measure for individual samples to be assigned to a given tumor type. We first demonstrate the usefulness of our method using a synthetic aCGH dataset and subsequently show that EFC outperforms existing methods on four real datasets of aCGH tumor profiles involving four different cancer types. We also show that in general best performance is obtained using 1 - Pearson correlation coefficient as a distance measure and that extra preprocessing steps, such as segmentation and calling, lead to decreased clustering performance. AVAILABILITY The source code of the program is available from http://ibi.vu.nl/programs/efcwww
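As a small illustration of the distance measure named in this abstract (one minus the Pearson correlation coefficient), the following is a minimal sketch in plain Python. It is not taken from the EFC source code; the function name `pearson_distance` and the example profiles are hypothetical and the values are made up for demonstration.

```python
import math


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Center both profiles on their means.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / norm


# Hypothetical log-ratio profiles over five probes (illustrative values only).
profile_a = [0.1, 0.8, 0.9, -0.5, 0.0]
profile_b = [0.2, 1.6, 1.8, -1.0, 0.0]    # profile_a scaled by 2
profile_c = [-0.1, -0.8, -0.9, 0.5, 0.0]  # profile_a negated

d_ab = pearson_distance(profile_a, profile_b)  # close to 0: same shape
d_ac = pearson_distance(profile_a, profile_c)  # close to 2: opposite shape
```

Under this measure the distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles), and it ignores differences in overall scale, which is why the scaled profile is at distance approximately zero from the original.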
Affiliation(s)
- Bart P P van Houte
- Centre for Integrative Bioinformatics VU (IBIVU), Faculty of Sciences and Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
|
22
|
Rueda OM, Diaz-Uriarte R. Detection of recurrent copy number alterations in the genome: taking among-subject heterogeneity seriously. BMC Bioinformatics 2009; 10:308. [PMID: 19775444 PMCID: PMC2760535 DOI: 10.1186/1471-2105-10-308] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 04/01/2009] [Accepted: 09/23/2009] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Alterations in the number of copies of genomic DNA that are common or recurrent among diseased individuals are likely to contain disease-critical genes. Unfortunately, defining common or recurrent copy number alteration (CNA) regions remains a challenge. Moreover, the heterogeneous nature of many diseases requires that we search for common or recurrent CNA regions that affect only some subsets of the samples (without knowledge of the regions and subsets affected), but this is neglected by most methods. RESULTS We have developed two methods to define recurrent CNA regions from aCGH data. Our methods are unique and qualitatively different from existing approaches: they detect regions over both the complete set of arrays and alterations that are common only to some subsets of the samples (i.e., alterations that might characterize previously unknown groups); they use probabilities of alteration as input and return probabilities of being a common region, thus allowing researchers to modify thresholds as needed; the two parameters of the methods have an immediate, straightforward, biological interpretation. Using data from previous studies, we show that we can detect patterns that other methods miss and that researchers can modify, as needed, thresholds of immediate interpretability and develop custom statistics to answer specific research questions. CONCLUSION These methods represent a qualitative advance in the location of recurrent CNA regions, highlight the relevance of population heterogeneity for definitions of recurrence, and can facilitate the clustering of samples with respect to patterns of CNA. Ultimately, the methods developed can become important tools in the search for genomic regions harboring disease-critical genes.
Affiliation(s)
- Oscar M Rueda
- Structural and Computational Biology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
- Breast Cancer Functional Genomics, Cancer Research UK, Cambridge, UK
- Ramon Diaz-Uriarte
- Structural and Computational Biology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, 28029 Madrid, Spain
|
23
|
Gerstung M, Baudis M, Moch H, Beerenwinkel N. Quantifying cancer progression with conjunctive Bayesian networks. Bioinformatics 2009; 25:2809-15. [PMID: 19692554 PMCID: PMC2781752 DOI: 10.1093/bioinformatics/btp505] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.9] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Cancer is an evolutionary process characterized by accumulating mutations. However, the precise timing and the order of genetic alterations that drive tumor progression remain enigmatic. RESULTS We present a specific probabilistic graphical model for the accumulation of mutations and their interdependencies. The Bayesian network models cancer progression by an explicit unobservable accumulation process in time that is separated from the observable but error-prone detection of mutations. Model parameters are estimated by an Expectation-Maximization algorithm and the underlying interaction graph is obtained by a simulated annealing procedure. Applying this method to cytogenetic data for different cancer types, we find multiple complex oncogenetic pathways deviating substantially from simplified models, such as linear pathways or trees. We further demonstrate how the inferred progression dynamics can be used to improve genetics-based survival predictions which could support diagnostics and prognosis. AVAILABILITY The software package ct-cbn is available under a GPL license on the web site cbg.ethz.ch/software/ct-cbn CONTACT moritz.gerstung@bsse.ethz.ch.
Affiliation(s)
- Moritz Gerstung
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland.
|
24
|
Abstract
Recurrent chromosomal alterations provide cytological and molecular positions for the diagnosis and prognosis of cancer. Comparative genomic hybridization (CGH) has been useful in understanding these alterations in cancerous cells. CGH datasets consist of samples that are represented by large dimensional arrays of intervals. Each sample consists of long runs of intervals with losses and gains. In this article, we develop novel SVM-based methods for classification and feature selection of CGH data. For classification, we developed a novel similarity kernel that is shown to be more effective than the standard linear kernel used in SVM. For feature selection, we propose a novel method based on the new kernel that iteratively selects features that provide the maximum benefit for classification. We compared our methods against the best wrapper-based and filter-based approaches that have been used for feature selection of large dimensional biological data. Our results on datasets generated from the Progenetix database suggest that our methods are considerably superior to existing methods. AVAILABILITY All software developed in this article can be downloaded from http://plaza.ufl.edu/junliu/feature.tar.gz.
Affiliation(s)
- Jun Liu
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA.
|
25
|
Affiliation(s)
- Wessel N Van Wieringen
- Department of Mathematics, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands.
|
26
|
Baudis M. Genomic imbalances in 5918 malignant epithelial tumors: an explorative meta-analysis of chromosomal CGH data. BMC Cancer 2007; 7:226. [PMID: 18088415 PMCID: PMC2225423 DOI: 10.1186/1471-2407-7-226] [Citation(s) in RCA: 114] [Impact Index Per Article: 6.7] [Received: 04/26/2007] [Accepted: 12/18/2007] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Chromosomal abnormalities have been associated with most human malignancies, with gains and losses on some genomic regions associated with particular entities. METHODS Of the 15429 cases collected for the Progenetix molecular-cytogenetic database, 5918 malignant epithelial neoplasias analyzed by chromosomal Comparative Genomic Hybridization (CGH) were selected for further evaluation. For the 22 clinico-pathological entities with more than 50 cases, summary profiles for genomic imbalances were generated from case specific data and analyzed. RESULTS With large variation in overall genomic instability, recurring genomic gains and losses were prominent. Most entities showed frequent gains involving 8q2, while gains on 20q, 1q, 3q, 5p, 7q and 17q were frequent in different entities. Loss "hot spots" included 3p, 4q, 13q, 17p and 18q among others. Related average imbalance patterns were found for clinically distinct entities, e.g. hepatocellular carcinomas (ca.) and ductal breast ca., as well as for histologically related entities (squamous cell ca. of different sites). CONCLUSION Although considerable case-by-case variation of genomic profiles can be found by CGH in epithelial malignancies, a limited set of variously combined chromosomal imbalances may be typical for carcinogenesis. Focus on the respective regions should aid in target gene detection and pathway deduction.
Affiliation(s)
- Michael Baudis
- Institute of Molecular Biology, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland.
|
27
|
Shah SP, Lam WL, Ng RT, Murphy KP. Modeling recurrent DNA copy number alterations in array CGH data. Bioinformatics 2007; 23:i450-8. [PMID: 17646330 DOI: 10.1093/bioinformatics/btm221] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION Recurrent DNA copy number alterations (CNA) measured with array comparative genomic hybridization (aCGH) reveal important molecular features of human genetics and disease. Studying aCGH profiles from a phenotypic group of individuals can determine important recurrent CNA patterns that suggest a strong correlation to the phenotype. Computational approaches to detecting recurrent CNAs from a set of aCGH experiments have typically relied on discretizing the noisy log ratios and subsequently inferring patterns. We demonstrate that this can have the effect of filtering out important signals present in the raw data. In this article we develop statistical models that jointly infer CNA patterns and the discrete labels by borrowing statistical strength across samples. RESULTS We propose extending single sample aCGH HMMs to the multiple sample case in order to infer shared CNAs. We model recurrent CNAs as a profile encoded by a master sequence of states that generates the samples. We show how to improve on two basic models by performing joint inference of the discrete labels and providing sparsity in the output. We demonstrate on synthetic ground truth data and real data from lung cancer cell lines how these two important features of our model improve results over baseline models. We include standard quantitative metrics and a qualitative assessment on which to base our conclusions. AVAILABILITY http://www.cs.ubc.ca/~sshah/acgh.
Affiliation(s)
- Sohrab P Shah
- Department of Computer Science, University of British Columbia, 201-2366 Main Mall Vancouver, BC V6T 1Z4 Canada.
|
28
|
Yu T, Ye H, Sun W, Li KC, Chen Z, Jacobs S, Bailey DK, Wong DT, Zhou X. A forward-backward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using high-density single nucleotide polymorphism (SNP) array. BMC Bioinformatics 2007; 8:145. [PMID: 17477871 PMCID: PMC1868765 DOI: 10.1186/1471-2105-8-145] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 10/16/2006] [Accepted: 05/03/2007] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND DNA copy number aberration (CNA) is one of the key characteristics of cancer cells. Recent studies demonstrated the feasibility of utilizing high density single nucleotide polymorphism (SNP) genotyping arrays to detect CNA. Compared with the two-color array-based comparative genomic hybridization (array-CGH), the SNP arrays offer much higher probe density and lower signal-to-noise ratio at the single SNP level. To accurately identify small segments of CNA from SNP array data, segmentation methods that are sensitive to CNA while resistant to noise are required. RESULTS We have developed a highly sensitive algorithm for the edge detection of copy number data which is especially suitable for the SNP array-based copy number data. The method consists of an over-sensitive edge-detection step and a test-based forward-backward edge selection step. CONCLUSION Using simulations constructed from real experimental data, the method shows high sensitivity and specificity in detecting small copy number changes in focused regions. The method is implemented in an R package FASeg, which includes data processing and visualization utilities, as well as libraries for processing Affymetrix SNP array data.
Affiliation(s)
- Tianwei Yu
- Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
- Hui Ye
- Center for Molecular Biology of Oral Diseases, College of Dentistry, University of Illinois at Chicago, Chicago, IL, USA
- Shanghai Children's Medical Center, Shanghai Jiao-Tong University, Shanghai, China
- Wei Sun
- Department of Statistics, University of California at Los Angeles, CA, USA
- Ker-Chau Li
- Department of Statistics, University of California at Los Angeles, CA, USA
- Zugen Chen
- Department of Human Genetics & Microarray Core, University of California at Los Angeles, Los Angeles, CA, USA
- Sharoni Jacobs
- Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA, USA
- Dione K Bailey
- Affymetrix, Inc., 3420 Central Expressway, Santa Clara, CA, USA
- David T Wong
- Dental Research Institute, School of Dentistry, David Geffen School of Medicine & Henry Samueli School of Engineering & Jonsson Comprehensive Cancer Center, University of California at Los Angeles, Los Angeles, CA, USA
- Xiaofeng Zhou
- Center for Molecular Biology of Oral Diseases, College of Dentistry, University of Illinois at Chicago, Chicago, IL, USA
- Guanghua School & Research Institute of Stomatology, Sun Yat-Sen University, Guangzhou, China
|
29
|
Abstract
MOTIVATION We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples using similarity based clustering methods. A key requirement for clustering is to avoid using the noisy aberrations in the CGH samples. RESULTS We develop a dynamic programming algorithm to identify a small set of important genomic intervals called markers. The advantage of using these markers is that the potentially noisy genomic intervals are excluded during the clustering process. We also develop two clustering strategies using these markers. The first one, prototype-based approach, maximizes the support for the markers. The second one, similarity-based approach, develops a new similarity measure called RSim and refines clusters with the aim of maximizing the RSim measure between the samples in the same cluster. Our results demonstrate that the markers we found represent the aberration patterns of cancer types well and they improve the quality of clustering significantly. AVAILABILITY All software developed in this paper and all the datasets used are available from the authors upon request.
Affiliation(s)
- Jun Liu
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA.
|