1
Li J, He L, Zhang X, Li X, Wang L, Zhu Z, Song K, Wang X. GCclassifier: An R package for the prediction of molecular subtypes of gastric cancer. Comput Struct Biotechnol J 2024; 23:752-758. [PMID: 38304548 PMCID: PMC10831507 DOI: 10.1016/j.csbj.2024.01.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/03/2024] Open
Abstract
Gastric cancer (GC) is one of the most commonly diagnosed malignancies, threatening millions of lives worldwide each year. Importantly, GC is a heterogeneous disease, posing a significant challenge to the selection of patients for more optimized therapy. Over the last decades, extensive community effort has been spent on dissecting the heterogeneity of GC, leading to the identification of distinct molecular subtypes that are clinically relevant. However, so far, no tool is publicly available for GC subtype prediction, hindering the research into GC subtype-specific biological mechanisms, the design of novel targeted agents, and potential clinical applications. To address the unmet need, we developed an R package GCclassifier for predicting GC molecular subtypes based on gene expression profiles. To facilitate the use by non-bioinformaticians, we also provide an interactive, user-friendly web server implementing the major functionalities of GCclassifier. The predictive performance of GCclassifier was demonstrated using case studies on multiple independent datasets.
Affiliation(s)
- Jiang Li
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Department of Biomedical Sciences, City University of Hong Kong, Hong Kong Special Administrative Region of China
- Lingli He
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Department of Biomedical Sciences, City University of Hong Kong, Hong Kong Special Administrative Region of China
- Xianrui Zhang
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Department of Biomedical Sciences, City University of Hong Kong, Hong Kong Special Administrative Region of China
- Xiang Li
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Lishi Wang
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Zhongxu Zhu
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- HIM-BGI Omics Center, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, China
- Kai Song
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, Region of China
- Xin Wang
- Department of Surgery, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong Special Administrative Region of China
- Shenzhen Research Institute, The Chinese University of Hong Kong, Shenzhen, Region of China
2
Liu F, Yang Y, Xu XS, Yuan M. MESBC: A novel mutually exclusive spectral biclustering method for cancer subtyping. Comput Biol Chem 2024; 109:108009. [PMID: 38219419 DOI: 10.1016/j.compbiolchem.2023.108009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 01/16/2024]
Abstract
Many soft biclustering algorithms have been developed and applied to various biological and biomedical data analyses. However, few mutually exclusive (hard) biclustering algorithms have been proposed, although such algorithms could better identify disease or molecular subtypes with survival significance based on genomic or transcriptomic data. In this study, we developed a novel mutually exclusive spectral biclustering (MESBC) algorithm, based on a spectral method, to detect mutually exclusive biclusters. MESBC simultaneously detects relevant features (genes) and the corresponding condition (patient) subgroups and therefore automatically uses the signature features of each subtype to perform the clustering. Extensive simulations revealed that MESBC provided superior accuracy in detecting pre-specified biclusters compared with non-negative matrix factorization (NMF) and Dhillon's algorithm, particularly in very noisy data. Further analysis of the algorithm on real datasets obtained from the TCGA database showed that MESBC provided more accurate (i.e., smaller p-value) overall survival prediction in patients with lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) than the existing gold-standard subtypes for lung cancers (integrative clustering). Furthermore, MESBC detected several genes with significant prognostic value in both LUAD and LUSC patients. External validation on an independent, unseen GEO dataset of LUAD showed that MESBC-derived clusters based on TCGA data still exhibited clear biclustering patterns and consistent, outstanding prognostic predictability, demonstrating the robust generalizability of MESBC. Therefore, MESBC could potentially be used as a risk stratification tool to optimize treatment for patients, improve the selection of patients for clinical trials, and contribute to the development of novel therapeutic agents.
Affiliation(s)
- Fengrong Liu
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Yaning Yang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
- Min Yuan
- School of Public Health Administration, Anhui Medical University, Hefei 230032, China.
3
Ye X, Shang Y, Shi T, Zhang W, Sakurai T. Multi-omics clustering for cancer subtyping based on latent subspace learning. Comput Biol Med 2023; 164:107223. [PMID: 37490833 DOI: 10.1016/j.compbiomed.2023.107223] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/30/2023] [Revised: 06/07/2023] [Accepted: 06/30/2023] [Indexed: 07/27/2023]
Abstract
The increased availability of high-throughput technologies has enabled biomedical researchers to learn about disease etiology across multiple omics layers, which shows promise for improving cancer subtype identification. Many computational methods have been developed to perform clustering on multi-omics data; however, only a few of them are applicable to partial multi-omics data, in which some samples lack data for some omics types. In this study, we propose a novel multi-omics clustering method based on latent subspace learning (MCLS), which can handle missing omics data during clustering. We use the samples with complete omics to construct a latent subspace using PCA-based feature extraction and singular value decomposition (SVD). The samples with incomplete multi-omics data are then projected onto the latent subspace, and spectral clustering is performed to find the clusters. The proposed MCLS method is evaluated on seven different cancer datasets with three levels of omics, in both full and partial cases, and compared with several state-of-the-art methods. The experimental results show that the proposed MCLS method is more efficient and effective than the compared methods for cancer subtype identification in multi-omics data analysis, providing an important reference for a comprehensive understanding of cancer and its biological mechanisms. AVAILABILITY: The proposed method is freely accessible at https://github.com/ShangCS/MCLS.
Affiliation(s)
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan; Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba, 3058577, Japan.
- Yifan Shang
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Tianyi Shi
- Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba, 3058577, Japan
- Weihang Zhang
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan
- Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba, 3058577, Japan; Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba, 3058577, Japan
4
Eastwood M, Marc ST, Gao X, Sailem H, Offman J, Karteris E, Fernandez AM, Jonigk D, Cookson W, Moffatt M, Popat S, Minhas F, Robertus JL. Malignant Mesothelioma subtyping via sampling driven multiple instance prediction on tissue image and cell morphology data. Artif Intell Med 2023; 143:102628. [PMID: 37673586 DOI: 10.1016/j.artmed.2023.102628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 06/30/2023] [Accepted: 07/14/2023] [Indexed: 09/08/2023]
Abstract
Malignant Mesothelioma is a difficult-to-diagnose and highly lethal cancer usually associated with asbestos exposure. It can be broadly classified into three subtypes: Epithelioid, Sarcomatoid, and a hybrid Biphasic subtype in which significant components of both of the previous subtypes are present. Early diagnosis and identification of the subtype inform treatment and can help improve patient outcome. However, the subtyping of malignant mesothelioma, and specifically the recognition of transitional features from routine histology slides, has a high level of inter-observer variability. In this work, we propose an end-to-end multiple instance learning (MIL) approach for malignant mesothelioma subtyping. This uses an adaptive instance-based sampling scheme for training deep convolutional neural networks on bags of image patches, which allows learning on a wider range of relevant instances compared to max or top-N based MIL approaches. We also investigate augmenting the instance representation to include aggregate cellular morphology features from cell segmentation. The proposed MIL approach enables identification of malignant mesothelial subtypes of specific tissue regions. From this, a continuous characterisation of a sample according to the predominance of sarcomatoid vs epithelioid regions is possible, thus avoiding the arbitrary and highly subjective categorisation by currently used subtypes. Instance scoring also enables studying tumor heterogeneity and identifying patterns associated with different subtypes. We have evaluated the proposed method on a dataset of 234 tissue micro-array cores with an AUROC of 0.89±0.05 for this task. The dataset and developed methodology are available for the community at: https://github.com/measty/PINS.
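The contrast drawn in this abstract between top-N selection and sampling-based instance selection can be illustrated with a small sketch. The abstract does not describe the adaptive scheme itself, so score-proportional sampling is used here purely as an assumed stand-in; the function names and patch scores are hypothetical.

```python
import random

def top_n_instances(scores, n):
    """Indices of the n highest-scoring instances (top-N MIL selection)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def sample_instances(scores, n, rng):
    """Draw n indices with probability proportional to score (with replacement).

    Assumption: a simple stand-in for an adaptive sampling scheme, not the
    scheme used in the paper.
    """
    return rng.choices(range(len(scores)), weights=scores, k=n)

# Hypothetical per-patch scores for one bag (one tissue core).
patch_scores = [0.05, 0.90, 0.60, 0.10, 0.55]
rng = random.Random(0)  # fixed seed so the draw is reproducible

top = top_n_instances(patch_scores, 2)         # always the same two patches
drawn = sample_instances(patch_scores, 2, rng)  # can reach lower-ranked patches
```

Top-N always trains on the same few patches per bag, whereas sampling gives every patch with a non-zero score some chance of being selected, which is one way to expose the network to a wider range of instances.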
Affiliation(s)
- Mark Eastwood
- Tissue Image Analytics Center, University of Warwick, United Kingdom.
- Silviu Tudor Marc
- Department of Computer Science, University of Middlesex, United Kingdom
- Xiaohong Gao
- Department of Computer Science, University of Middlesex, United Kingdom
- Heba Sailem
- Institute of Biomedical Engineering, University of Oxford, United Kingdom; Kings College London, United Kingdom
- Judith Offman
- Kings College London, United Kingdom; Wolfson Institute of Population Health, Queen Mary University of London, United Kingdom
- Danny Jonigk
- German Center for Lung Research (DZL), BREATH, Hanover, Germany; Institute of Pathology, Medical Faculty of RWTH Aachen University, Aachen, Germany
- William Cookson
- National Heart and Lung Institute, Imperial College London, United Kingdom
- Miriam Moffatt
- National Heart and Lung Institute, Imperial College London, United Kingdom
- Sanjay Popat
- National Heart and Lung Institute, Imperial College London, United Kingdom
- Fayyaz Minhas
- Tissue Image Analytics Center, University of Warwick, United Kingdom
- Jan Lukas Robertus
- National Heart and Lung Institute, Imperial College London, United Kingdom
5
Luo J, Feng Y, Wu X, Li R, Shi J, Chang W, Wang J. ForestSubtype: a cancer subtype identifying approach based on high-dimensional genomic data and a parallel random forest. BMC Bioinformatics 2023; 24:289. [PMID: 37468832 DOI: 10.1186/s12859-023-05412-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Accepted: 07/13/2023] [Indexed: 07/21/2023] Open
Abstract
BACKGROUND Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this prior knowledge is significant for further identifying new meaningful subtypes. RESULTS In this paper, we present ForestSubtype, a combined parallel random forest and autoencoder approach for cancer subtype identification based on high-dimensional gene expression data. First, ForestSubtype adopts the parallel RF and the prior knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature weight and rank the features separately. Then, the intersection of the features with the larger weights output by the ten parallel random forests is taken as the subsequent candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. In this paper, the breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets to validate the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters and the internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype.
CONCLUSIONS Our work shows that the combination of high-dimensional gene expression data and parallel random forests and autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
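The feature-selection step described above (rank features in each parallel forest, keep those with the larger weights, then intersect) can be sketched as follows. This is a minimal illustration under assumptions: the abstract does not say how many top features are kept per forest, so the cut-off `k`, the function names, and the weight values are hypothetical, and three weight tables are shown where the paper uses ten forests.

```python
def top_k_features(weights, k):
    """Names of the k features with the largest weights in one forest."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return set(ranked[:k])

def consensus_features(weight_tables, k):
    """Features that rank in the top k of every weight table."""
    selected = [top_k_features(w, k) for w in weight_tables]
    return set.intersection(*selected)

# Hypothetical importance weights from three forests (illustrative values only).
forest_weights = [
    {"g1": 0.40, "g2": 0.30, "g3": 0.20, "g4": 0.10},
    {"g1": 0.35, "g2": 0.15, "g3": 0.30, "g4": 0.20},
    {"g1": 0.50, "g2": 0.05, "g3": 0.25, "g4": 0.20},
]

candidates = consensus_features(forest_weights, k=2)
```

Taking the intersection keeps only features that every forest ranks highly, which trades recall for stability: a feature favoured by a single forest is dropped.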
Affiliation(s)
- Junwei Luo
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Yading Feng
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Xuyang Wu
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Ruimin Li
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Jiawei Shi
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Wenjing Chang
- School of Software, Henan Polytechnic University, Jiaozuo, China
- Junfeng Wang
- School of Software, Henan Polytechnic University, Jiaozuo, China.
6
Chen Z, Yang Z, Zhu L, Gao P, Matsubara T, Kanaya S, Altaf-Ul-Amin M. Learning vector quantized representation for cancer subtypes identification. Comput Methods Programs Biomed 2023; 236:107543. [PMID: 37100024 DOI: 10.1016/j.cmpb.2023.107543] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/27/2022] [Revised: 02/13/2023] [Accepted: 04/07/2023] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Defining and separating cancer subtypes is essential for facilitating personalized therapy and prognosis of patients. The definition of subtypes has been constantly recalibrated as our understanding has deepened. During this recalibration, researchers often rely on clustering of cancer data to provide an intuitive visual reference that could reveal the intrinsic characteristics of subtypes. The data being clustered are often omics data, such as transcriptomics, that correlate strongly with the underlying biological mechanism. However, although existing studies have shown promising results, they suffer from issues associated with omics data, namely sample scarcity and high dimensionality, and they impose unrealistic assumptions in order to extract useful features from the data while avoiding overfitting to spurious correlations. METHODS This paper proposes to leverage a recent strong generative model, the Vector-Quantized Variational AutoEncoder, to tackle the data issues and extract discrete representations that are crucial to the quality of subsequent clustering by retaining only information relevant to reconstructing the input. RESULTS Extensive experiments and medical analysis on multiple datasets comprising 10 distinct cancers demonstrate that the proposed clustering results can significantly and robustly improve prognosis over prevalent subtyping systems. CONCLUSION Our proposal does not impose strict assumptions on the data distribution, and its latent features are better representations of the transcriptomic data in different cancer subtypes, capable of yielding superior clustering performance with any mainstream clustering method.
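The discrete representation at the core of a Vector-Quantized Variational AutoEncoder comes from replacing each continuous encoder output with its nearest codebook vector. The sketch below shows only that lookup step, as a general illustration of vector quantization rather than the authors' model; the codebook values and the encoder output are hypothetical.

```python
import math

def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest to `vector`."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),  # Euclidean distance
    )
    return best_index, codebook[best_index]

# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
encoder_output = [0.9, 1.2]  # stands in for a continuous latent from an encoder

index, codeword = quantize(encoder_output, codebook)
```

The integer `index` is the discrete code; downstream clustering can operate on these codes or on the selected codewords instead of the raw high-dimensional expression values.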
Affiliation(s)
- Zheng Chen
- Graduate School of Engineering Science, Osaka University, Japan.
- Ziwei Yang
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
- Lingwei Zhu
- Department of Computing Science, University of Alberta, Canada
- Peng Gao
- Institute for Quantitative Biosciences, University of Tokyo, Japan
- Shigehiko Kanaya
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan; Data Science Center, Nara Institute of Science and Technology, Japan
- Md Altaf-Ul-Amin
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Japan
7
Ye X, Shi T, Cui Y, Sakurai T. Interactive gene identification for cancer subtyping based on multi-omics clustering. Methods 2023; 211:61-67. [PMID: 36804215 DOI: 10.1016/j.ymeth.2023.02.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 02/06/2023] [Accepted: 02/12/2023] [Indexed: 02/17/2023] Open
Abstract
Recent advances in multi-omics databases offer the opportunity to explore complex systems of cancers across hierarchical biological levels. Some methods have been proposed to identify the genes that play a vital role in disease development by integrating multi-omics data. However, the existing methods identify the related genes separately, neglecting the gene interactions that are related to multigenic disease. In this study, we develop a learning framework to identify the interactive genes based on multi-omics data including gene expression. Firstly, we integrate different omics based on their similarities and apply spectral clustering for cancer subtype identification. Then, a gene co-expression network is constructed for each cancer subtype. Finally, we detect the interactive genes in the co-expression network by learning the dense subgraphs based on the L1 properties of eigenvectors of the modularity matrix. We apply the proposed learning framework to a multi-omics cancer dataset to identify the interactive genes for each cancer subtype. The detected genes are examined with the DAVID and KEGG tools for systematic gene ontology enrichment analysis. The analysis results show that the detected genes are related to cancer development and that the genes in different cancer subtypes are related to different biological processes and pathways, which is expected to yield important references for understanding tumor heterogeneity and improving patient survival.
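The modularity matrix mentioned in this abstract has a standard definition for an undirected graph: B[i][j] = A[i][j] - k_i * k_j / (2m), where A is the adjacency matrix, k_i the degree of node i, and m the number of edges. The sketch below only constructs that matrix for a tiny, hypothetical co-expression graph; the paper's eigenvector-based dense-subgraph detection is not reproduced here.

```python
def modularity_matrix(adjacency):
    """B[i][j] = A[i][j] - k_i * k_j / (2m) for an undirected graph."""
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)  # sum of degrees equals twice the number of edges
    n = len(adjacency)
    return [
        [adjacency[i][j] - degrees[i] * degrees[j] / two_m for j in range(n)]
        for i in range(n)
    ]

# Hypothetical graph: genes 0-2 form a triangle, gene 3 is linked to gene 2 only.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

B = modularity_matrix(adjacency)
```

Positive entries mark gene pairs that are connected more strongly than expected from their degrees alone, which is why the leading eigenvectors of this matrix tend to highlight densely connected groups.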
Affiliation(s)
- Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan.
- Tianyi Shi
- Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba 3058577, Japan
- Yaxuan Cui
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan
- Tetsuya Sakurai
- Department of Computer Science, University of Tsukuba, Tsukuba 3058577, Japan; Tsukuba Life Science Innovation Program, University of Tsukuba, Tsukuba 3058577, Japan
8
Gao Z, Hong B, Li Y, Zhang X, Wu J, Wang C, Zhang X, Gong T, Zheng Y, Meng D, Li C. A semi-supervised multi-task learning framework for cancer classification with weak annotation in whole-slide images. Med Image Anal 2023; 83:102652. [PMID: 36327654 DOI: 10.1016/j.media.2022.102652] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/14/2021] [Revised: 09/15/2022] [Accepted: 10/08/2022] [Indexed: 11/06/2022]
Abstract
Cancer region detection (CRD) and subtyping are two fundamental tasks in digital pathology image analysis. The development of data-driven models for CRD and subtyping on whole-slide images (WSIs) would mitigate the burden on pathologists and improve their diagnostic accuracy. However, existing models face two major limitations. Firstly, they typically require large-scale datasets with precise annotations, which contradicts the original intention of reducing labor effort. Secondly, for the subtyping task, non-cancerous regions are treated the same as cancerous regions within a WSI, which confuses a subtyping model during training. To tackle the latter limitation, previous research proposed performing CRD first to rule out the non-cancerous regions and then training a subtyping model on the remaining cancerous patches. However, training the two models separately ignores the interaction between the two tasks and propagates the error of the CRD task to the subtyping task. To address these issues and concurrently improve the performance on both CRD and subtyping tasks, we propose a semi-supervised multi-task learning (MTL) framework for cancer classification. Our framework consists of a backbone feature extractor, two task-specific classifiers, and a weight control mechanism. The backbone feature extractor is shared by the two task-specific classifiers, such that the interaction of the CRD and subtyping tasks can be captured. The weight control mechanism preserves the sequential relationship of these two tasks and guarantees the error back-propagation from the subtyping task to the CRD task under the MTL framework. We train the overall framework in a semi-supervised setting, where datasets only involve small quantities of annotations produced by our minimal point-based (min-point) annotation strategy. Extensive experiments on four large datasets with different cancer types demonstrate the effectiveness of the proposed framework in both accuracy and generalization.
Affiliation(s)
- Zeyu Gao
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Bangyang Hong
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Yang Li
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Xianli Zhang
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Jialun Wu
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Chunbao Wang
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Department of Pathology, The First Affiliated Hospital of Xi'an Jiaotong University, Xi'an 710061, China
- Xiangrong Zhang
- School of Artificial Intelligence, Xidian University, Xi'an 710071, China
- Tieliang Gong
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Yefeng Zheng
- Tencent Jarvis Lab, Shenzhen, Guangdong 518075, China
- Deyu Meng
- School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an 710049, China
- Chen Li
- School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China; Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering, Xi'an Jiaotong University, Xi'an 710049, China.
9
Zhang Y, Kiryu H. MODEC: an unsupervised clustering method integrating omics data for identifying cancer subtypes. Brief Bioinform 2022; 23:6696139. [PMID: 36094092 DOI: 10.1093/bib/bbac372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Revised: 07/16/2022] [Accepted: 08/08/2022] [Indexed: 12/14/2022] Open
Abstract
The identification of cancer subtypes can help researchers understand hidden genomic mechanisms, enhance diagnostic accuracy and improve clinical treatments. With the development of high-throughput techniques, researchers can access large amounts of data from multiple sources. Because of the high dimensionality and complexity of multiomics and clinical data, research into the integration of multiomics data is needed, and developing effective tools for such purposes remains a challenge for researchers. In this work, we proposed an entirely unsupervised clustering method without harnessing any prior knowledge (MODEC). We used manifold optimization and deep-learning techniques to integrate multiomics data for the identification of cancer subtypes and the analysis of significant clinical variables. Since there is nonlinearity in the gene-level datasets, we used manifold optimization methodology to extract essential information from the original omics data to obtain a low-dimensional latent subspace. Then, MODEC uses a deep learning-based clustering module to iteratively define cluster centroids and assign cluster labels to each sample by minimizing the Kullback-Leibler divergence loss. MODEC was applied to six public cancer datasets from The Cancer Genome Atlas database and outperformed eight competing methods in terms of the accuracy and reliability of the subtyping results. MODEC was extremely competitive in the identification of survival patterns and significant clinical features, which could help doctors monitor disease progression and provide more suitable treatment strategies.
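The clustering module described above assigns labels by minimizing a Kullback-Leibler divergence loss. One common way to set this up is the Deep Embedded Clustering (DEC) formulation: soft assignments from a Student's t kernel, a sharpened target distribution, and KL(P || Q) as the loss. Whether MODEC uses exactly this formulation is an assumption; the sketch below shows a single evaluation of that loss on hypothetical 2-D latent points, with no gradient updates.

```python
import math

def soft_assign(points, centroids):
    """Student's t kernel similarities, normalised so each row sums to 1."""
    q = []
    for x in points:
        sims = [1.0 / (1.0 + math.dist(x, c) ** 2) for c in centroids]
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """Sharpen q: square each entry, divide by cluster frequency, renormalise."""
    freq = [sum(col) for col in zip(*q)]
    p = []
    for row in q:
        weights = [qij ** 2 / fj for qij, fj in zip(row, freq)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
    )

# Hypothetical latent points (two visible groups) and two centroids.
points = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(points, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
labels = [row.index(max(row)) for row in q]  # hard label = most likely cluster
```

In a full training loop the centroids (and the encoder producing the latent points) would be updated by gradient descent on `loss`, with `p` recomputed periodically; that loop is omitted here.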
Affiliation(s)
- Yanting Zhang
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
- Hisanori Kiryu
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
10
Ghareyazi A, Kazemi A, Hamidieh K, Dashti H, Tahaei MS, Rabiee HR, Alinejad-Rokny H, Dehzangi I. Pan-cancer integrative analysis of whole-genome De novo somatic point mutations reveals 17 cancer types. BMC Bioinformatics 2022; 23:298. [PMID: 35879674 PMCID: PMC9316662 DOI: 10.1186/s12859-022-04840-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2022] [Accepted: 07/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The advent of high throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, identifying many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they demonstrate many differences concerning their mutational profiles. Hence, there is no definitive treatment for most cancer types. This reveals the importance of developing new pipelines to identify cancer-associated genes accurately and re-classify patients with similar mutational profiles. Classification of cancer patients with similar mutational profiles may help discover subtypes of cancer patients who might benefit from specific treatment types. RESULTS In this study, we propose a new machine learning pipeline to identify protein-coding genes mutated in many samples to identify cancer subtypes. We apply our pipeline to 12,270 samples collected from the international cancer genome consortium, covering 19 cancer types. As a result, we identify 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates distinguishable properties, including unique cancer-related signaling pathways. CONCLUSIONS This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. Additionally, we analyze the mutational signatures for samples in each subtype, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and the Axon guidance pathways, are frequently observed in cancer disease. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study on "gene-motif" suggests the importance of considering both the context of the mutations and mutational processes in identifying cancer-associated genes. The source codes for our proposed clustering pipeline and analysis are publicly available at: https://github.com/bcb-sut/Pan-Cancer.
Affiliation(s)
- Amin Ghareyazi
- Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Amirreza Kazemi
- Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran; Department of Computer Engineering, Simon Fraser University, Burnaby, BC, 1S6, Canada
- Kimia Hamidieh
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 3H2, Canada
- Hamed Dashti
- Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Maedeh Sadat Tahaei
- Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Hamid R Rabiee
- Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Hamid Alinejad-Rokny
- BioMedical Machine Learning Lab (BML), The Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, NSW, 2052, Australia; UNSW Data Science Hub, The University of New South Wales (UNSW Sydney), Sydney, NSW, 2052, Australia; AI-Enabled Processes (AIP) Research Centre, Macquarie University, Sydney, 2109, Australia
- Iman Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, 08102, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA
11
Madhumita, Paul S. Capturing the latent space of an Autoencoder for multi-omics integration and cancer subtyping. Comput Biol Med 2022; 148:105832. [PMID: 35834966 DOI: 10.1016/j.compbiomed.2022.105832] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/16/2022] [Revised: 06/15/2022] [Accepted: 07/03/2022] [Indexed: 11/29/2022]
Abstract
BACKGROUND AND OBJECTIVE The motivation behind cancer subtyping is to identify subgroups of cancer patients with distinguishable phenotypes of clinical importance. It can assist in the advancement of subtype-targeted treatments. Subtype identification is a complicated task and therefore requires multi-omics data integration to identify the precise patient subgroups. Over the years, several computational attempts have been made to identify cancer subtypes accurately using integrative multi-omics analysis. Some studies have used Autoencoders (AE) to capture multi-omics feature integration in lower dimensions for identifying subtypes in specific types of cancer. However, capturing a highly informative latent space by learning deep AE architectures that attain a satisfactory generalized performance is still required. Therefore, in this study, a novel AE-assisted cancer subtyping framework is presented that utilizes the compressed latent space of a Sparse AE neural network for multi-omics clustering. METHODS The proposed framework first performs a supervised feature selection based on the survival status of the patients. The selected features from each of the omic data are passed to the AE. The information embedded in the latent space of the trained AE neural networks is then used for cancer subtyping using Spectral clustering. The AE architecture designed in this study exhaustively searches for the best compression of multi-omics data by varying the number of neurons in the hidden layers and penalizing activations within the layers. RESULTS AND CONCLUSION The proposed framework is applied to five different multi-omics cancer datasets taken from The Cancer Genome Atlas. It is observed that, to obtain a robust information bottleneck, a compression to 10-20% of the input features along with an L1 regularization penalty of 0.01 or 0.001 performs well for most of the cancer datasets. Clustering performed on this latent representation generates clusters with better silhouette scores and significantly varying survival patterns. For further biological assessment, differential expression analysis is performed between the identified subtypes of Glioblastoma multiforme (GBM), followed by enrichment analysis of the differentially expressed biomarkers. Several pathways and disease ontology terms coherent with GBM are found to be significantly associated. Varying responses of the identified GBM subtypes to the drug Temozolomide are also tested to demonstrate the clinical importance of the subtypes. Hence, the study shows that AE-assisted multi-omics integration can be used for the prediction of clinically significant cancer subtypes.
Affiliation(s)
- Madhumita
- Department of Bioscience and Bioengineering, Indian Institute of Technology, Jodhpur, 342037, Rajasthan, India
- Sushmita Paul
- Department of Bioscience and Bioengineering, Indian Institute of Technology, Jodhpur, 342037, Rajasthan, India; School of Artificial Intelligence and Data Science, Indian Institute of Technology, Jodhpur, 342037, Rajasthan, India
12
Song D, Lyu H, Feng Q, Luo J, Li L, Wang X. Subtyping of head and neck squamous cell cancers based on immune signatures. Int Immunopharmacol 2021; 99:108007. [PMID: 34332341 DOI: 10.1016/j.intimp.2021.108007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/01/2021] [Revised: 07/14/2021] [Accepted: 07/19/2021] [Indexed: 12/12/2022]
Abstract
Although head and neck squamous cell cancer (HNSCC) is one of the cancer types in which immune checkpoint inhibitors (ICIs) have achieved a certain success, only a subset of HNSCC patients respond to ICIs. Thus, identification of HNSCC subtypes responsive to ICIs is crucial. Using hierarchical clustering, we identified three subtypes of HNSCC, termed Immunity-H, Immunity-M, and Immunity-L, based on the enrichment scores of 28 immune cells generated by the single-sample gene-set enrichment analysis of transcriptome data. We demonstrated that this subtyping method was stable and reproducible in four different HNSCC cohorts. Immunity-H had the highest levels of immune infiltrates and PD-L1 expression, lowest levels of stemness, intratumor heterogeneity and genomic instability, and favorable prognosis. In contrast, Immunity-L had the lowest levels of immune infiltrates and PD-L1 expression, highest levels of stemness, intratumor heterogeneity and genomic instability, and unfavorable prognosis. We found that somatic copy number alteration had a significant negative association with anti-tumor immunity in HNSCC, while tumor mutation burden showed no significant association. TP53, COL11A1, NSD1, and PKHD1L1 were more frequently mutated in Immunity-H versus Immunity-L, and their mutations were associated with increased immune signatures in HNSCC. Besides immune-related pathways, many stromal and oncogenic pathways were highly enriched in Immunity-H, including cell adhesion molecules, focal adhesion, ECM-receptor interaction, calcium signaling, MAPK signaling, apoptosis, VEGF signaling, and PPAR signaling. The high levels of PD-L1 expression and immune infiltration in Immunity-H indicate that this subtype responds best to ICIs. Our study recaptures the immunological heterogeneity in HNSCC and provides clinical implications for the immunotherapy of HNSCC.
Affiliation(s)
- Dandan Song
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China
- Haoyu Lyu
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China
- Qiushi Feng
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China
- Jiangti Luo
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China
- Lin Li
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China
- Xiaosheng Wang
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Cancer Genomics Research Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, China; Big Data Research Institute, China Pharmaceutical University, Nanjing 211198, China.
13
Wen Y, Song X, Yan B, Yang X, Wu L, Leng D, He S, Bo X. Multi-dimensional data integration algorithm based on random walk with restart. BMC Bioinformatics 2021; 22:97. [PMID: 33639858 PMCID: PMC7912853 DOI: 10.1186/s12859-021-04029-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/14/2020] [Accepted: 02/15/2021] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND The accumulation of various multi-omics data and computational approaches for data integration can accelerate the development of precision medicine. However, algorithm development for multi-omics data integration remains a pressing challenge. RESULTS Here, we propose a multi-omics data integration algorithm based on random walk with restart (RWR) on a multiplex network. We call the resulting methodology Random Walk with Restart for multi-dimensional data Fusion (RWRF). RWRF uses similarity networks of samples as the basis for integration. It constructs a similarity network for each data type and then connects corresponding samples of the multiple similarity networks to create a multiplex sample network. By applying RWR on the multiplex network, RWRF uses the stationary probability distribution to fuse the similarity networks. We applied RWRF to The Cancer Genome Atlas (TCGA) data to identify subtypes in different cancer data sets. Three types of data (mRNA expression, DNA methylation, and microRNA expression data) are integrated and network clustering is conducted. Experimental results show that RWRF performs better than single data type analysis and previous integrative methods. CONCLUSIONS RWRF provides powerful support for users deciphering cancer molecular subtypes, and thus may benefit precision treatment of specific patients in clinical practice.
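The random walk with restart that this abstract builds on can be illustrated with a minimal sketch. This is not the RWRF implementation: the toy four-node network, the function name `random_walk_with_restart`, and the restart probability of 0.5 are assumptions chosen only to show how a stationary distribution is obtained by iteration. In RWRF the walk would run on the multiplex sample network rather than on a single small graph.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * e_seed until the update is below tol."""
    n = len(adjacency)
    # Column-normalise the adjacency matrix so each non-empty column sums to 1.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(transition[i][j] * p[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        # Stop once the L1 change between successive iterates is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: a path 0 - 1 - 2 - 3 with unit edge weights.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seed=0)
print(scores)
```

Nodes closer to the seed receive higher stationary probability, which is the property that lets such scores act as a similarity measure between samples.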
Affiliation(s)
- Yuqi Wen
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xinyu Song
- Department of Biomedical Engineering, Chinese PLA General Hospital, Beijing, 100853, People's Republic of China
- Bowei Yan
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xiaoxi Yang
- Experimental Center, Beijing Friendship Hospital, Capital Medical University, Beijing, 100069, People's Republic of China
- Lianlian Wu
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China; Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, 300072, People's Republic of China
- Dongjin Leng
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Song He
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
- Xiaochen Bo
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing, 100850, People's Republic of China
14
Zhu X, Shang J, Sun Y, Li F, Liu JX, Yuan S. PSO-CFDP: A Particle Swarm Optimization-Based Automatic Density Peaks Clustering Method for Cancer Subtyping. Hum Hered 2019; 84:9-20. [PMID: 31412348 DOI: 10.1159/000501481] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/08/2019] [Accepted: 06/13/2019] [Indexed: 12/27/2022] Open
Abstract
Cancer subtyping is of great importance for the prediction, diagnosis, and precise treatment of cancer patients. Many clustering methods have been proposed for cancer subtyping. In 2014, a clustering algorithm named Clustering by Fast Search and Find of Density Peaks (CFDP) was proposed and published in Science; it has since been applied to cancer subtyping and achieved attractive results. However, CFDP requires two key parameters (cluster centers and cutoff distance) to be set manually, and their optimal values are difficult to determine. To overcome this limitation, an automatic clustering method named PSO-CFDP is proposed in this paper, in which cluster centers and cutoff distance are automatically determined by running an improved particle swarm optimization (PSO) algorithm multiple times. Experiments using PSO-CFDP, as well as LR-CFDP, STClu, CH-CCFDAC, and CFDP, were performed on four benchmark datasets and two real cancer gene expression datasets. The results show that PSO-CFDP can determine cluster centers and cutoff distance automatically within controllable time/cost and, therefore, improve the accuracy of cancer subtyping.
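To make the two manually set parameters concrete, the following is a minimal sketch of plain density peaks clustering, not of PSO-CFDP itself. The function name `density_peaks`, the toy two-dimensional points, the cutoff of 1.0, and the rule of picking centers by the largest product of density and separation are illustrative assumptions; the PSO step that the paper uses to choose these values automatically is not shown.

```python
import math


def density_peaks(points, cutoff, n_centers):
    """Cluster points by local density (rho) and distance to denser points (delta)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points strictly closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff) for i in range(n)]
    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbour: use its largest distance.
            delta[i] = max(dist[i])
            continue
        nearest = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][nearest]
        parent[i] = nearest
    # Centers combine high density with large separation from denser points.
    centers = sorted(order, key=lambda i: rho[i] * delta[i], reverse=True)[:n_centers]
    labels = [-1] * n
    for cluster_id, i in enumerate(centers):
        labels[i] = cluster_id
    # Every other point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, rho, delta


# Hypothetical toy data: two well-separated groups of five points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
labels, rho, delta = density_peaks(points, cutoff=1.0, n_centers=2)
print(labels)
```

Changing `cutoff` or `n_centers` changes the partition, which is why an automatic search over these two values is attractive.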
Affiliation(s)
- Xuhui Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China; School of Statistics, Qufu Normal University, Qufu, China
- Yan Sun
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Feng Li
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Shasha Yuan
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
15
Mallavarapu T, Hao J, Kim Y, Oh JH, Kang M. Pathway-based deep clustering for molecular subtyping of cancer. Methods 2019; 173:24-31. [PMID: 31247294 DOI: 10.1016/j.ymeth.2019.06.017] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 03/11/2019] [Revised: 05/24/2019] [Accepted: 06/16/2019] [Indexed: 12/22/2022] Open
Abstract
Cancer is a genetic disease comprising multiple subtypes that have distinct molecular characteristics and clinical features. Cancer subtyping helps improve personalized treatment and decision making, as different cancer subtypes respond differently to treatment. The increasing availability of cancer-related genomic data provides the opportunity to identify molecular subtypes. Several unsupervised machine learning techniques have been applied to molecular data from tumor samples to identify cancer subtypes that are genetically and clinically distinct. However, most clustering methods often fail to cluster patients efficiently because of the challenges imposed by high-throughput genomic data and its non-linearity. In this paper, we propose a pathway-based deep clustering method (PACL) for molecular subtyping of cancer, which incorporates gene expression and a biological pathway database to group patients into cancer subtypes. The main contribution of our model is to discover high-level representations of biological data by learning complex hierarchical and nonlinear effects of pathways. We compared the performance of our model with a number of benchmark clustering methods that have recently been proposed for cancer subtyping. We assessed the hypothesis that clusters (subtypes) may be associated with different survival outcomes using logrank tests. PACL showed the lowest p-value of the logrank test among the benchmark methods. This demonstrates that the patient groups clustered by PACL may correspond to subtypes that are significantly associated with distinct survival distributions. Moreover, PACL provides a solution to comprehensively identify subtypes and interpret the model at the biological pathway level. The open-source software of PACL in PyTorch is publicly available at https://github.com/tmallava/PACL.
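The logrank comparison mentioned in this abstract can be sketched for the two-group case as follows. This is a generic textbook formulation, not code from PACL; the function name `logrank_test` and the survival times are made up for illustration, and a comparison of more than two clusters would need the multi-group form of the test.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group logrank test; returns (chi-square statistic, p-value with 1 df)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = sum(1 for s, _, _ in data if s >= t)
        at_risk_a = sum(1 for s, _, g in data if s >= t and g == 0)
        deaths = sum(1 for s, e, _ in data if s == t and e)
        deaths_a = sum(1 for s, e, g in data if s == t and e and g == 0)
        # Expected deaths in group A under the null hypothesis of equal hazards.
        observed_minus_expected += deaths_a - deaths * at_risk_a / at_risk
        if at_risk > 1:
            frac_a = at_risk_a / at_risk
            variance += deaths * frac_a * (1 - frac_a) * (at_risk - deaths) / (at_risk - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical survival times (all events observed) for two patient clusters.
short_survivors = [2, 3, 4, 5, 6]
long_survivors = [10, 12, 14, 16, 18]
all_observed = [1, 1, 1, 1, 1]
chi2, p_value = logrank_test(short_survivors, all_observed, long_survivors, all_observed)
chi2_same, p_same = logrank_test(short_survivors, all_observed, short_survivors, all_observed)
print(chi2, p_value)
```

A smaller p-value indicates stronger evidence that the clusters differ in survival, which is the sense in which the abstract reports the lowest p-value for PACL.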
Affiliation(s)
- Jie Hao
- Analytics and Data Science, Kennesaw State University, Kennesaw, USA
- Youngsoon Kim
- Department of Computer Science, Kennesaw State University, Marietta, USA
- Jung Hun Oh
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, USA
- Mingon Kang
- Analytics and Data Science, Kennesaw State University, Kennesaw, USA; Department of Computer Science, Kennesaw State University, Marietta, USA
16
Abstract
The application of next-generation sequencing in cancer genomics has allowed for a better understanding of the genetics and pathogenesis of cancer. Single-cell genomics is a relatively new field that has enhanced our current knowledge of the genetic diversity of cells involved in the complex biological systems of cancer. Single-cell genomics is a rapidly developing field, and current technologies can assay a single cell's gene expression, DNA variation, epigenetic state, and nuclear structure. Statistical and computational methods are central to single-cell genomics and allow for the extraction of meaningful information. The translational application of single-cell sequencing in precision cancer therapy has the potential to improve cancer diagnostics, prognostics, targeted therapy, early detection, and noninvasive monitoring. Furthermore, single-cell genomics will transform cancer research, as even initial experiments have revolutionized our current understanding of gene regulation and disease.
Affiliation(s)
- Pawan Noel
- Molecular Medicine Division, Translational Genomics Research Institute, Phoenix, AZ, USA
- Wei Lin
- Molecular Medicine Division, Translational Genomics Research Institute, Phoenix, AZ, USA
- Daniel D Von Hoff
- Mayo Clinic, Scottsdale, AZ, USA
- Molecular Medicine Division, Translational Genomics Research Institute, Phoenix, AZ, USA
- Haiyong Han
- Molecular Medicine Division, Translational Genomics Research Institute, Phoenix, AZ, USA
17
Abstract
Immunohistochemistry (IHC) can be applied to diagnostic aspects of pathologic examination to provide aid in assignment of lineage and histologic type of cancer. Increasingly, however, IHC is widely used to provide prognostic and predictive (theranostic) information about the neoplastic disease. A refinement of theranostic application of IHC can be seen in the use of "genomic probing" where antibody staining results are directly correlated with an underlying genetic alteration in the tumor (somatic mutations) and/or the patient (germline constitution). All these aspects of IHC find their best use in guiding the oncologists in the optimal use of therapy for the patients.
Affiliation(s)
- Semir Vranić
- College of Medicine, Qatar University, Doha, Qatar
18
Hu T, Chen S, Ullah A, Xue H. AluScanCNV2: An R package for copy number variation calling and cancer risk prediction with next-generation sequencing data. Genes Dis 2019; 6:43-6. [PMID: 30906832 DOI: 10.1016/j.gendis.2018.09.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/06/2018] [Accepted: 09/04/2018] [Indexed: 01/01/2023] Open
Abstract
The use of next-generation sequencing (NGS) to detect copy number variation (CNV) is widely accepted in cancer research. Building on the AluScanCNV software we developed previously, AluScanCNV2 has been developed in the present study as an R package that performs CNV detection from NGS data obtained through AluScan, whole-genome sequencing or other targeted NGS platforms. Its applications would include the expedited use of somatic CNVs for cancer subtyping, and the use of recurrent germline CNVs to perform machine learning-assisted prediction of a test subject's susceptibility to cancer.
19
Hu CW, Li H, Qutub AA. Shrinkage Clustering: a fast and size-constrained clustering algorithm for biomedical applications. BMC Bioinformatics 2018; 19:19. [PMID: 29361928 DOI: 10.1186/s12859-018-2022-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/20/2017] [Accepted: 01/10/2018] [Indexed: 12/02/2022] Open
Abstract
Background Many common clustering algorithms require a two-step process that limits their efficiency. The algorithms need to be performed repetitively and need to be implemented together with a model selection criterion. These two steps are needed in order to determine both the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets increase in size and prevalence, there is a growing need for new methods that are more convenient to implement and are more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size to make the clustering result meaningful and interpretable for subsequent analysis. Results We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performance across multiple simulated and actual datasets, and demonstrate its strength in accuracy and speed when applied to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Conclusions Given its ease of implementation, computing efficiency and extensible structure, Shrinkage Clustering can be applied broadly to solve biomedical clustering tasks, especially when dealing with large datasets.
20
Speicher NK, Pfeifer N. Towards Multiple Kernel Principal Component Analysis for Integrative Analysis of Tumor Samples. J Integr Bioinform 2017; 14:/j/jib.ahead-of-print/jib-2017-0019/jib-2017-0019.xml. [PMID: 28688226 PMCID: PMC6042822 DOI: 10.1515/jib-2017-0019] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/22/2017] [Accepted: 04/27/2017] [Indexed: 12/03/2022] Open
Abstract
Personalized treatment of patients based on tissue-specific cancer subtypes has strongly increased the efficacy of the chosen therapies. Even though the amount of data measured for cancer patients has increased over the last years, most cancer subtypes are still diagnosed based on individual data sources (e.g. gene expression data). We propose an unsupervised data integration method based on kernel principal component analysis. Principal component analysis is one of the most widely used techniques in data analysis. Unfortunately, the straightforward multiple kernel extension of this method leads to the use of only one of the input matrices, which does not fit the goal of gaining information from all data sources. Therefore, we present a scoring function to determine the impact of each input matrix. The approach enables visualizing the integrated data and subsequent clustering for cancer subtype identification. Due to the nature of the method, no hyperparameters have to be set. We apply the methodology to five different cancer data sets and demonstrate its advantages in terms of results and usability.