1. Li F, Sun Z, Liu JX, Shang J, Dai L, Liu X, Li Y. NESM: a network embedding method for tumor stratification by integrating multi-omics data. G3 GENES|GENOMES|GENETICS 2022; 12:6705238. [PMID: 36124952 PMCID: PMC9635646 DOI: 10.1093/g3journal/jkac243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Accepted: 08/30/2022] [Indexed: 11/23/2022]
Abstract
Tumor stratification plays an important role in cancer diagnosis and individualized treatment. Recent developments in high-throughput sequencing technologies have produced huge amounts of multi-omics data, making it possible to stratify cancer types using multiple molecular datasets. We introduce NESM, a Network Embedding method for tumor Stratification by integrating Multi-omics data. NESM pregroups the samples, integrates the gene features and somatic mutations corresponding to cancer types within each group to construct patient features, and then integrates all groups to obtain comprehensive patient information. The gene features contain network topology information because they are extracted by integrating deoxyribonucleic acid (DNA) methylation, messenger ribonucleic acid (mRNA) expression data, and protein–protein interactions through a network embedding method. On the one hand, a supervised learning method, Light Gradient Boosting Machine (LightGBM), is used to classify cancer types based on patient features. Compared with 3 other methods, NESM has the highest AUC in most cancer types. The average AUC for stratifying cancer types is 0.91, indicating that the patient features extracted by NESM are effective for tumor stratification. On the other hand, an unsupervised clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), is used to divide individual cancer types into subtypes. The vast majority of the subtypes identified by NESM are significantly associated with patient survival.
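The abstract names DBSCAN as the unsupervised step but does not describe it. The following is a minimal pure-Python sketch of the general DBSCAN procedure, not the authors' implementation; the function name `dbscan`, the toy points, and the values of `eps` and `min_pts` are illustrative assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (point i included).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels

# Toy patient features (hypothetical): two dense groups and one isolated sample.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_pts=3)
```

Unlike k-means, this procedure does not need the number of clusters in advance, and samples in sparse regions are left unassigned (label -1) rather than forced into a subtype.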
Affiliation(s)
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Zhensheng Sun
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Junliang Shang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Lingyun Dai
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Xikui Liu
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, Shandong 250031, China
- Yan Li
  - Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, Shandong 250031, China

2. Yang F, Zhou LQ, Yang HW, Wang YJ. Nine-gene signature and nomogram for predicting survival in patients with head and neck squamous cell carcinoma. Front Genet 2022; 13:927614. [PMID: 36092911 PMCID: PMC9449318 DOI: 10.3389/fgene.2022.927614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Accepted: 07/25/2022] [Indexed: 12/24/2022]
Abstract
Background: Head and neck squamous cell carcinomas (HNSCCs) are derived from the mucosal linings of the upper aerodigestive tract, salivary glands, thyroid, oropharynx, larynx, and hypopharynx. The present study aimed to identify novel genes and pathways underlying HNSCC. Despite advances in HNSCC research, diagnosis, and treatment, its incidence continues to rise, and the mortality of advanced HNSCC is expected to increase by 50%. Therefore, there is an urgent need for effective biomarkers to predict the prognosis of HNSCC patients and to guide personalized treatment.
Methods: HNSCC clinical and gene expression data were obtained from The Cancer Genome Atlas (TCGA) database. Intersection analysis was applied to the gene expression matrix of HNSCC patients from the TCGA database to extract TME-related genes. Differential gene expression analysis between HNSCC tissue samples and normal tissue samples was performed with R software. Then, HNSCC patients were categorized into clusters 1 and 2 via NMF. Next, TME-related prognostic genes (p < 0.05) were analyzed by univariate Cox regression analysis, LASSO Cox regression analysis, and multivariate Cox regression analysis. Finally, nine genes were selected to construct a prognostic risk model and a prognostic gene signature. We also established a nomogram using relevant clinical parameters and a risk score. Kaplan–Meier survival analysis, time-dependent receiver operating characteristic (ROC) analysis, decision curve analysis (DCA), and the concordance index (C-index) were used to assess the accuracy of the prognostic risk model and nomogram. Potential molecular mechanisms were revealed by gene set enrichment analysis (GSEA). Additionally, gene correlation analysis and immune cell correlation analysis were conducted to further enrich our results.
Results: A novel HNSCC prognostic model was established based on the nine genes (GTSE1, LRRN4CL, CRYAB, SHOX2, ASNS, KRT23, ANGPT2, HOXA9, and CARD11). The areas under the ROC curves (AUCs) (0.769, 0.841, and 0.816) in the TCGA whole set showed that the model effectively predicted the 1-, 3-, and 5-year overall survival (OS). Results of the Cox regression assessment confirmed the nine-gene signature as a reliable independent prognostic factor in HNSCC patients. The prognostic nomogram developed using multivariate Cox regression analysis showed a superior C-index over other clinical signatures. Also, the calibration curve showed a high level of concordance between the estimated OS and the observed OS, indicating that the nomogram can accurately estimate the 1-, 3-, and 5-year OS in HNSCC patients. GSEA revealed, to some extent, the immune- and tumor-linked cascades.
Conclusion: The TME-related nine-gene signature and nomogram can effectively improve the estimation of prognosis in patients with HNSCC.
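The concordance index used above to assess the risk model can be illustrated with a short sketch. This is the general Harrell's C-index definition in plain Python, not the study's code; the function name and the toy survival data are assumptions made for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable patient pairs ranked correctly.

    A pair (i, j) is comparable when patient i has the shorter follow-up time
    and an observed event (events[i] == 1). The pair is concordant when
    patient i also has the higher risk score; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical follow-up times (years), event flags (1 = death observed,
# 0 = censored), and model risk scores for four patients.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.5, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a higher C-index for the nomogram is read as better prognostic discrimination.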

3. Yuan L, Yang Z, Zhao J, Sun T, Hu C, Shen Z, Yu G. Pan-Cancer Bioinformatics Analysis of Gene UBE2C. Front Genet 2022; 13:893358. [PMID: 35571064 PMCID: PMC9091452 DOI: 10.3389/fgene.2022.893358] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/10/2022] [Accepted: 03/29/2022] [Indexed: 11/30/2022]
Abstract
Ubiquitin-Conjugating Enzyme E2 C (UBE2C) is a protein-coding gene. Disorders associated with UBE2C include methotrexate-related lymphatic hyperplasia and complement component 7 deficiency. The encoded protein is necessary for the destruction of mitotic cyclins and for cell cycle progression, and may be involved in cancer progression. In this paper, using public databases, we study the differential expression of UBE2C in various tumors and its relationship with prognosis, clinical features, immunity, methylation, etc.
Affiliation(s)
- Lin Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Zhenyu Yang
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Jing Zhao
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Tao Sun
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Chunyu Hu
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Zhen Shen
  - School of Computer and Software, Nanyang Institute of Technology, Nanyang, China
- Guanying Yu
  - Department of Gastrointestinal Surgery, Central Hospital Affiliated to Shandong First Medical University, Jinan, China
  - *Correspondence: Guanying Yu,

4. Vangimalla RR, Sreevalsan-Nair J. HCNM: Heterogeneous Correlation Network Model for Multi-level Integrative Study of Multi-omics Data for Cancer Subtype Prediction. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2021; 2021:1880-1886. [PMID: 34891654 DOI: 10.1109/embc46164.2021.9630781] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Integrative analysis of multi-omics data is important for biomedical applications, as it is required for a comprehensive understanding of biological function. Integrating multi-omics data serves multiple purposes, such as building an integrated data model, dimensionality reduction of omic features, and patient clustering. For oncological data, patient clustering is synonymous with cancer subtype prediction. However, there is a gap in combining some of the widely used integrative analyses to build more powerful tools. To bridge the gap, we propose a multi-level integration algorithm to identify a representative integrative subspace and use it for cancer subtype prediction. The three integrative approaches we implement on multi-omics features are (1) multivariate multiple (linear) regression of the features from a cohort of patients/samples, (2) network construction using different omics features, and (3) fusion of sample similarity networks across the features. We use a type of multilayer network, called a heterogeneous network, as a data model to transition between a network-free (NF) regression model and a network-based (NB) model, which uses correlation networks. The heterogeneous networks consist of intra- and inter-layer graphs. Our proposed heterogeneous correlation network model, HCNM, is central to our algorithm for gene ranking, integrative subspace identification, and tumor-specific subtype prediction. The genes of our representative integrative subspace have been enriched with gene ontology and found to exhibit significant gene-disease association (GDA) scores. The gene subspace, which is less than 5% of the total gene set of each genomic feature, is used with the NB fusion integrative model to predict sample subtypes. As the identified integrative subspace of the multi-omics data is less prone to noise, bias, and outliers, our experiments show that the subtypes in our results agree with previous benchmark studies and better separate patient cohorts with poor and good survival.
Clinical relevance: Finding significant cancer-specific genes and cancer subtypes is vital for early prognosis and personalized treatment, and therefore improves a patient's survival probability.

5. Song J, Peng W, Wang F. Identifying cancer patient subgroups by finding co-modules from the driver mutation profiles and downstream gene expression profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; PP:2863-2872. [PMID: 34415837 DOI: 10.1109/tcbb.2021.3106344] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 06/13/2023]
Abstract
Identifying cancer subtypes sheds new light on effective personalized cancer medicine, future therapeutic strategies, and minimizing treatment-related costs. Recently, many clustering methods have been proposed for categorizing cancer patients. However, these methods still fail to fully use prior biological knowledge in the model design process to improve precision and efficiency. It is acknowledged that a driver gene regulates its downstream genes in the network to perform a certain function. By analyzing known clinical cancer subtype data, we found some special co-pathways between the driver genes and the downstream genes in cancer patients of the same subgroup. Hence, we propose a novel model named DDCMNMF (Driver and Downstream gene Co-Module Assisted Multiple Non-negative Matrix Factorization model) that stratifies cancer subtypes by first identifying co-modules of driver genes and downstream genes. We applied our model to lung and breast cancer datasets and compared it with four other state-of-the-art models. The final results show that our model can identify cancer subtypes with high compactness and separateness and achieve a high degree of consistency with the known cancer subtypes. Survival time analysis further confirms the significant clinical characteristics of the cancer subgroups identified by our model.

6. Yu N, Wu MJ, Liu JX, Zheng CH, Xu Y. Correntropy-Based Hypergraph Regularized NMF for Clustering and Feature Selection on Multi-Cancer Integrated Data. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:3952-3963. [PMID: 32603306 DOI: 10.1109/tcyb.2020.3000799] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 06/11/2023]
Abstract
Non-negative matrix factorization (NMF) has become one of the most powerful methods for clustering and feature selection. However, the performance of the traditional NMF method severely degrades when the data contain noise and outliers or when the manifold structure of the data is not taken into account. In this article, a novel method called correntropy-based hypergraph regularized NMF (CHNMF) is proposed to solve the above problem. Specifically, we use correntropy instead of the Euclidean norm in the loss term of CHNMF, which improves the robustness of the algorithm. A hypergraph regularization term is also added to the objective function to explore the high-order geometric information among more sample points. Then, the half-quadratic (HQ) optimization technique is adopted to solve the complex optimization problem of CHNMF. Finally, extensive experimental results on multi-cancer integrated data indicate that the proposed CHNMF method is superior to other state-of-the-art methods for clustering and feature selection.
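Why swapping the Euclidean norm for correntropy helps with outliers can be shown on reconstruction residuals alone. The sketch below compares a squared-error loss with a Gaussian-kernel correntropy-induced loss; it is not the CHNMF objective itself, and the function names, the kernel width `sigma`, and the residual values are assumptions for illustration.

```python
import math

def squared_loss(residuals):
    """Sum of squared residuals: grows without bound as a residual grows."""
    return sum(e * e for e in residuals)

def correntropy_loss(residuals, sigma=1.0):
    """Correntropy-induced loss with a Gaussian kernel of width sigma.

    Each term lies in [0, 1], so one very large residual adds at most 1.
    """
    return sum(1.0 - math.exp(-(e * e) / (2.0 * sigma ** 2)) for e in residuals)

# Hypothetical reconstruction residuals, without and with one gross outlier.
clean = [0.1, -0.2, 0.1]
with_outlier = clean + [100.0]

squared_jump = squared_loss(with_outlier) - squared_loss(clean)
correntropy_jump = correntropy_loss(with_outlier) - correntropy_loss(clean)
```

The single outlier raises the squared loss by 10,000 but raises the correntropy-induced loss by about 1, so it cannot dominate the fit.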

7. Yuan L, Sun T, Zhao J, Shen Z. A Novel Computational Framework to Predict Disease-Related Copy Number Variations by Integrating Multiple Data Sources. Front Genet 2021; 12:696956. [PMID: 34267783 PMCID: PMC8276077 DOI: 10.3389/fgene.2021.696956] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/18/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022]
Abstract
Copy number variation (CNV) may contribute to the development of complex diseases. However, due to the complex mechanism of path association and the lack of sufficient samples, understanding the relationship between CNV and cancer remains a major challenge. The unprecedented abundance of CNV, gene, and disease label data provides us with an opportunity to design a new machine learning framework to predict potential disease-related CNVs. In this paper, we developed a novel machine learning approach, namely, IHI-BMLLR (Integrating Heterogeneous Information sources with Biweight Mid-correlation and L1-regularized Logistic Regression under stability selection), to predict the CNV-disease path associations by using a data set containing CNV, disease state labels, and gene data. CNVs, genes, and diseases are connected through edges and then constitute a biological association network. To construct a biological network, we first used a self-adaptive biweight mid-correlation (BM) formula to calculate correlation coefficients between CNVs and genes. Then, we used logistic regression with L1 penalty (LLR) function to detect genes related to disease. We added stability selection strategy, which can effectively reduce false positives, when using self-adaptive BM and LLR. Finally, a weighted path search algorithm was applied to find top D path associations and important CNVs. The experimental results on both simulation and prostate cancer data show that IHI-BMLLR is significantly better than two state-of-the-art CNV detection methods (i.e., CCRET and DPtest) under false-positive control. Furthermore, we applied IHI-BMLLR to prostate cancer data and found significant path associations. Three new cancer-related genes were discovered in the paths, and these genes need to be verified by biological research in the future.
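The correlation step can be illustrated with the standard biweight midcorrelation, a median-based alternative to Pearson correlation. This sketch is the textbook form and not the self-adaptive variant the abstract refers to, whose details are not given here; the function names and the toy vectors are assumptions.

```python
import math
from statistics import median

def _biweight_normalized(values):
    """Center on the median, down-weight distant points, and scale to unit norm."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    weighted = []
    for v in values:
        u = (v - med) / (9.0 * mad)
        # Points more than 9 MADs from the median get weight 0.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted.append((v - med) * w)
    norm = math.sqrt(sum(t * t for t in weighted))
    return [t / norm for t in weighted]

def bicor(x, y):
    """Biweight midcorrelation between two equal-length sequences."""
    xs = _biweight_normalized(x)
    ys = _biweight_normalized(y)
    return sum(a * b for a, b in zip(xs, ys))

# Hypothetical CNV and gene-expression measurements across five samples.
cnv = [1.0, 2.0, 3.0, 4.0, 5.0]
expr = [2.0, 4.0, 6.0, 8.0, 10.0]
r = bicor(cnv, expr)
```

Because the weights vanish for points far from the median, a single extreme measurement has much less influence on the coefficient than it would on a Pearson correlation.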
Affiliation(s)
- Lin Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Tao Sun
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Jing Zhao
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan, China
- Zhen Shen
  - School of Computer and Software, Nanyang Institute of Technology, Nanyang, China

8. Yuan L, Zhao J, Sun T, Shen Z. A machine learning framework that integrates multi-omics data predicts cancer-related LncRNAs. BMC Bioinformatics 2021; 22:332. [PMID: 34134612 PMCID: PMC8210375 DOI: 10.1186/s12859-021-04256-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/31/2021] [Accepted: 06/07/2021] [Indexed: 12/28/2022]
Abstract
BACKGROUND: Long non-coding RNAs (lncRNAs) are non-coding RNA molecules with transcripts longer than 200 nucleotides. LncRNAs have emerged as novel candidate biomarkers in cancer diagnosis and prognosis. However, it is difficult to discover the true association mechanism between lncRNAs and complex diseases. The unprecedented enrichment of multi-omics data and the rapid development of machine learning technology provide an opportunity to design a machine learning framework to study the relationship between lncRNAs and complex diseases.
RESULTS: In this article, we propose a new machine learning approach, LGDLDA (LncRNA-Gene-Disease association networks based LncRNA-Disease Association prediction), for predicting disease-related lncRNA associations based on multi-omics data, machine learning methods, and neural network neighborhood information aggregation. First, LGDLDA calculates the similarity matrices of lncRNAs, genes, and diseases. It calculates the similarity between lncRNAs from the lncRNA expression profile matrix, the lncRNA-miRNA interaction matrix, and the lncRNA-protein interaction matrix. The gene similarity matrix is obtained from the lncRNA-gene association matrix and the gene-disease association matrix, and the disease similarity matrix is obtained from the disease ontology, the disease-miRNA association matrix, and Gaussian interaction profile kernel similarity. Second, LGDLDA integrates the neighborhood information in the similarity matrices by using nonlinear feature learning with a neural network. Third, LGDLDA uses embedded node representations to approximate the observed matrices. Finally, LGDLDA ranks candidate lncRNA-disease pairs and then selects potential disease-related lncRNAs.
CONCLUSIONS: Compared with existing lncRNA-disease prediction methods, our proposed method takes more critical information into account and improves the performance of cancer-related lncRNA prediction. Experiments on randomly split data show that LGDLDA is more stable than IDHI-MIRW, NCPLDA, LncDisAP and NCPHLDA. The results on different simulated data sets show that LGDLDA can accurately and effectively predict disease-related lncRNAs. Furthermore, we applied the method to three real cancer data sets (gastric cancer, colorectal cancer, and breast cancer) to predict potential cancer-related lncRNAs.
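One of the similarity measures named above, the Gaussian interaction profile (GIP) kernel, is simple enough to sketch. The code below uses the commonly cited definition, with the bandwidth normalized by the mean squared norm of the profiles; whether LGDLDA uses exactly this normalization is an assumption, as are the function name and the toy association matrix.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of a 0/1 association matrix.

    profiles[i] is the interaction profile of entity i, for example a disease's
    known associations with each miRNA.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm  # bandwidth scaled to the data

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [math.exp(-gamma * sq_dist(profiles[i], profiles[j])) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical disease-by-miRNA association profiles for three diseases.
toy_profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = gip_kernel(toy_profiles)
```

Entities with identical association profiles get similarity 1, and similarity decays toward 0 as the profiles share fewer associations.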
Affiliation(s)
- Lin Yuan
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Daxue Road 3501, Jinan, 250353, Shandong, China
- Jing Zhao
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Daxue Road 3501, Jinan, 250353, Shandong, China
- Tao Sun
  - School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Daxue Road 3501, Jinan, 250353, Shandong, China
- Zhen Shen
  - School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, 473004, Henan, China

9. Wang W, Zhou Y, Cheng MT, Wang Y, Zheng CH, Xiong Y, Chen P, Ji Z, Wang B. Potential Pathogenic Genes Prioritization Based on Protein Domain Interaction Network Analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1026-1034. [PMID: 32248121 DOI: 10.1109/tcbb.2020.2983894] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/11/2023]
Abstract
Pathogenicity-related studies are of great importance in understanding the pathogenesis of complex diseases and improving the level of clinical medicine. This work proposes a bioinformatics scheme to analyze cancer-related gene mutations and to identify potential disease-associated genes from the protein domain-domain interaction network. Five measures based on the centrality-lethality principle were adopted to carry out potential correlation analysis and to prioritize the significance of genes. This method was further applied to KEGG pathway analysis, taking malignant melanoma as an example. The experimental results show that 25 domains can be found, and 18 of them have high potential to be pathogenically important in malignant melanoma. Finally, a web-based tool, named Human Cancer Related Domain Interaction Network Analyzer, was developed for prioritizing potential pathogenic genes in 26 types of human cancers, and the analysis results can be visualized and downloaded online.

10. Wang B, Mei C, Wang Y, Zhou Y, Cheng MT, Zheng CH, Wang L, Zhang J, Chen P, Xiong Y. Imbalance Data Processing Strategy for Protein Interaction Sites Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:985-994. [PMID: 31751283 DOI: 10.1109/tcbb.2019.2953908] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 06/10/2023]
Abstract
Protein-protein interactions play essential roles in various biological processes. Identifying protein interaction sites helps researchers understand life activities and is therefore helpful for drug design. However, the number of experimentally determined protein interaction sites is far smaller than the number of protein sites in protein-protein interactions or protein complexes. Therefore, the negative and positive samples are usually imbalanced, which is common but biases the prediction of protein interaction sites by computational approaches. In this work, we present three imbalanced-data processing strategies to reconstruct the original dataset, and then extract protein features from the evolutionary conservation of amino acids to build a predictor for identifying protein interaction sites. On a dataset with 10,430 surface residues but only 2,299 interface residues, the imbalanced-data processing strategies clearly reduce the prediction bias and therefore improve the prediction performance for protein interaction sites. The experimental results show that our prediction models achieve better prediction performance, such as a prediction accuracy of 0.758 and an F-measure of 0.737, which demonstrates the effectiveness of our method.
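The abstract does not say which three strategies were used. As a hedged illustration of the general idea, the sketch below shows random undersampling of the majority class, one common way to rebalance such data, together with the F-measure reported above. The function names, the seed, and the toy labels are assumptions and are not taken from the paper.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [samples[i] for i in kept], [labels[i] for i in kept]

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical residues: 3 interface (label 1) versus 9 non-interface (label 0).
toy_samples = list(range(12))
toy_labels = [1] * 3 + [0] * 9
balanced_samples, balanced_labels = undersample(toy_samples, toy_labels)
```

Accuracy alone can look good on imbalanced data when a model predicts only the majority class, which is why the F-measure is reported alongside it.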

11. Wang W, Zhang X, Dai DQ. DeFusion: a denoised network regularization framework for multi-omics integration. Brief Bioinform 2021; 22:6210063. [PMID: 33822879 DOI: 10.1093/bib/bbab057] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/31/2020] [Revised: 02/03/2021] [Accepted: 01/14/2020] [Indexed: 11/13/2022]
Abstract
With diverse types of omics data widely available, many computational methods have recently been developed to integrate these heterogeneous data, providing a comprehensive understanding of diseases and biological mechanisms. However, most of them hardly take noise effects into account. Data-specific patterns unique to individual data types also make it challenging to uncover consistent patterns and learn a compact representation of multi-omics data. Here we present a multi-omics integration method that considers these issues. We explicitly model the error term in data reconstruction and simultaneously consider noise effects and data-specific patterns. We use a denoised network regularization in which we build a fused network using a denoising procedure to suppress noise effects and data-specific patterns. The error term works together with the denoised network regularization to capture data-specific patterns. We solve the optimization problem via an inexact alternating minimization algorithm. A comparative simulation study shows the method's superiority at discovering common patterns among data types at three noise levels. Transcriptomics-and-epigenomics integration in seven cancer cohorts from The Cancer Genome Atlas demonstrates that the learned integrative representation, extracted in an unsupervised manner, can depict survival information. Specifically, in liver hepatocellular carcinoma, the learned integrative representation attains an average Harrell's C-index of 0.78 in 10 repetitions of 3-fold cross-validation for survival prediction, which far exceeds competing methods, and we discover an aggressive subtype in liver hepatocellular carcinoma with this latent representation, which is validated on an external dataset, GSE14520. We also show that DeFusion is applicable to the integration of other omics types.
Affiliation(s)
- Weiwen Wang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Xiwen Zhang
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China
- Dao-Qing Dai
  - Intelligent Data Center, School of Mathematics, Sun Yat-Sen University, Guangzhou, 510275, China

12. Feng J, Jiang L, Li S, Tang J, Wen L. Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification. Front Genet 2021; 12:647141. [PMID: 33747053 PMCID: PMC7969795 DOI: 10.3389/fgene.2021.647141] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/29/2020] [Accepted: 02/02/2021] [Indexed: 01/17/2023]
Abstract
The multiple sources of cancer determine its multiple causes, and the same cancer can be composed of many different subtypes. Identification of cancer subtypes is a key part of personalized cancer treatment and provides an important reference for clinical diagnosis and treatment. Some studies have shown that there are significant differences in the genetic and epigenetic profiles among different cancer subtypes during carcinogenesis and development. In this study, we first collect seven cancer datasets from the Broad Institute GDAC Firehose, including gene expression profile, isoform expression profile, DNA methylation expression data, and survival information correspondingly. Furthermore, we employ kernel principal component analysis (PCA) to extract features for each expression profile, convert them into three similarity kernel matrices by Gaussian kernel function, and then fuse these matrices as a global kernel matrix. Finally, we apply it to spectral clustering algorithm to get the clustering results of different cancer subtypes. In the experimental results, besides using the P-value from the Cox regression model and survival analysis as the primary evaluation measures, we also introduce statistical indicators such as Rand index (RI) and adjusted RI (ARI) to verify the performance of clustering. Then combining with gene expression profile, we obtain the differential expression of genes among different subtypes by gene set enrichment analysis. For lung cancer, GMPS, EPHA10, C10orf54, and MAGEA6 are highly expressed in different subtypes; for liver cancer, CMYA5, DEPDC6, FAU, VPS24, RCBTB2, LOC100133469, and SLC35B4 are significantly expressed in different subtypes.
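The Rand index (RI) and adjusted RI (ARI) mentioned as clustering evaluation measures can be computed directly from two label vectors. The sketch below follows the standard pair-counting definitions; the function names and the toy labelings are assumptions for illustration.

```python
from collections import Counter
from math import comb

def rand_index(a, b):
    """Fraction of sample pairs on which two clusterings agree."""
    n = len(a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            # Agreement: both clusterings join the pair, or both separate it.
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / comb(n, 2)

def adjusted_rand_index(a, b):
    """Rand index corrected for the agreement expected by chance."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings put everything together
    return (sum_ij - expected) / (max_index - expected)

# Hypothetical subtype labels: predicted clusters versus reference subtypes.
predicted = [0, 0, 1, 1]
reference = [1, 1, 0, 0]
ri = rand_index(predicted, reference)
ari = adjusted_rand_index(predicted, reference)
```

Both measures ignore the cluster names themselves, so a clustering that matches the reference up to relabeling still scores 1.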
Affiliation(s)
- Jie Feng
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Limin Jiang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Shuhao Li
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jijun Tang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; School of Computational Science and Engineering, University of South Carolina, Columbia, SC, United States; Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
- Lan Wen
  - Changsha Municipal Center of Disease Control, Changsha, China

13. Zhao L, Yan H. MCNF: A Novel Method for Cancer Subtyping by Integrating Multi-Omics and Clinical Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1682-1690. [PMID: 30990192 DOI: 10.1109/tcbb.2019.2910515] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/09/2023]
Abstract
In the age of personalized medicine, there is a great need to classify cancer (from the same organ site) into homogeneous subtypes. Recent technological advances in genome-wide molecular profiling have made it possible to profile multiple molecular datasets to characterize the genomic changes in various cancer types. This paper focuses on how to take full advantage of the availability of these omics data and how to integrate these molecular data with patient clinical data for a more systematic subtyping of cancer. We propose a new method called Molecular and Clinical Networks Fusion (MCNF) to classify cancer into homogeneous subtypes. Our method has two highlights: first, it can integrate both numerical and non-numerical data into the fused network; second, it is unsupervised, which means it can automatically determine the optimal number of clusters.

14. Damgacioglu H, Celik E, Celik N. Intra-Cluster Distance Minimization in DNA Methylation Analysis Using an Advanced Tabu-Based Iterative k-Medoids Clustering Algorithm (T-CLUST). IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1241-1252. [PMID: 30530337 DOI: 10.1109/tcbb.2018.2886006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/09/2023]
Abstract
Recent advances in DNA methylation profiling have paved the way for understanding the underlying epigenetic mechanisms of various diseases such as cancer. While conventional distance-based clustering algorithms (e.g., hierarchical and k-means clustering) have been heavily used in such profiling owing to their speed in high-throughput analysis, these methods commonly converge to suboptimal solutions and/or trivial clusters due to their greedy search nature. Hence, methodologies are needed to improve the quality of the clusters formed by these algorithms without sacrificing their speed. In this study, we introduce three related algorithms for a complete high-throughput methylation analysis: a variance-based dimension reduction algorithm to handle the high dimensionality of the data, an outlier detection algorithm to identify outliers in the data, and an advanced Tabu-based iterative k-medoids clustering algorithm (T-CLUST) to reduce the impact of initial solutions on the performance of the conventional k-medoids algorithm. The performance of the proposed algorithms is demonstrated on nine different real DNA methylation datasets obtained from the Gene Expression Omnibus DataSets database. The accuracy of the cluster identification obtained by our proposed algorithms is higher than that of hierarchical and k-means clustering, as well as the conventional methods. The algorithms are implemented in MATLAB, and available at: http://www.coe.miami.edu/simlab/tclust.html.
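As a rough illustration of two of the steps named in this abstract, the sketch below applies a variance-based feature filter and then a plain alternating k-medoids. It is not T-CLUST: the Tabu search and the outlier-detection step are omitted, and the function names, the `keep` parameter, the Manhattan distance, the farthest-first initialisation and the toy beta values are all assumptions made for illustration.

```python
def top_variance_features(rows, keep):
    """Return the indices of the `keep` columns with the highest variance."""
    n, n_cols = len(rows), len(rows[0])
    variances = []
    for j in range(n_cols):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:keep])


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(rows, medoids):
    """Label each row with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: manhattan(r, rows[medoids[c]]))
            for r in rows]


def k_medoids(rows, k, max_iter=100):
    """Plain alternating k-medoids (assumed baseline, no Tabu search)."""
    n = len(rows)
    # Deterministic start: the most central sample, then farthest-first.
    medoids = [min(range(n), key=lambda i: sum(manhattan(rows[i], r) for r in rows))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(manhattan(rows[i], rows[m]) for m in medoids)))
    for _ in range(max_iter):
        labels = assign(rows, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members,
                               key=lambda i: sum(manhattan(rows[i], rows[j]) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(rows, medoids), medoids


# Toy methylation-like values: columns 0 and 2 separate two groups, 1 and 3 are flat.
data = [
    [0.10, 0.50, 0.12, 0.50],
    [0.12, 0.50, 0.10, 0.51],
    [0.11, 0.51, 0.13, 0.50],
    [0.90, 0.50, 0.88, 0.50],
    [0.88, 0.51, 0.91, 0.50],
    [0.91, 0.50, 0.90, 0.51],
]
cols = top_variance_features(data, keep=2)
reduced = [[r[j] for j in cols] for r in data]
labels, medoids = k_medoids(reduced, k=2)
```

The deterministic initialisation is used here only so that the toy run is repeatable; the sensitivity of k-medoids to its starting medoids is the issue the abstract says T-CLUST is designed to reduce.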
|
15
|
Hu F, Zhou Y, Wang Q, Yang Z, Shi Y, Chi Q. Gene Expression Classification of Lung Adenocarcinoma into Molecular Subtypes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1187-1197. [PMID: 30892233 DOI: 10.1109/tcbb.2019.2905553] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 06/09/2023]
Abstract
As one of the most common malignancies in the world, lung adenocarcinoma (LUAD) is currently difficult to cure. However, the advent of precision medicine provides an opportunity to improve the treatment of lung cancer. Subtyping lung cancer plays an important role in performing a specific treatment. Here, we developed a framework that combines k-means clustering, t-test, sensitivity analysis, self-organizing map (SOM) neural network, and hierarchical clustering methods to classify LUAD into four subtypes. We determined that 24 differentially expressed genes could be used as therapeutic targets, and five genes (i.e., RTKN2, ADAM6, SPINK1, COL3A1, and COL1A2) could be potential novel markers for LUAD. Multivariate analysis showed that the four subtypes could serve as prognostic subtypes. Representative genes of each subtype were also identified, which could be potentially targetable markers for the different subtypes. The function and pathway enrichment analyses of these representative genes showed that the four subtypes have different pathological mechanisms. Mutations associated with the subtypes, e.g., epidermal growth factor receptor (EGFR) mutations in subtype 4 and tumor protein p53 (TP53) mutations in subtypes 1 and 2, could serve as potential markers for drug development. The four subtypes provide a foundation for subtype-specific therapy of LUAD.
|
16
|
Hou MX, Gao YL, Liu JX, Shang J, Zhu R, Yuan SS. A new method for mining information of co-expression network based on multi-cancers integrated data. BMC Med Genomics 2019; 12:155. [PMID: 31888692 PMCID: PMC6936053 DOI: 10.1186/s12920-019-0608-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/14/2019] [Accepted: 10/23/2019] [Indexed: 12/23/2022] Open
Abstract
Background: Gene co-expression networks are a useful tool for revealing the nature of disease. With the development of cancer research, building gene co-expression networks from cancer data has become a hot topic. However, node measurement methods and node mining strategies for multi-cancer network construction remain limited. Methods: In this paper, we introduce a new method, named PMN, for mining information from co-expression networks based on integrated multi-cancer data. We construct the network by combining different types of relevance measures (linear and nonlinear rules) for different nodes, based on integrated gene expression data of multiple cancers from The Cancer Genome Atlas (TCGA). For mining genes, we combine different properties (local and global characteristics) of the nodes. Results: We uncover additional suspected abnormally expressed genes and pathways shared by different cancers. We also recover some previously proven genes and pathways; other suspicious factors and molecules still need clinical validation. Conclusions: The results demonstrate that our method is effective in mining co-expressed genes across multiple cancers.
Affiliation(s)
- Mi-Xiao Hou
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Ying-Lian Gao
- Qufu Normal University Library, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China
- Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Rong Zhu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Sha-Sha Yuan
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
|
17
|
Wang Y, Mei C, Zhou Y, Wang Y, Zheng C, Zhen X, Xiong Y, Chen P, Zhang J, Wang B. Semi-supervised prediction of protein interaction sites from unlabeled sample information. BMC Bioinformatics 2019; 20:699. [PMID: 31874616 PMCID: PMC6929468 DOI: 10.1186/s12859-019-3274-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/30/2022] Open
Abstract
Background: The recognition of protein interaction sites is of great significance in many biological processes, signaling pathways and drug design. However, most sites on protein sequences cannot be defined as interface or non-interface sites because only a small fraction of protein interactions have been identified, which limits the prediction accuracy and generalization ability of predictors of protein interaction sites. It is therefore necessary to improve the prediction of protein interaction sites by using large amounts of unlabeled data together with small amounts of labeled data and background knowledge. Results: In this work, three semi-supervised support vector machine-based methods are proposed to improve protein interaction site prediction by incorporating information from unlabeled protein sites. Five features related to the evolutionary conservation of amino acids are extracted from the HSSP database and the ConSurf Server, i.e., residue spatial sequence spectrum, residue sequence information entropy and relative entropy, residue sequence conserved weight and residual base evolution rate, to represent the residues within the protein sequence. Three predictors are then built to identify interface residues on the protein surface using three types of semi-supervised support vector machine algorithms. Conclusion: The experimental results demonstrate that the semi-supervised approaches can effectively improve the prediction of protein interaction sites when unlabeled information is incorporated into the predictors; the best of them achieves an accuracy of 70.7%, a sensitivity of 62.67% and a specificity of 78.72%. Compared with existing studies, the semi-supervised models show improved prediction performance.
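The accuracy, sensitivity and specificity quoted in this abstract are the usual confusion-matrix ratios. A minimal sketch of how they are computed follows; the function name and the counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # all correct calls over all calls
        "sensitivity": tp / (tp + fn),      # recall on true interface sites
        "specificity": tn / (tn + fp),      # recall on true non-interface sites
    }


# Hypothetical counts for illustration only; not results from the paper.
metrics = binary_metrics(tp=60, fp=20, tn=80, fn=40)
```

With these made-up counts the three ratios are 140/200, 60/100 and 80/100, which shows how a predictor can have noticeably lower sensitivity than specificity at a given accuracy.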
Affiliation(s)
- Ye Wang
- School of Electrical and Information Engineering, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Changqing Mei
- School of Electrical and Information Engineering, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Yuming Zhou
- School of Electrical and Information Engineering, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Yan Wang
- School of Electrical and Information Engineering, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Chunhou Zheng
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, 230601, Anhui, China
- Xiao Zhen
- School of Computer Science and Technology, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Yan Xiong
- School of Computer Science and Technology, University of Science & Technology, Hefei, 230026, Anhui, China
- Peng Chen
- Institute of Health Sciences, Anhui University, Hefei, 230601, Anhui, China
- Jun Zhang
- College of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China
- Bing Wang
- School of Electrical and Information Engineering, Anhui University of Technology, Maanshan, 243002, Anhui, China
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, 230601, Anhui, China
|
18
|
Cui Z, Gao YL, Liu JX, Dai LY, Yuan SS. L2,1-GRMF: an improved graph regularized matrix factorization method to predict drug-target interactions. BMC Bioinformatics 2019; 20:287. [PMID: 31182006 PMCID: PMC6557743 DOI: 10.1186/s12859-019-2768-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/23/2023] Open
Abstract
Background: Identifying drug-target interactions is time-consuming and expensive, so the accuracy of computational prediction methods matters. Many algorithms predict interactions globally, and some of them use drug-target networks for prediction (i.e., a bipartite graph of drugs and targets known to interact). Although these algorithms can predict drug-target interactions to some extent, they have little effect for new drugs or targets that have no known interactions. Results: Since the datasets usually lie on or near low-dimensional nonlinear manifolds, we propose an improved GRMF (graph regularized matrix factorization) method to learn these manifolds in combination with the previous matrix-decomposition method. In addition, we use one of the previously proposed pre-processing steps to improve the accuracy of the prediction. Conclusions: Cross-validation is used to evaluate our method, and simulation experiments are used to predict new interactions. In most cases, our method is superior to other methods. Finally, some examples of new drugs and new targets are predicted by performing simulation experiments, and the improved GRMF method can better predict the remaining drug-target interactions.
Affiliation(s)
- Zhen Cui
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China
- Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Sha-Sha Yuan
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
|
19
|
Jiang L, Xiao Y, Ding Y, Tang J, Guo F. Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data. Front Genet 2019; 10:20. [PMID: 30804977 PMCID: PMC6370730 DOI: 10.3389/fgene.2019.00020] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 10/30/2018] [Accepted: 01/15/2019] [Indexed: 01/03/2023] Open
Abstract
Discovering cancer subtypes is useful for guiding the clinical treatment of multiple cancers. Advances in tissue profiling technologies have accumulated diverse types of data. Based on these types of expression data, various computational methods have been proposed to predict cancer subtypes, and it is crucial to study how to better integrate these multiple data profiles. In this paper, we collect multiple data profiles for five cancers from The Cancer Genome Atlas (TCGA). We then construct three similarity kernels for all patients with the same cancer from gene expression, miRNA expression and isoform expression data. We also propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), to integrate the three similarity kernels into one combined kernel. Finally, we apply spectral clustering to the integrated kernel to predict cancer subtypes. In the experiments, P-values from the Cox regression model and survival curve analysis are used to evaluate the predicted subtypes on three datasets. Our kernel fusion method, SKF, performs better than single-kernel and other multiple kernel fusion strategies, which shows that it can identify more accurate subtypes across various kinds of cancer. Our cancer subtype prediction method can identify essential genes and biomarkers for disease diagnosis and prognosis, and we also discuss the possible side effects of therapies and treatment.
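The pipeline described in this abstract (one similarity kernel per data type, a fused kernel, then spectral clustering) can be sketched as follows. This is only an illustration under stated assumptions: the fusion step here is an unweighted mean and not the SKF update rule, the kernels are Gaussian with an arbitrary `gamma`, the clustering is a two-way split on the second eigenvector of the normalized Laplacian, and the three "views" are synthetic stand-ins for the gene, miRNA and isoform expression matrices.

```python
import numpy as np


def rbf_kernel(x, gamma=1.0):
    """Gaussian similarity between all pairs of rows (patients) of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-gamma * d2)


def fuse_kernels(kernels):
    """Stand-in fusion: unweighted mean of the kernels (NOT the SKF rule)."""
    return np.mean(kernels, axis=0)


def spectral_bipartition(kernel):
    """Two-way split by the sign of the second eigenvector of the
    symmetric normalized Laplacian built from the fused kernel."""
    deg = kernel.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(len(kernel)) - d_inv_sqrt[:, None] * kernel * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues are returned in ascending order
    return (vecs[:, 1] > 0).astype(int)


# Synthetic data: 6 patients in two groups, observed through 3 noisy views.
rng = np.random.default_rng(0)
group = np.array([0, 0, 0, 1, 1, 1])
views = [group[:, None] * 3.0 + rng.normal(scale=0.1, size=(6, 4)) for _ in range(3)]

fused = fuse_kernels([rbf_kernel(v, gamma=0.1) for v in views])
labels = spectral_bipartition(fused)
```

For more than two subtypes one would normally embed the patients with several eigenvectors and cluster the embedding, which is omitted here to keep the sketch short.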
Affiliation(s)
- Limin Jiang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yongkang Xiao
- School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
|
20
|
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086 PMCID: PMC6410075 DOI: 10.3390/genes10020087] [Citation(s) in RCA: 157] [Impact Index Per Article: 31.4] [Received: 12/02/2018] [Revised: 01/08/2019] [Accepted: 01/21/2019] [Indexed: 12/11/2022] Open
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Affiliation(s)
- Bilal Mirza
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Wei Wang
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Jie Wang
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Howard Choi
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Neo Christopher Chung
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
- Peipei Ping
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA.
|
21
|
Vasudevan P, Murugesan T. Cancer Subtype Discovery Using Prognosis-Enhanced Neural Network Classifier in Multigenomic Data. Technol Cancer Res Treat 2018; 17:1533033818790509. [PMID: 30092720 PMCID: PMC6088521 DOI: 10.1177/1533033818790509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 01/03/2023] Open
Abstract
Objective: The main objective in studying large-scale cancer omics is to identify molecular mechanisms of cancer and discover novel biomedical targets. This work not only discovers cancer subtypes in genome-scale data by using clustering and classification but also measures their accuracy. Methods: Initially, candidate cancer subtypes are recognized by max-flow/min-cut graph clustering. Finally, a prognosis-enhanced neural network classifier is proposed for classification. We analyzed the heterogeneity and identified the subtypes of glioblastoma multiforme, an aggressive adult brain tumor, from 215 samples with microRNA expression (12 042 genes). The samples were classified into 4 classes, namely the mesenchymal, classical, proneural, and neural subtypes, owing to mutations and gene expression. The results are measured using metrics such as silhouette width, biological stability index, clustering accuracy, precision, recall, and f-measure. Results: Max-flow/min-cut clustering produces a higher clustering accuracy of 88.93% for 215 samples. The proposed prognosis-enhanced neural network classifier efficiently produces a higher accuracy of 89.2% for 215 samples. Conclusion: The experimental results suggest that the proposed prognosis-enhanced neural network classifier is a promising alternative for cancer subtype prediction in genome-scale data.
Affiliation(s)
- Thangamani Murugesan
- Department of Computer Science and Engineering, Kongu Engineering College, Perundurai, Tamilnadu, India
|
22
|
Hou MX, Gao YL, Liu JX, Dai LY, Kong XZ, Shang J. Network analysis based on low-rank method for mining information on integrated data of multi-cancers. Comput Biol Chem 2018; 78:468-473. [PMID: 30563751 DOI: 10.1016/j.compbiolchem.2018.11.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/25/2018] [Revised: 11/30/2018] [Accepted: 11/30/2018] [Indexed: 02/01/2023]
Abstract
Noise in cancer sequencing data is a problem that cannot be ignored. Finding a suitable way to reduce the noise in these data is an important issue in the analysis of gene co-expression networks. In this paper, we apply a sparse and low-rank method, Robust Principal Component Analysis (RPCA), to address the noise problem in integrated multi-cancer data from The Cancer Genome Atlas (TCGA). We then build the gene co-expression network from the integrated data after noise reduction. Finally, we perform node and pathway mining on the denoised networks. The experiments show that after denoising by RPCA, the gene expression data tend to be more orderly than before, and the constructed networks contain more pathway enrichment information than those built from unprocessed data. Moreover, using the betweenness centrality of the nodes in the denoised network, we find some abnormally expressed genes and pathways that have been proven to be associated with many cancers. The experimental results indicate that our method is reasonable and effective, and we also find some candidate suspicious genes that may be linked to multiple cancers.
Affiliation(s)
- Mi-Xiao Hou
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Ying-Lian Gao
- Library of Qufu Normal University, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China.
- Ling-Yun Dai
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Xiang-Zhen Kong
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Junliang Shang
- School of Information Science and Engineering, Qufu Normal University, Rizhao, China
|
23
|
Parimbelli E, Marini S, Sacchi L, Bellazzi R. Patient similarity for precision medicine: A systematic review. J Biomed Inform 2018; 83:87-96. [PMID: 29864490 DOI: 10.1016/j.jbi.2018.06.001] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 01/31/2018] [Revised: 05/16/2018] [Accepted: 06/01/2018] [Indexed: 12/19/2022]
Abstract
Evidence-based medicine is the most prevalent paradigm adopted by physicians. Clinical practice guidelines typically define a set of recommendations together with eligibility criteria that restrict their applicability to a specific group of patients. The ever-growing size and availability of health-related data is currently challenging the broad definitions of guideline-defined patient groups. Precision medicine leverages genetic, phenotypic, or psychosocial characteristics to provide precise identification of patient subsets for treatment targeting. Defining a patient similarity measure is thus an essential step to allow stratification of patients into clinically meaningful subgroups. The present review investigates the use of patient similarity as a tool to enable precision medicine. 279 articles were analyzed along four dimensions: data types considered, clinical domains of application, data analysis methods, and translational stage of findings. Cancer-related research employing molecular profiling and standard data analysis techniques such as clustering constitutes the majority of the retrieved studies. Chronic and psychiatric diseases follow as the second most represented clinical domains. Interestingly, almost one quarter of the studies analyzed presented a novel methodology, with the most advanced employing data integration strategies and being portable to different clinical domains. Integration of such techniques into decision support systems constitutes an interesting trend for future research.
Affiliation(s)
- E Parimbelli
- Telfer School of Management, University of Ottawa, Ottawa, Canada; Interdepartmental Centre for Health Technologies, University of Pavia, Italy.
- S Marini
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA; Interdepartmental Centre for Health Technologies, University of Pavia, Italy
- L Sacchi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy; Interdepartmental Centre for Health Technologies, University of Pavia, Italy
- R Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy; Interdepartmental Centre for Health Technologies, University of Pavia, Italy; RCCS ICS Maugeri, Pavia, Italy
|
24
|
Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget 2017; 8:109646-109660. [PMID: 29312636 PMCID: PMC5752549 DOI: 10.18632/oncotarget.22762] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 08/31/2017] [Accepted: 10/29/2017] [Indexed: 12/15/2022] Open
Abstract
Because tumors are seriously harmful to human health, effective diagnostic measures are urgently needed for tumor therapy. Early detection of tumors is particularly important for better treatment of patients. A notable issue is how to effectively discriminate tumor samples from normal ones. Many classification methods, such as Support Vector Machines (SVMs), have been proposed for tumor classification. Recently, deep learning has achieved satisfactory performance in classification tasks in many areas. However, deep learning is rarely applied to tumor classification because gene expression data provide too few training samples. In this paper, a Sample Expansion method is proposed to address the problem. Inspired by the idea of the Denoising Autoencoder (DAE), a large number of samples are obtained by randomly cleaning partially corrupted input many times. The expanded samples not only maintain the merits of corrupted data in DAE but also, to a certain extent, address the problem of insufficient training samples of gene expression data. Since Stacked Autoencoder (SAE) and Convolutional Neural Network (CNN) models show excellent performance in classification tasks, the applicability of SAE and 1-dimensional CNN (1DCNN) to gene expression data is analyzed. Finally, two deep learning models, Sample Expansion-Based SAE (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN), are designed to carry out tumor gene expression data classification using the expanded samples. Experimental studies indicate that SESAE and SE1DCNN are very effective in tumor classification.
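One way to picture the sample-expansion idea in this abstract is the masking corruption commonly used with denoising autoencoders: each expression profile is copied several times, and in each copy a random subset of features is set to zero. The sketch below follows that reading; the function name, the `copies` and `corruption` parameters, the choice of masking noise and the decision to keep the original sample are assumptions, and the paper's exact procedure may differ.

```python
import random


def expand_samples(samples, labels, copies=5, corruption=0.2, seed=0):
    """Return the originals plus `copies` randomly masked versions of each sample.

    Each feature of a copy is zeroed independently with probability `corruption`
    (masking noise); the class label is carried over unchanged.
    """
    rng = random.Random(seed)
    out_x, out_y = [], []
    for x, y in zip(samples, labels):
        out_x.append(list(x))  # keep the uncorrupted sample as well (assumption)
        out_y.append(y)
        for _ in range(copies):
            out_x.append([0.0 if rng.random() < corruption else v for v in x])
            out_y.append(y)
    return out_x, out_y


# Toy expression profiles; values and labels are made up for illustration.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = ["tumor", "normal"]
expanded_x, expanded_y = expand_samples(x, y, copies=3, corruption=0.5)
```

Here two profiles become eight training rows, each either the original or a partially zeroed copy with the same label, which is the sense in which the training set is "expanded" before a model such as an SAE or 1D CNN is fitted.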
Affiliation(s)
- Jian Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Xuesong Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Yuhu Cheng
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Lin Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
|