1
van Lingen HJ, Suarez-Diez M, Saccenti E. Normalization of gene counts affects principal components-based exploratory analysis of RNA-sequencing data. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2024; 1867:195058. [PMID: 39154857 DOI: 10.1016/j.bbagrm.2024.195058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 07/25/2024] [Accepted: 08/09/2024] [Indexed: 08/20/2024]
Abstract
Normalization of gene expression count data is an essential step in the analysis of RNA-sequencing data. Its statistical analysis has been mostly addressed in the context of differential expression analysis, that is, in the univariate setting. However, relationships among genes and samples are better explored and quantified using multivariate exploratory data analysis tools like Principal Component Analysis (PCA). In this study we investigate how normalization impacts the PCA model and its interpretation, considering twelve widely used normalization methods that were applied to simulated and experimental data. Correlation patterns in the normalized data were explored using both summary statistics and Covariance Simultaneous Component Analysis. The impact of normalization on the PCA solution was assessed by exploring the model complexity, the quality of sample clustering in the low-dimensional PCA space, and gene ranking in the model fit to normalized data. PCA models upon normalization were interpreted in the context of gene enrichment pathway analysis. We found that although PCA score plots are often similar independently of the normalization used, biological interpretation of the models can depend heavily on the normalization method applied.
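The dependence of a PCA model on the preceding normalization can be illustrated with a small sketch. The code below is not the authors' pipeline: the toy counts, the two normalizations compared (log of raw counts and log of counts per million), and the helper names `cpm` and `pca` are all assumptions chosen for illustration. It fits PCA by SVD and compares which genes rank highest on the first component.

```python
import numpy as np

def cpm(counts):
    """Scale each sample (column) of a genes x samples matrix to counts per million."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def pca(x, n_components=2):
    """PCA of a genes x samples matrix via SVD of the centred samples x genes matrix."""
    data = x.T
    centred = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]      # one row per sample
    loadings = vt[:n_components].T                       # one row per gene
    explained = s[:n_components] ** 2 / np.sum(s ** 2)   # variance ratio per component
    return scores, loadings, explained

# Toy data (made up): 200 genes, 6 samples, the last three sequenced 3x deeper.
rng = np.random.default_rng(0)
gene_means = rng.gamma(shape=2.0, scale=50.0, size=(200, 1))
depth = np.array([1, 1, 1, 3, 3, 3])
counts = rng.poisson(gene_means * depth).astype(float)

normalized = {
    "log raw": np.log2(counts + 1),
    "log CPM": np.log2(cpm(counts) + 1),
}
results = {}
for name, data in normalized.items():
    scores, loadings, explained = pca(data)
    top_genes = np.argsort(-np.abs(loadings[:, 0]))[:10]  # genes ranked by |PC1 loading|
    results[name] = set(top_genes.tolist())
    print(name, "PC1 explained variance:", round(float(explained[0]), 3))

shared = results["log raw"] & results["log CPM"]
print("top-10 PC1 genes shared between normalizations:", len(shared))
```

Comparing the top-ranked genes rather than only the score plot mirrors the abstract's point that gene ranking, and hence any enrichment analysis built on it, may change with the normalization.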
Affiliation(s)
- Henk J van Lingen
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, the Netherlands
- Maria Suarez-Diez
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, the Netherlands
- Edoardo Saccenti
  - Laboratory of Systems and Synthetic Biology, Wageningen University & Research, the Netherlands.
2
Düz E, Çakır T. Effect of RNA-Seq data normalization on protein interactome mapping for Alzheimer's disease. Comput Biol Chem 2024; 109:108028. [PMID: 38377697 DOI: 10.1016/j.compbiolchem.2024.108028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 02/01/2024] [Accepted: 02/04/2024] [Indexed: 02/22/2024]
Abstract
High-throughput RNA sequencing brings a new perspective to the elucidation of molecular mechanisms of diseases. Normalization is the first and most important step for RNA-Seq data, and it can differ based on the purpose of the analysis. Within-sample normalization methods (e.g., TPM) are preferred when genes in a sample are compared with each other, and between-sample normalization methods (e.g., DESeq2, TMM, voom) are used when the samples in a dataset are compared. Normalization approaches rescale the data and therefore affect the results of the analysis. Here, we selected the two most commonly used Alzheimer's disease RNA-Seq datasets, from the ROSMAP and Mayo Clinic cohorts, and mapped the differentially expressed genes onto the human protein interactome to discover disease-specific subnetworks. To this end, the raw count data were first processed with four commonly used RNA-Seq normalization methods (DESeq2, TMM, voom and TPM). Then, covariate adjustment was applied to the normalized data for gender, age at death and post-mortem interval. Each normalized dataset was separately mapped onto the human protein-protein interaction network in either covariate-adjusted or non-adjusted form. Normalization methods were compared by how well the discovered subnetworks captured known Alzheimer's disease genes and genes associated with disease-related functional terms. Based on our results, applying covariate adjustment has a positive effect on normalization by removing confounder effects. The covariate-adjusted TMM and covariate-adjusted DESeq2 methods performed better in both transcriptome datasets.
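The distinction between within-sample and between-sample normalization can be sketched in a few lines. This is a minimal illustration of the published formulas for TPM and for median-of-ratios size factors (the idea behind DESeq2's default), not a reimplementation of any package; the toy counts, gene lengths and function names are assumptions.

```python
import numpy as np

def tpm(counts, lengths_kb):
    """Within-sample: divide by gene length, then rescale each sample to one million."""
    rate = counts / lengths_kb[:, None]
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def size_factors(counts):
    """Between-sample: median ratio of each sample to the per-gene geometric mean.

    Genes with a zero count in any sample are skipped, since their log is undefined.
    """
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Toy genes x samples matrix (made up): sample 2 is sequenced twice as deep.
counts = np.array([[10.0, 20.0],
                   [100.0, 200.0],
                   [30.0, 60.0],
                   [0.0, 5.0]])
lengths_kb = np.array([1.0, 2.0, 0.5, 1.5])   # hypothetical gene lengths in kilobases

within = tpm(counts, lengths_kb)    # comparable across genes inside one sample
sf = size_factors(counts)
between = counts / sf               # comparable across samples for one gene
print("size factors:", sf)
```

TPM corrects for gene length, so it suits gene-to-gene comparisons within a sample, whereas the size factors only remove a per-sample scale and leave gene length untouched.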
Affiliation(s)
- Elif Düz
  - Department of Bioengineering, Gebze Technical University, Gebze, Kocaeli, 41400, Turkey
- Tunahan Çakır
  - Department of Bioengineering, Gebze Technical University, Gebze, Kocaeli, 41400, Turkey.
3
O'Connell GC. Variability in donor leukocyte counts confound the use of common RNA sequencing data normalization strategies in transcriptomic biomarker studies performed with whole blood. Sci Rep 2023; 13:15514. [PMID: 37726353 PMCID: PMC10509252 DOI: 10.1038/s41598-023-41443-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 08/26/2023] [Indexed: 09/21/2023] [Open Access]
Abstract
Gene expression data generated from whole blood via next generation sequencing is frequently used in studies aimed at identifying mRNA-based biomarker panels with utility for diagnosis or monitoring of human disease. These investigations often employ data normalization techniques more typically used for analysis of data originating from solid tissues, which largely operate under the general assumption that specimens have similar transcriptome composition. However, this assumption may be violated when working with data generated from whole blood, which is more cellularly dynamic, leading to potential confounds. In this study, we used next generation sequencing in combination with flow cytometry to assess the influence of donor leukocyte counts on the transcriptional composition of whole blood specimens sampled from a cohort of 138 human subjects, and then examined the effect of four frequently used data normalization approaches on our ability to detect inter-specimen biological variance, using the flow cytometry data to benchmark each specimen's true cellular and molecular identity. Whole blood samples originating from donors with differing leukocyte counts exhibited dramatic differences in both genome-wide distributions of transcript abundance and gene-level expression patterns. Consequently, three of the normalization strategies we tested, including median ratio (MRN), trimmed mean of m-values (TMM), and quantile normalization, noticeably masked the true biological structure of the data and impaired our ability to detect true inter-specimen differences in mRNA levels. The only strategy that improved our ability to detect true biological variance was simple scaling of read counts by sequencing depth, which, unlike the aforementioned approaches, makes no assumptions regarding transcriptome composition.
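Why a composition-assuming method can mask a real shift is easiest to see with quantile normalization, which forces every sample to share one distribution. The sketch below uses made-up abundances in which half of the genes are genuinely four times higher in the second sample; the data and the function names are assumptions and do not reproduce the study's analysis.

```python
import numpy as np

def depth_scale(counts):
    """Scale each sample (column) by its total, expressed per million."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def quantile_normalize(x):
    """Replace each value by the mean of the values holding the same rank across samples."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each gene within its sample
    reference = np.sort(x, axis=0).mean(axis=1)         # shared reference distribution
    return reference[ranks]

# Toy data (made up): sample B has a truly different transcriptome composition.
rng = np.random.default_rng(1)
sample_a = rng.gamma(shape=2.0, scale=100.0, size=500)
sample_b = sample_a.copy()
sample_b[:250] *= 4.0
x = np.column_stack([sample_a, sample_b])

scaled = depth_scale(x)
qn = quantile_normalize(x)

for name, data in {"depth-scaled": scaled, "quantile": qn}.items():
    upper_quartiles = np.percentile(data, 75, axis=0)
    print(name, "75th percentile per sample:", np.round(upper_quartiles, 1))
```

After quantile normalization the two samples have identical percentiles by construction, so the global difference in distribution is no longer visible, while depth scaling leaves the shape of each sample's distribution intact.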
Affiliation(s)
- Grant C O'Connell
  - Molecular Biomarker Core, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH, 44106-4904, USA.
  - School of Nursing, Case Western Reserve University, Cleveland, OH, USA.
4
Livesey M, Rossouw SC, Blignaut R, Christoffels A, Bendou H. Transforming RNA-Seq gene expression to track cancer progression in the multi-stage early to advanced-stage cancer development. PLoS One 2023; 18:e0284458. [PMID: 37093793 PMCID: PMC10124877 DOI: 10.1371/journal.pone.0284458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2022] [Accepted: 03/31/2023] [Indexed: 04/25/2023] [Open Access]
Abstract
BACKGROUND Cancer progression can be tracked by gene expression changes that occur throughout early-stage to advanced-stage cancer development. The accumulated genetic changes can be detected when gene expression levels in advanced-stage samples are less variable but show high variability in early-stage samples. Normalizing advanced-stage expression samples with early-stage samples and clustering the normalized expression samples can reveal cancers with similar or different progression and provide insight into clinical and phenotypic patterns of patient samples within the same cancer. OBJECTIVE This study aims to investigate cancer progression through RNA-Seq expression profiles across the multi-stage process of cancer development. METHODS RNA-sequenced gene expression data for Diffuse Large B-cell Lymphoma, Lung cancer, Liver cancer, Cervical cancer, and Testicular cancer were downloaded from the UCSC Xena database. Advanced-stage samples were normalized with early-stage samples to account for heterogeneity differences in the multi-stage cancer progression. WGCNA was used to build a gene network and to categorize normalized genes into different modules. A gene set enrichment analysis selected key gene modules related to cancer. The diagnostic capacity of the modules was evaluated after hierarchical clustering. RESULTS Unnormalized RNA-Seq gene expression failed to segregate advanced-stage samples based on selected cancer cohorts. Normalization with early-stage samples revealed the true heterogeneous gene expression that accumulates across the multi-stage cancer progression; this resulted in well-segregated cancer samples. Cancer-specific pathways were enriched in the normalized WGCNA modules. The normalization method was further able to stratify patient samples based on phenotypic and clinical information. Additionally, the method allowed for patient survival analysis, with the Cox regression model selecting gene MAP4K1 in cervical cancer and Kaplan-Meier analysis confirming that upregulation is favourable. CONCLUSION The application of the normalization method further enhanced the accuracy of clustering of cancer samples based on how they progressed. Additionally, genes responsible for cancer progression were discovered.
Affiliation(s)
- Michelle Livesey
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Sophia Catherine Rossouw
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Renette Blignaut
  - Department of Statistics and Population Studies, University of the Western Cape, Cape Town, South Africa
- Alan Christoffels
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Hocine Bendou
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
5
Yang W, Zhao P, Cao P, Miao C, Ji X, Gao Y, Li P, Cheng J. Global interpretation of novel alternative splicing events in human congenital pulmonary airway malformations: A pilot study. J Cell Biochem 2022; 123:736-745. [PMID: 35064685 DOI: 10.1002/jcb.30216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2021] [Revised: 01/04/2022] [Accepted: 01/06/2022] [Indexed: 11/08/2022]
Affiliation(s)
- Weili Yang
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Pu Zhao
  - Department of Neonatology Shaanxi Provincial People's Hospital Xi'an Shaanxi China
- Ping Cao
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Chunlin Miao
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Xiang Ji
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Ya Gao
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Peng Li
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
- Jiwen Cheng
  - Department of Pediatric Surgery The Second Affiliated Hospital of Xi'an Jiaotong University Xi'an Shaanxi China
6
Smail MA, Wu X, Henkel ND, Eby HM, Herman JP, McCullumsmith RE, Shukla R. Similarities and dissimilarities between psychiatric cluster disorders. Mol Psychiatry 2021; 26:4853-4863. [PMID: 33504954 PMCID: PMC8313609 DOI: 10.1038/s41380-021-01030-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/10/2020] [Revised: 12/30/2020] [Accepted: 01/12/2021] [Indexed: 01/16/2023]
Abstract
The common molecular mechanisms underlying psychiatric disorders are not well understood. Prior attempts to assess the pathological mechanisms responsible for psychiatric disorders have been limited by biased selection of comparable disorders, dataset and cohort availability, and challenges with data normalization. Here, using DisGeNET, a gene-disease associations database, we sought to expand such investigations in terms of number and types of diseases. In a top-down manner, we analyzed an unbiased cluster of 36 psychiatric disorders and comorbid conditions at biological pathway, cell-type, drug-target, and chromosome levels and deployed the density index, a novel metric to quantify similarities (close to 1) and dissimilarities (close to 0) between these disorders at each level. At the pathway level, we show that cognition and neurotransmission drive the similarity and are involved across all disorders, whereas immune-system and signal-response coupling (cell surface receptors, signal transduction, gene expression, and metabolic process) drive the dissimilarity and are involved with specific disorders. The analysis at the drug-target level supports the involvement of neurotransmission-related changes across these disorders. At the cell-type level, dendrite-targeting interneurons, across all layers, are most involved. Finally, by matching the clustering pattern at each level of analysis, we showed that the similarity between the disorders is influenced most at the chromosomal level and to some extent at the cellular level. Together, these findings provide first insights into distinct cellular and molecular pathologies and druggable mechanisms associated with several psychiatric disorders and comorbid conditions, and demonstrate that similarities between these disorders originate at the chromosome level and disperse in a bottom-up manner at cellular and pathway levels.
Affiliation(s)
- Marissa A Smail
  - Department of Pharmacology and Systems Physiology, University of Cincinnati, Cincinnati, OH, USA
  - Neuroscience Graduate Program, University of Cincinnati, Cincinnati, OH, USA
- Xiaojun Wu
  - Department of Neurosciences, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
- Nicholas D Henkel
  - Department of Neurosciences, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
- Hunter M Eby
  - Department of Neurosciences, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
- James P Herman
  - Department of Pharmacology and Systems Physiology, University of Cincinnati, Cincinnati, OH, USA
  - Veterans Affairs Medical Center, Cincinnati, OH, USA
  - Department of Neurology, University of Cincinnati, Cincinnati, OH, USA
- Robert E McCullumsmith
  - Department of Neurosciences, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
  - Neurosciences Institute, ProMedica, Toledo, OH, USA
- Rammohan Shukla
  - Department of Neurosciences, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA.
7
Lin JK, Chien TW, Wang LY, Chou W. An artificial neural network model to predict the mortality of COVID-19 patients using routine blood samples at the time of hospital admission: Development and validation study. Medicine (Baltimore) 2021; 100:e26532. [PMID: 34260529 PMCID: PMC8284724 DOI: 10.1097/md.0000000000026532] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/28/2020] [Revised: 06/14/2021] [Accepted: 06/15/2021] [Indexed: 01/08/2023] [Open Access]
Abstract
Background: In a pandemic situation (e.g., COVID-19), the most important issue is to select patients at risk of high mortality at an early stage and to provide appropriate treatments. However, few studies have applied a model to predict in-hospital mortality using routine blood samples at the time of hospital admission. This study aimed to develop an app, named the predict the mortality of COVID-19 patients (PMCP) app, to predict the mortality of COVID-19 patients at hospital-admission time. Methods: We downloaded patient records from 2 studies, including 361 COVID-19 patients in Wuhan, China, and 106 COVID-19 patients in 3 Korean medical institutions. A total of 30 feature variables were retrieved, consisting of 28 blood biomarkers and 2 demographic variables (i.e., age and gender) of patients. Two models, namely, artificial neural network (ANN) and convolutional neural network (CNN), were compared with each other across 2 scenarios using [...]. An app for predicting the mortality of COVID-19 patients was developed using the model's estimated parameters for the prediction and classification of PMCP at an earlier stage. Feature variables and prediction results were visualized using the forest plot and category probability curves shown on Google Maps. Results: We observed that [...]. Conclusions: Our new PMCP app with the ANN model accurately predicts the mortality probability for COVID-19 patients. It is publicly available and aims to help health care providers fight COVID-19 and improve patients’ classifications against treatment risk.
Affiliation(s)
- Ju-Kuo Lin
  - Department of Ophthalmology, Chi-Mei Medical Center, Yong Kang, Tainan City, Taiwan
  - Department of Optometry, Chung Hwa University of Medical Technology, Jen-Teh, Tainan City, Taiwan
- Tsair-Wei Chien
  - Department of Medical Research, Chi-Mei Medical Center, Tainan, Taiwan
- Lin-Yen Wang
  - Department of Pediatrics, Chi-Mei Medical Center, Tainan, Taiwan
  - Department of Childhood Education and Nursery, Chia Nan University of Pharmacy and Science, Tainan, Taiwan
  - School of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Willy Chou
  - Department of Physical Medicine and Rehabilitation, Chung San Medical University Hospital, Taichung, Taiwan
  - Department of Physical Medicine and Rehabilitation, Chi Mei Medical Center, Tainan, Taiwan
8
Li J, Jiang W, Han H, Liu J, Liu B, Wang Y. ScGSLC: An unsupervised graph similarity learning framework for single-cell RNA-seq data clustering. Comput Biol Chem 2020; 90:107415. [PMID: 33307360 DOI: 10.1016/j.compbiolchem.2020.107415] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/28/2020] [Revised: 09/30/2020] [Accepted: 10/06/2020] [Indexed: 01/18/2023]
Abstract
Accurate clustering of cells from single-cell RNA sequencing (scRNA-seq) data is an essential step for biological analysis such as putative cell type identification. However, scRNA-seq data have high dimension and high sparsity, which makes traditional clustering methods less effective at reflecting the similarity between cells. Since the genetic network fundamentally defines the functions of a cell and deep learning shows strong advantages in network representation learning, we propose a novel scRNA-seq clustering framework, ScGSLC, based on graph similarity learning. ScGSLC integrates scRNA-seq data and a protein-protein interaction network into a graph. A graph convolution network is then employed by ScGSLC to embed the graph and to cluster the cells by the calculated similarity between graphs. Unsupervised clustering results on nine public data sets demonstrate that ScGSLC shows better performance than state-of-the-art methods.
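The idea of embedding a cell as a graph and comparing cells by embedding similarity can be sketched as follows. This is not ScGSLC's actual architecture, which the abstract does not specify: the layer below is the generic symmetric-normalized graph convolution, the weights are random rather than trained, and the four-gene interaction network, expression values, mean pooling and cosine similarity are all illustrative assumptions.

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical protein-protein interaction network over 4 genes (gene 1 is a hub).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)

rng = np.random.default_rng(0)
weights = rng.normal(size=(1, 8))    # untrained weights, for illustration only

# Each cell becomes the same graph with its own expression values as node features.
cell_a = np.array([[5.0], [0.0], [3.0], [1.0]])
cell_b = np.array([[4.0], [0.0], [2.0], [1.0]])

emb_a = gcn_layer(adj, cell_a, weights).mean(axis=0)   # mean-pool nodes into a graph embedding
emb_b = gcn_layer(adj, cell_b, weights).mean(axis=0)
similarity = cosine(emb_a, emb_b)
print("similarity between the two cell graphs:", round(similarity, 3))
```

Note how the zero count of the hub gene is smoothed by its neighbours in the convolution, which is one plausible reason a network prior helps with sparse scRNA-seq data.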
Affiliation(s)
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
- Wei Jiang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Henry Han
  - Department of Computer and Information Science, Fordham University, New York, NY 10023, USA; School of Computer Science, Qinghai Normal University, Xining 810008, China
- Jing Liu
  - South China Institute for Stem Cell Biology and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, Guangdong 510530, China
- Bo Liu
  - Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China; Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China.
9
Interpretable Log Contrasts for the Classification of Health Biomarkers: a New Approach to Balance Selection. mSystems 2020; 5:e00230-19. [PMID: 32265314 PMCID: PMC7141889 DOI: 10.1128/msystems.00230-19] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 02/06/2023] [Open Access]
Abstract
Since the turn of the century, technological advances have made it possible to obtain the molecular profile of any tissue in a cost-effective manner. Among these advances are sophisticated high-throughput assays that measure the relative abundances of microorganisms, RNA molecules, and metabolites. While these data are most often collected to gain new insights into biological systems, they can also be used as biomarkers to create clinically useful diagnostic classifiers. How best to classify high-dimensional -omics data remains an area of active research. However, few approaches explicitly model the relative nature of these data; most instead rely on cumbersome normalizations. This report (i) emphasizes the relative nature of health biomarkers, (ii) discusses the literature surrounding the classification of relative data, and (iii) benchmarks how different transformations perform for regularized logistic regression across multiple biomarker types. We show how an interpretable set of log contrasts, called balances, can prepare data for classification. We propose a simple procedure, called discriminative balance analysis, to select groups of 2 and 3 bacteria that can together discriminate between experimental conditions. Discriminative balance analysis is a fast, accurate, and interpretable alternative to data normalization.
IMPORTANCE High-throughput sequencing provides an easy and cost-effective way to measure the relative abundance of bacteria in any environmental or biological sample. When these samples come from humans, the microbiome signatures can act as biomarkers for disease prediction. However, because bacterial abundance is measured as a composition, the data have unique properties that make conventional analyses inappropriate. To overcome this, analysts often use cumbersome normalizations. This article proposes an alternative method that identifies pairs and trios of bacteria whose stoichiometric presence can differentiate between diseased and nondiseased samples. By using interpretable log contrasts called balances, we developed an entirely normalization-free classification procedure that reduces the feature space and improves the interpretability, without sacrificing classifier performance.
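A balance is a scaled log ratio of geometric means of two groups of parts, which is why it needs no normalization: multiplying a sample by any constant leaves it unchanged. The sketch below computes a two-part and a three-part balance on made-up abundances; the data and the function name `balance` are assumptions, it assumes strictly positive values (zeros would need replacing first), and the selection procedure of discriminative balance analysis itself is not reproduced.

```python
import numpy as np

def balance(x, num, den):
    """Balance between parts `num` and `den` for each row (sample) of x.

    b = sqrt(r*s / (r+s)) * (mean log of numerator parts - mean log of denominator parts),
    where r and s are the sizes of the two groups. Requires strictly positive entries.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    r, s = len(num), len(den)
    scale = np.sqrt(r * s / (r + s))
    return scale * (log_x[..., num].mean(axis=-1) - log_x[..., den].mean(axis=-1))

# Toy abundances (made up): rows are samples, columns are three taxa.
# Row 1 is row 0 sequenced ten times shallower, so the composition is identical.
abundances = np.array([[120.0, 30.0, 50.0],
                       [12.0, 3.0, 5.0],
                       [40.0, 40.0, 20.0]])

pair = balance(abundances, num=[0], den=[1])      # taxon 0 versus taxon 1
trio = balance(abundances, num=[0], den=[1, 2])   # taxon 0 versus taxa 1 and 2
print("pair balances:", np.round(pair, 3))
print("trio balances:", np.round(trio, 3))
```

The resulting balances could then be fed as features to a classifier such as regularized logistic regression, without any prior rescaling of the samples.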
10
Abstract
PURPOSE OF REVIEW Comprehensive analyses of the genome, transcriptome, proteome and metabolome are instrumental in identifying biomarkers of disease and in gaining insight into mechanisms underlying the development of cardiovascular disease, and show promise for better stratifying patients according to disease subtypes. This review highlights recent 'omics' studies, including integration of multiple 'omics', that have advanced mechanistic understanding and diagnosis in humans and animal models. RECENT FINDINGS Transcriptome-based discovery continues to be a primary method to obtain data for hypothesis generation, and the understanding of disease pathogenesis has been enhanced by single cell-based methods capable of revealing heterogeneity in cellular responses. Advances in proteome coverage and quantitation of individual protein species, together with enhanced methods for detecting posttranslational modifications, have improved discovery of protein-based biomarkers. SUMMARY High-throughput assays capable of quantitating the vast majority of any particular type of biomolecule within a tissue sample, isolated cells or plasma are now available. In order to make best use of the large amount of data that can be generated on given molecule types, as well as their interrelationships in disease, continued development of pattern-recognition algorithms ('machine learning') will be required, and the subclassification of disease that is made possible by such algorithms is likely to inform clinical practice, and vice versa.
11
Hou MX, Gao YL, Liu JX, Dai LY, Kong XZ, Shang J. Network analysis based on low-rank method for mining information on integrated data of multi-cancers. Comput Biol Chem 2018; 78:468-473. [PMID: 30563751 DOI: 10.1016/j.compbiolchem.2018.11.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 10/25/2018] [Revised: 11/30/2018] [Accepted: 11/30/2018] [Indexed: 02/01/2023]
Abstract
Noise in cancer sequencing data is a problem that cannot be ignored. Finding a suitable way to reduce noise in these data is an important issue in the analysis of gene co-expression networks. In this paper, we apply a sparse and low-rank method, Robust Principal Component Analysis (RPCA), to address the noise problem for integrated multi-cancer data from The Cancer Genome Atlas (TCGA). We then build the gene co-expression network from the integrated data after noise reduction. Finally, we mine nodes and pathways in the denoised networks. Experiments in this paper show that after denoising by RPCA, the gene expression data tend to be more orderly than before, and the constructed networks contain more pathway enrichment information than those built from unprocessed data. Moreover, using the betweenness centrality of the nodes in the denoised network, we find some abnormally expressed genes and pathways that have been shown to be associated with many cancers. The experimental results indicate that our method is reasonable and effective, and we also find some candidate genes that may be linked to multiple cancers.
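RPCA splits a data matrix into a low-rank part plus a sparse part. The abstract does not say which solver the authors used, so the sketch below is one common choice, an ADMM scheme for principal component pursuit with the usual default parameters; the toy matrix, the parameter defaults and all function names are assumptions.

```python
import numpy as np

def shrink(x, tau):
    """Soft-thresholding: proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svd_threshold(x, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * shrink(s, tau)) @ vt

def rpca(m, max_iter=1000, tol=1e-7):
    """Decompose m into low_rank + sparse by minimizing ||L||_* + lam * ||S||_1."""
    lam = 1.0 / np.sqrt(max(m.shape))        # common default weight on the sparse term
    mu = m.size / (4.0 * np.abs(m).sum())    # common default penalty parameter
    sparse = np.zeros_like(m)
    dual = np.zeros_like(m)
    for _ in range(max_iter):
        low_rank = svd_threshold(m - sparse + dual / mu, 1.0 / mu)
        sparse = shrink(m - low_rank + dual / mu, lam / mu)
        residual = m - low_rank - sparse
        dual = dual + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(m):
            break
    return low_rank, sparse

# Toy "expression" matrix (made up): rank-2 signal plus a few large spikes of noise.
rng = np.random.default_rng(0)
clean = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
noise = np.zeros_like(clean)
mask = rng.random(clean.shape) < 0.05
noise[mask] = rng.choice([-10.0, 10.0], size=mask.sum())
observed = clean + noise

low_rank, sparse = rpca(observed)
print("relative error of denoised matrix:",
      np.linalg.norm(low_rank - clean) / np.linalg.norm(clean))
```

In a workflow like the one described, the low-rank part would serve as the denoised expression matrix from which the co-expression network is built.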
Affiliation(s)
- Mi-Xiao Hou
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Ying-Lian Gao
  - Library of Qufu Normal University, Qufu Normal University, Rizhao, China
- Jin-Xing Liu
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, China; Co-Innovation Center for Information Supply & Assurance Technology, Anhui University, Hefei, China.
- Ling-Yun Dai
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Xiang-Zhen Kong
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Junliang Shang
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, China