1
Bornhofen E, Fè D, Nagy I, Lenk I, Greve M, Didion T, Jensen CS, Asp T, Janss L. Genetic architecture of inter-specific and -generic grass hybrids by network analysis on multi-omics data. BMC Genomics 2023; 24:213. [PMID: 37095447 PMCID: PMC10127077 DOI: 10.1186/s12864-023-09292-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 04/02/2023] [Indexed: 04/26/2023]
Abstract
BACKGROUND Understanding the mechanisms underlying forage production and its biomass nutritive quality at the omics level is crucial for boosting the output of high-quality dry matter per unit of land. Despite the advent of multiple omics integration for the study of biological systems in major crops, investigations on forage species are still scarce. RESULTS Our results identified substantial changes in gene co-expression and metabolite-metabolite network topologies as a result of genetic perturbation by hybridizing L. perenne with another species within the genus (L. multiflorum) relative to across genera (F. pratensis). However, conserved hub genes and hub metabolomic features were detected between pedigree classes, some of which were highly heritable and displayed one or more significant edges with agronomic traits in a weighted omics-phenotype network. Although hub features tagged relevant biological molecules such as the light-induced rice 1 (LIR1), they were not necessarily better explanatory variables for omics-assisted prediction than randomly sampled features or the full set of available regressors. CONCLUSIONS The utilization of computational techniques for the reconstruction of co-expression networks facilitates the identification of key omic features that serve as central nodes and demonstrate correlation with the manifestation of observed traits. Our results also indicate a robust association between early multi-omic traits measured in a greenhouse setting and phenotypic traits evaluated under field conditions.
Affiliation(s)
- Elesandro Bornhofen
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- Dario Fè
  - Research Division, DLF Seeds A/S, Store Heddinge, Denmark
- Istvan Nagy
  - Center for Quantitative Genetics and Genomics, Aarhus University, Slagelse, Denmark
- Ingo Lenk
  - Research Division, DLF Seeds A/S, Store Heddinge, Denmark
- Morten Greve
  - Research Division, DLF Seeds A/S, Store Heddinge, Denmark
- Thomas Didion
  - Research Division, DLF Seeds A/S, Store Heddinge, Denmark
- Torben Asp
  - Center for Quantitative Genetics and Genomics, Aarhus University, Slagelse, Denmark
- Luc Janss
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
2
Zhang XF, Ou-Yang L, Yan T, Hu XT, Yan H. A Joint Graphical Model for Inferring Gene Networks Across Multiple Subpopulations and Data Types. IEEE Transactions on Cybernetics 2021; 51:1043-1055. [PMID: 31794418 DOI: 10.1109/tcyb.2019.2952711] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 06/10/2023]
Abstract
Reconstructing gene networks from gene expression data is a long-standing challenge. In most applications, the observations can be divided into several distinct but related subpopulations and the gene expression measurements can be collected from multiple data types. Most existing methods are designed to estimate a single gene network from a single dataset. These methods may be suboptimal since they do not exploit the similarities and differences among different subpopulations and data types. In this article, we propose a joint graphical model to estimate multiple gene networks simultaneously. Our model decomposes each subpopulation-specific gene network as a sum of common and unique components and imposes a group lasso penalty on gene networks corresponding to different data types. The gene network variations across subpopulations can be learned automatically by the decompositions of networks, and the similarities and differences among data types can be captured by the group lasso penalty. The simulation studies demonstrate that our method outperforms the state-of-the-art methods. We also apply our method to The Cancer Genome Atlas breast cancer datasets to reconstruct subtype-specific gene networks. Hub nodes in the estimated subnetworks unique to individual cancer subtypes rediscover well-known genes associated with breast cancer subtypes and provide interesting predictions.
3
Tu JJ, Ou-Yang L, Yan H, Zhang XF, Qin H. Joint reconstruction of multiple gene networks by simultaneously capturing inter-tumor and intra-tumor heterogeneity. Bioinformatics 2020; 36:2755-2762. [PMID: 31971577 DOI: 10.1093/bioinformatics/btaa014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/08/2019] [Revised: 12/22/2019] [Accepted: 01/18/2020] [Indexed: 12/27/2022]
Abstract
MOTIVATION Reconstruction of cancer gene networks from gene expression data is important for understanding the mechanisms underlying human cancer. Due to heterogeneity, the tumor tissue samples for a single cancer type can be divided into multiple distinct subtypes (inter-tumor heterogeneity) and are composed of non-cancerous and cancerous cells (intra-tumor heterogeneity). If tumor heterogeneity is ignored when inferring gene networks, the edges specific to individual cancer subtypes and cell types cannot be characterized. However, most existing network reconstruction methods do not simultaneously take inter-tumor and intra-tumor heterogeneity into account. RESULTS In this article, we propose a new Gaussian graphical model-based method for jointly estimating multiple cancer gene networks by simultaneously capturing inter-tumor and intra-tumor heterogeneity. Given gene expression data of heterogeneous samples for different cancer subtypes, a non-cancerous network shared across different cancer subtypes and multiple subtype-specific cancerous networks are estimated jointly. Tumor heterogeneity can be revealed by the difference in the estimated networks. The performance of our method is first evaluated using simulated data, and the results indicate that our method outperforms other state-of-the-art methods. We also apply our method to The Cancer Genome Atlas breast cancer data to reconstruct non-cancerous and subtype-specific cancerous gene networks. Hub nodes in the networks estimated by our method perform important biological functions associated with breast cancer development and subtype classification. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/Zhangxf-ccnu/NETI2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jia-Juan Tu
  - Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China
- Hong Yan
  - Department of Electrical Engineering, City University of Hong Kong, Hong Kong 999077, China
- Xiao-Fei Zhang
  - Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai 200433, China
- Hong Qin
  - Department of Statistics, Hubei Key Laboratory of Mathematical Sciences, School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China; Department of Statistics, Zhongnan University of Economics and Law, Wuhan 430073, China
4
Yuan R, Ou-Yang L, Hu X, Zhang XF. Identifying Gene Network Rewiring Using Robust Differential Graphical Model with Multivariate t-Distribution. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:712-718. [PMID: 30802872 DOI: 10.1109/tcbb.2019.2901473] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/09/2023]
Abstract
Identifying gene network rewiring under different biological conditions is important for understanding the mechanisms underlying complex diseases. Gaussian graphical models, which assume the data follow the multivariate normal distribution, are widely used to identify gene network rewiring. However, the normality assumption often fails in practice because the data are generally contaminated by extreme outliers. In this study, we propose a new robust differential graphical model to identify gene network rewiring between two conditions based on the multivariate t-distribution. The multivariate t-distribution is more robust to outliers than the normal distribution since it has heavy tails and allows values far from the mean. A fused lasso penalty is used to borrow information across conditions to improve the results. We develop an expectation maximization algorithm to solve the optimization model. Experimental results on simulated data show that our method outperforms the state-of-the-art methods. Our method is also applied to identify gene network rewiring between luminal A and basal-like subtypes of breast cancer, and gene network rewiring between the proneural and mesenchymal subtypes of glioblastoma. Several key genes which drive gene network rewiring are discovered.
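The robustness of the multivariate t-distribution mentioned in this abstract can be illustrated with the standard E-step weight used when fitting a multivariate t model by expectation maximization: each sample receives the weight (nu + p) / (nu + d), where d is its squared Mahalanobis distance from the current mean, so extreme samples contribute less to the updates. The sketch below is a generic textbook illustration, not the authors' implementation; the function name `t_weights`, the toy data, and the fixed parameters are assumptions, and the fused lasso penalty is omitted.

```python
import numpy as np

def t_weights(X, mu, sigma, nu):
    """Standard multivariate-t E-step weights: (nu + p) / (nu + d_i)."""
    p = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of every row from mu under covariance sigma
    d = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)
    return (nu + p) / (nu + d)

# Hypothetical toy data: two ordinary samples and one extreme outlier (last row)
X = np.array([[0.1, -0.2], [0.0, 0.3], [8.0, 9.0]])
w = t_weights(X, mu=np.zeros(2), sigma=np.eye(2), nu=4.0)
print(w)  # the outlier's weight is far smaller than the others
```

Under a Gaussian model every sample would carry the same weight, which is why a few extreme values can dominate the estimated covariance.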
5
Mallavarapu T, Hao J, Kim Y, Oh JH, Kang M. Pathway-based deep clustering for molecular subtyping of cancer. Methods 2019; 173:24-31. [PMID: 31247294 DOI: 10.1016/j.ymeth.2019.06.017] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 03/11/2019] [Revised: 05/24/2019] [Accepted: 06/16/2019] [Indexed: 12/22/2022]
Abstract
Cancer is a genetic disease comprising multiple subtypes that have distinct molecular characteristics and clinical features. Cancer subtyping helps improve personalized treatment and decision making, as different cancer subtypes respond differently to treatment. The increasing availability of cancer-related genomic data provides the opportunity to identify molecular subtypes. Several unsupervised machine learning techniques have been applied to molecular data of tumor samples to identify cancer subtypes that are genetically and clinically distinct. However, most clustering methods fail to cluster patients efficiently due to the challenges imposed by high-throughput genomic data and its non-linearity. In this paper, we propose a pathway-based deep clustering method (PACL) for molecular subtyping of cancer, which incorporates gene expression and a biological pathway database to group patients into cancer subtypes. The main contribution of our model is to discover high-level representations of biological data by learning complex hierarchical and nonlinear effects of pathways. We compared the performance of our model with a number of benchmark clustering methods that have recently been proposed for cancer subtyping. We used logrank tests to assess the hypothesis that clusters (subtypes) are associated with different survival outcomes. PACL showed the lowest p-value of the logrank test against the benchmark methods. This suggests that the patient groups clustered by PACL may correspond to subtypes which are significantly associated with distinct survival distributions. Moreover, PACL provides a solution to comprehensively identify subtypes and interpret the model at the biological pathway level. The open-source software of PACL in PyTorch is publicly available at https://github.com/tmallava/PACL.
Affiliation(s)
- Jie Hao
  - Analytics and Data Science, Kennesaw State University, Kennesaw, USA
- Youngsoon Kim
  - Department of Computer Science, Kennesaw State University, Marietta, USA
- Jung Hun Oh
  - Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, USA
- Mingon Kang
  - Analytics and Data Science, Kennesaw State University, Kennesaw, USA; Department of Computer Science, Kennesaw State University, Marietta, USA
6
Vasudevan P, Murugesan T. Cancer Subtype Discovery Using Prognosis-Enhanced Neural Network Classifier in Multigenomic Data. Technol Cancer Res Treat 2018; 17:1533033818790509. [PMID: 30092720 PMCID: PMC6088521 DOI: 10.1177/1533033818790509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 01/03/2023]
Abstract
Objective: The main objective in studying large-scale cancer omics is to identify molecular mechanisms of cancer and discover novel biomedical targets. This work not only discovers cancer subtypes in genome-scale data by using clustering and classification but also measures their accuracy. Methods: First, candidate cancer subtypes are recognized by max-flow/min-cut graph clustering. Then, a prognosis-enhanced neural network classifier is proposed for classification. We analyzed the heterogeneity and identified the subtypes of glioblastoma multiforme, an aggressive adult brain tumor, from 215 samples with microRNA expression (12 042 genes). The samples were classified into 4 classes (mesenchymal, classical, proneural, and neural subtypes) owing to mutations and gene expression. The results are measured using metrics such as silhouette width, biological stability index, clustering accuracy, precision, recall, and f-measure. Results: Max-flow/min-cut clustering produces a higher clustering accuracy of 88.93% for 215 samples. The proposed prognosis-enhanced neural network classifier algorithm efficiently produces a higher accuracy of 89.2% for 215 samples. Conclusion: From the experimental results, the proposed prognosis-enhanced neural network classifier is a promising alternative for cancer subtype prediction in genome-scale data.
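Three of the metrics named in this abstract (precision, recall, and f-measure) follow directly from counts of true positives, false positives, and false negatives for one class. The sketch below shows the standard definitions on a made-up set of subtype labels; the helper name `precision_recall_f1` and the labels are illustrative assumptions and do not reproduce the paper's data or results.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1 from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and q == positive for t, q in pairs)
    fp = sum(t != positive and q == positive for t, q in pairs)
    fn = sum(t == positive and q != positive for t, q in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels for five samples
y_true = ["proneural", "neural", "proneural", "classical", "proneural"]
y_pred = ["proneural", "proneural", "proneural", "classical", "neural"]
prec, rec, f1 = precision_recall_f1(y_true, y_pred, "proneural")
print(prec, rec, f1)
```

For a four-class problem such as the one described, these per-class values are typically averaged across classes; the abstract does not state which averaging the authors used.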
Affiliation(s)
- Thangamani Murugesan
  - Department of Computer Science and Engineering, Kongu Engineering College, Perundurai, Tamilnadu, India
7
Joint L1/2-Norm Constraint and Graph-Laplacian PCA Method for Feature Extraction. BioMed Research International 2017; 2017:5073427. [PMID: 28470011 PMCID: PMC5392409 DOI: 10.1155/2017/5073427] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 12/30/2016] [Revised: 02/12/2017] [Accepted: 03/01/2017] [Indexed: 01/05/2023]
Abstract
Principal Component Analysis (PCA) as a tool for dimensionality reduction is widely used in many areas. In the area of bioinformatics, each involved variable corresponds to a specific gene. In order to improve the robustness of PCA-based methods, this paper proposes a novel graph-Laplacian PCA algorithm that adopts an L1/2 constraint (L1/2 gLPCA) on the error function for feature (gene) extraction. The error function based on the L1/2-norm helps to reduce the influence of outliers and noise. The Augmented Lagrange Multipliers (ALM) method is applied to solve the subproblem. This method achieves better results in feature extraction than other state-of-the-art PCA-based methods. Extensive experimental results on simulation data and gene expression data sets demonstrate that our method achieves higher identification accuracies than others.
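The graph Laplacian that gives graph-Laplacian PCA its name is conventionally defined as L = D - W, where W is a symmetric edge-weight matrix and D is the diagonal matrix of its row sums. The sketch below builds L for a small made-up graph and checks the identity that makes it useful as a regularizer: q^T L q equals the sum of squared differences of q across edges. The graph and the vector `q` are hypothetical, and the abstract does not say how the authors construct their graph.

```python
import numpy as np

# Hypothetical 3-node graph with unit-weight edges (0,1) and (0,2)
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalized graph Laplacian

q = np.array([1.0, 2.0, 4.0])  # one coordinate of a low-dimensional embedding
smoothness = q @ L @ q         # equals the sum over edges of (q_i - q_j)^2
print(L)
print(smoothness)
```

A small value of this quadratic form means connected samples receive similar embedding coordinates, which is the smoothness that Laplacian regularization encourages.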
8
Disease biomarker identification from gene network modules for metastasized breast cancer. Sci Rep 2017; 7:1072. [PMID: 28432361 PMCID: PMC5430701 DOI: 10.1038/s41598-017-00996-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 01/25/2017] [Accepted: 03/21/2017] [Indexed: 12/13/2022]
Abstract
Advancement in science has tended to improve treatment of fatal diseases such as cancer. A major concern in the area is the spread of cancerous cells, technically referred to as metastasis, into other organs beyond the primary organ. Treatment at such a stage of cancer is extremely difficult and usually palliative only. In this study, we focus on finding gene-gene network modules which are functionally similar in nature in the case of breast cancer. These modules, extracted during the disease progression stages, are analyzed using p-values and their associated pathways. We also explore interesting patterns associated with the causal genes, viz., SCGB1D2, MET, CYP1B1 and MMP9, in terms of expression similarity and pathway contexts. We analyze the genes involved in both stages (non-metastasis and metastasis), the change in their expression values, and their associated pathways and roles as the disease progresses from one stage to another. We discover three additional pathways, viz., Glycerophospholipid metabolism, h-Efp pathway, and CARM1 and Regulation of Estrogen Receptor, which can be related to the metastasis phase of breast cancer. These new pathways can be further explored to identify their relevance during the progression of the disease.
9
Wang D, Liu JX, Gao YL, Zheng CH, Xu Y. Characteristic Gene Selection Based on Robust Graph Regularized Non-Negative Matrix Factorization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:1059-1067. [PMID: 26672047 DOI: 10.1109/tcbb.2015.2505294] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
Many methods have been considered for gene selection and analysis of gene expression data. Nonetheless, there is still considerable room for improving the explicitness and reliability of gene selection. To this end, this paper proposes a novel method named robust graph regularized non-negative matrix factorization for characteristic gene selection using gene expression data, which has two main aspects. First, it enforces L21-norm minimization on the error function, which is robust to outliers and noise in data points. Second, it considers that the samples lie on a low-dimensional manifold embedded in a high-dimensional ambient space, and reveals the geometric structure embedded in the original data. To demonstrate the validity of the proposed method, we apply it to gene expression data sets involving various human normal and tumor tissue samples, and the results demonstrate that the method is effective and feasible.
10
Differential network analysis from cross-platform gene expression data. Sci Rep 2016; 6:34112. [PMID: 27677586 PMCID: PMC5039701 DOI: 10.1038/srep34112] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 06/24/2016] [Accepted: 09/07/2016] [Indexed: 01/18/2023]
Abstract
Understanding how the structure of the gene dependency network changes between two patient-specific groups is an important task for genomic research. Although many computational approaches have been proposed to undertake this task, most of them estimate correlation networks from group-specific gene expression data independently without considering the common structure shared between different groups. In addition, with the development of high-throughput technologies, we can collect gene expression profiles of the same patients from multiple platforms. Therefore, inferring differential networks by considering cross-platform gene expression profiles will improve the reliability of network inference. We introduce a two-dimensional joint graphical lasso (TDJGL) model to simultaneously estimate group-specific gene dependency networks from gene expression profiles collected from different platforms and infer differential networks. TDJGL can borrow strength across different patient groups and data platforms to improve the accuracy of estimated networks. Simulation studies demonstrate that TDJGL provides more accurate estimates of gene networks and differential networks than previous competing approaches. We apply TDJGL to the PI3K/AKT/mTOR pathway in ovarian tumors to build differential networks associated with platinum resistance. The hub genes of our inferred differential networks are significantly enriched with known platinum resistance-related genes and include potential platinum resistance-related genes.
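In Gaussian graphical models, a common way to read a differential network off two group-specific estimates is to subtract the precision matrices and keep the entries that differ. The sketch below shows only that final step on two small made-up precision matrices; it is not the TDJGL estimator, and the function name `differential_edges`, the tolerance, and the matrices are assumptions for illustration.

```python
import numpy as np

def differential_edges(theta_a, theta_b, tol=1e-8):
    """List (i, j, difference) for gene pairs whose partial dependency differs."""
    delta = theta_a - theta_b
    p = delta.shape[0]
    return [(i, j, float(delta[i, j]))
            for i in range(p) for j in range(i + 1, p)
            if abs(delta[i, j]) > tol]

# Hypothetical precision matrices for two patient groups (3 genes)
theta_a = np.array([[1.0, 0.4, 0.0],
                    [0.4, 1.0, 0.2],
                    [0.0, 0.2, 1.0]])
theta_b = np.array([[1.0, 0.4, 0.3],
                    [0.4, 1.0, 0.0],
                    [0.3, 0.0, 1.0]])
edges = differential_edges(theta_a, theta_b)
print(edges)  # the shared edge (0, 1) does not appear
```

Joint estimators matter here because independently estimated matrices rarely agree exactly on shared edges, so a naive difference would be dominated by estimation noise.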
11
An NMF-L2,1-Norm Constraint Method for Characteristic Gene Selection. PLoS One 2016; 11:e0158494. [PMID: 27428058 PMCID: PMC4948826 DOI: 10.1371/journal.pone.0158494] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 08/21/2015] [Accepted: 06/16/2016] [Indexed: 11/30/2022]
Abstract
Recent research has demonstrated that characteristic gene selection based on gene expression data remains faced with considerable challenges. This is primarily because gene expression data are typically high dimensional, negative, non-sparse and noisy. However, existing methods for data analysis are able to cope with only some of these challenges. In this paper, we address all of these challenges with a unified method: nonnegative matrix factorization via the L2,1-norm (NMF-L2,1). While L2,1-norm minimization is applied to both the error function and the regularization term, our method is robust to outliers and noise in the data and generates sparse results. The application of our method to plant and tumor gene expression data demonstrates that NMF-L2,1 can extract more characteristic genes than other existing state-of-the-art methods.
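The L2,1-norm referred to in this abstract is commonly defined as the sum of the Euclidean norms of a matrix's rows, which is why minimizing it tends to drive entire rows to zero and yields row-sparse results. The sketch below computes it with NumPy on a made-up matrix; the function name `l21_norm` and the row-wise convention are assumptions, since some papers sum over columns instead.

```python
import numpy as np

def l21_norm(M):
    """Sum of the Euclidean (L2) norms of the rows of M."""
    return float(np.linalg.norm(M, axis=1).sum())

# Hypothetical matrix: row norms are 5, 0, and 13
M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(M))  # 18.0
```

Unlike the squared Frobenius norm, each row's residual enters without squaring, so a single badly fitted row inflates the objective less.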
12
Gao C, Zhu Y, Shen X, Pan W. Estimation of multiple networks in Gaussian mixture models. Electron J Stat 2016; 10:1133-1154. [PMID: 28966702 PMCID: PMC5620020 DOI: 10.1214/16-ejs1135] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 02/02/2023]
Abstract
We aim to estimate multiple networks in the presence of sample heterogeneity, where the independent samples (i.e. observations) may come from different and unknown populations or distributions. Specifically, we consider penalized estimation of multiple precision matrices in the framework of a Gaussian mixture model. A major innovation is to take advantage of the commonalities across the multiple precision matrices through possibly nonconvex fusion regularization, which for example makes it possible to achieve simultaneous discovery of unknown disease subtypes and detection of differential gene (dys)regulations in functional genomics. We embed in the EM algorithm one of two recently proposed methods for estimating multiple precision matrices in Gaussian graphical models. We demonstrate the feasibility and potential usefulness of the proposed methods in an application to glioblastoma subtype discovery and differential gene network analysis with a microarray gene expression data set. We also conduct realistic simulation studies to evaluate and compare the performance of various methods.
Affiliation(s)
- Chen Gao
  - Division of Biostatistics, School of Public Health, University of Minnesota
- Wei Pan
  - Division of Biostatistics, School of Public Health, University of Minnesota
13
Acharya S, Saha S. Importance of proximity measures in clustering of cancer and miRNA datasets: proposal of an automated framework. Molecular BioSystems 2016; 12:3478-3501. [DOI: 10.1039/c6mb00609d] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 01/29/2023]
Abstract
Distance plays an important role in the clustering process for allocating data points to different clusters.
Affiliation(s)
- Sudipta Acharya
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
- Sriparna Saha
  - Department of Computer Science and Engineering, Indian Institute of Technology Patna, India
14
Al-Harazi O, Al Insaif S, Al-Ajlan MA, Kaya N, Dzimiri N, Colak D. Integrated Genomic and Network-Based Analyses of Complex Diseases and Human Disease Network. J Genet Genomics 2015; 43:349-67. [PMID: 27318646 DOI: 10.1016/j.jgg.2015.11.002] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 06/11/2015] [Revised: 10/22/2015] [Accepted: 11/20/2015] [Indexed: 12/16/2022]
Abstract
A disease phenotype generally reflects various pathobiological processes that interact in a complex network. The highly interconnected nature of the human protein interaction network (interactome) indicates that, at the molecular level, it is difficult to consider diseases as being independent of one another. Recently, genome-wide molecular measurements, data mining and bioinformatics approaches have provided the means to explore human diseases from a molecular basis. The exploration of diseases and a system of disease relationships based on the integration of genome-wide molecular data with the human interactome could offer a powerful perspective for understanding the molecular architecture of diseases. Recently, subnetwork markers have proven to be more robust and reliable than individual biomarker genes selected based on gene expression profiles alone, and achieve higher accuracy in disease classification. We have applied one of these methodologies to idiopathic dilated cardiomyopathy (IDCM) data that we have generated using a microarray and identified significant subnetworks associated with the disease. In this paper, we review the recent endeavours in this direction, and summarize the existing methodologies and computational tools for network-based analysis of complex diseases and molecular relationships among apparently different disorders and human disease network. We also discuss the future research trends and topics of this promising field.
Affiliation(s)
- Olfat Al-Harazi
  - Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia
- Sadiq Al Insaif
  - Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia
- Monirah A Al-Ajlan
  - Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia; College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
- Namik Kaya
  - Department of Genetics, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia
- Nduna Dzimiri
  - Department of Genetics, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia
- Dilek Colak
  - Department of Biostatistics, Epidemiology and Scientific Computing, King Faisal Specialist Hospital and Research Centre, Riyadh 11211, Saudi Arabia
15
Liu J, Liu JX, Gao YL, Kong XZ, Wang XS, Wang D. A P-Norm Robust Feature Extraction Method for Identifying Differentially Expressed Genes. PLoS One 2015. [PMID: 26201006 PMCID: PMC4511795 DOI: 10.1371/journal.pone.0133124] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 02/03/2023]
Abstract
In current molecular biology, it is increasingly important to identify, from gene expression data, differentially expressed genes that are closely correlated with a key biological process. In this paper, based on the Schatten p-norm and Lp-norm, a novel p-norm robust feature extraction method is proposed to identify the differentially expressed genes. In our method, the Schatten p-norm is used as the regularization function to obtain a low-rank matrix and the Lp-norm is taken as the error function to improve the robustness to outliers in the gene expression data. The results on simulation data show that our method can obtain higher identification accuracies than the competitive methods. Numerous experiments on real gene expression data sets demonstrate that our method can identify more differentially expressed genes than the others. Moreover, we confirmed that the identified genes are closely correlated with the corresponding gene expression data.
Affiliation(s)
- Jian Liu
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, Shandong, China
  - School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221000, Jiangsu, China
- Jin-Xing Liu
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, Shandong, China
  - Bio-Computing Research Center, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, 518055, Guangdong, China
- Ying-Lian Gao
  - Library of Qufu Normal University, Qufu Normal University, Rizhao, 276826, Shandong, China
- Xiang-Zhen Kong
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, Shandong, China
- Xue-Song Wang
  - School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221000, Jiangsu, China
  - The Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation Chinese Academy of Sciences, Beijing, 100000, China
- Dong Wang
  - School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, Shandong, China
16
Yang L, Ainali C, Tsoka S, Papageorgiou LG. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework. BMC Bioinformatics 2014; 15:390. [PMID: 25475756 PMCID: PMC4269079 DOI: 10.1186/s12859-014-0390-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/04/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular method to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches, in which the expression of each gene was treated independently, suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. RESULTS A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of expression of its constituent genes. Gene weights are determined by the optimisation model, in a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. CONCLUSIONS The model was evaluated through a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well in multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
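The composite feature described in this abstract, pathway activity as a weighted linear summation of the expression of a pathway's constituent genes, reduces to a matrix-vector product once the gene weights are known. The sketch below computes it for made-up values; the expression matrix and the weights are hypothetical, and the optimisation model that the authors use to choose the weights is not reproduced.

```python
import numpy as np

# Hypothetical expression matrix: 2 samples (rows) x 3 pathway genes (columns)
expr = np.array([[2.0, 1.0, 0.5],
                 [0.5, 3.0, 1.0]])

# Hypothetical gene weights; in the paper these come from an optimisation model
weights = np.array([0.5, -0.25, 1.0])

# Pathway activity: one scalar per sample, the weighted sum of its gene expression
activity = expr @ weights
print(activity)  # [1.25 0.5]
```

Repeating this for every pathway gives a samples-by-pathways matrix, which is the low-dimensional profile on which the classification is then performed.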
Affiliation(s)
- Lingjian Yang
  - Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK
- Chrysanthi Ainali
  - Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK
- Sophia Tsoka
  - Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK
- Lazaros G Papageorgiou
  - Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK