1
|
Chen J, Wen B. Bi-level gene selection of cancer by combining clustering and sparse learning. Comput Biol Med 2024; 172:108236. [PMID: 38471351 DOI: 10.1016/j.compbiomed.2024.108236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 02/07/2024] [Accepted: 02/25/2024] [Indexed: 03/14/2024]
Abstract
Cancer diagnosis based on gene expression profile data has attracted extensive attention in biomedical science. Such data are typically high-dimensional and noisy. In this paper, a hybrid gene selection method based on clustering and sparse learning is proposed to choose the key genes with high precision. We first propose a filter method that combines the k-means clustering algorithm with signal-to-noise ratio ranking; a weighted gene co-expression network is then applied to the reduced data set to identify modules corresponding to biological pathways. Next, we choose the key genes by using group bridge and sparse group lasso as wrapper methods. Finally, we conduct numerical experiments on six cancer datasets. The numerical results show that the proposed method performs well in gene selection and cancer classification.
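The filter step described in this abstract can be illustrated with a minimal sketch. It assumes the common two-class signal-to-noise ratio, |mean_a - mean_b| / (sd_a + sd_b), and assumes that the top-scoring gene is kept from each gene cluster; the abstract does not state how the two pieces are combined, so both choices are assumptions. The cluster assignments would come from k-means on the gene profiles and are hard-coded here; the function names and toy data are hypothetical.

```python
from statistics import mean, stdev


def snr_scores(expr, labels):
    """Two-class signal-to-noise ratio per gene.

    expr: dict mapping gene name -> list of expression values (one per sample).
    labels: list of 0/1 class labels, aligned with the samples.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        denom = stdev(a) + stdev(b)
        # Guard against constant genes, whose standard deviations are zero.
        scores[gene] = abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0
    return scores


def top_gene_per_cluster(scores, clusters):
    """Keep the highest-SNR gene in each cluster (assumed selection rule).

    clusters: dict mapping gene name -> cluster id, e.g. from k-means.
    """
    best = {}
    for gene, cid in clusters.items():
        if cid not in best or scores[gene] > scores[best[cid]]:
            best[cid] = gene
    return best


# Toy data: three genes, four samples (two per class). Values are made up.
expr = {
    "g1": [1.0, 1.2, 5.0, 5.2],
    "g2": [2.0, 2.1, 2.0, 2.1],
    "g3": [1.0, 3.0, 2.0, 4.0],
}
labels = [0, 0, 1, 1]
clusters = {"g1": 0, "g2": 0, "g3": 1}  # stand-in for a k-means result

scores = snr_scores(expr, labels)
selected = top_gene_per_cluster(scores, clusters)
```

Here `g1` separates the classes far better than `g2`, so it represents cluster 0, while `g3` is the only member of cluster 1.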
Affiliation(s)
- Junnan Chen
- School of Science, Hebei University of Technology, Tianjin, PR China.
- Bo Wen
- Institute of Mathematics, Hebei University of Technology, Tianjin, PR China.
|
2
|
Wu Y, Sa Y, Guo Y, Li Q, Zhang N. Identification of WHO II/III gliomas by 16 prognostic-related gene signatures using machine learning methods. Curr Med Chem 2021; 29:1622-1639. [PMID: 34455959 DOI: 10.2174/0929867328666210827103049] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/08/2021] [Revised: 05/27/2021] [Accepted: 05/28/2021] [Indexed: 11/22/2022]
Abstract
BACKGROUND Clinical observation shows that prognosis differs widely among gliomas of the same World Health Organization (WHO) grade, II or III. Therefore, a better understanding of the genetics and molecular mechanisms underlying WHO grade II and III gliomas is required, with the aim of developing a classification scheme at the molecular level rather than the conventional pathological morphology level. METHOD We performed survival analysis combined with the Least Absolute Shrinkage and Selection Operator (LASSO) machine learning method, using expression datasets downloaded from the Chinese Glioma Genome Atlas and The Cancer Genome Atlas. Risk scores were calculated by multiplying the expression level of each overall survival-related gene by its multivariate Cox proportional hazards regression coefficient. WHO grade II and III gliomas were categorized into low-risk, medium-risk, and high-risk subgroups. We used the 16 prognostic-related genes as input features to build a prognosis-based classification model using a fully connected neural network. Gene function annotations were also performed. RESULTS The 16 genes (AKNAD1, C7orf13, CDK20, CHRFAM7A, CHRNA1, EFNB1, GAS1, HIST2H2BE, KCNK3, KLHL4, LRRK2, NXPH3, PIGZ, SAMD5, SERINC2, and SIX6) related to glioma prognosis were screened. The 16 selected genes were associated with the development of gliomas and carcinogenesis. The fully connected neural network model from the two cohorts reached an accuracy of 95.5% on an external validation data set. Our method has good potential for classifying WHO grade II and III gliomas into low-risk, medium-risk, and high-risk subgroups. The subgroups showed significant (P<0.01) differences in overall survival. CONCLUSION We identified 16 genes related to the prognosis of gliomas and developed a computational method to discriminate WHO grade II and III gliomas into three subgroups with distinct prognoses. The gene expression-based method provides a reliable alternative for determining the prognosis of gliomas.
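The risk-score construction in this abstract can be sketched as follows. The sketch assumes the per-gene products of Cox coefficient and expression level are summed into one score per patient, which is the usual form of such a score, and that two cutoffs split patients into three subgroups. The abstract gives neither the coefficients nor the cutoffs, so the gene names, coefficient values, and thresholds below are hypothetical placeholders rather than the authors' values.

```python
def risk_score(expression, coefficients):
    """Sum of coefficient * expression over the signature genes.

    expression: dict mapping gene name -> expression level for one patient.
    coefficients: dict mapping gene name -> multivariate Cox coefficient.
    Genes outside the signature are ignored.
    """
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def assign_subgroup(score, low_cut, high_cut):
    """Map a risk score to one of three subgroups using two cutoffs."""
    if score < low_cut:
        return "low-risk"
    if score < high_cut:
        return "medium-risk"
    return "high-risk"


# Placeholder signature and patient profile; all values are made up.
coefficients = {"gene_a": 0.5, "gene_b": -0.25}
patient = {"gene_a": 4.0, "gene_b": 2.0, "gene_c": 9.0}

score = risk_score(patient, coefficients)      # 0.5*4.0 + (-0.25)*2.0
subgroup = assign_subgroup(score, 0.0, 1.0)    # cutoffs are hypothetical
```

In practice the cutoffs would be derived from the training cohort, for example from score quantiles, before being applied to a validation set.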
Affiliation(s)
- YaMeng Wu
- Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Sa
- Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Yu Guo
- Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- QiFeng Li
- Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
- Ning Zhang
- Department of Biomedical Engineering, Tianjin Key Lab of BME Measurement, Tianjin University, Tianjin, China
|
3
|
Acharya S, Saha S, Nikhil N. Unsupervised gene selection using biological knowledge: application in sample clustering. BMC Bioinformatics 2017; 18:513. [PMID: 29166852 PMCID: PMC5700545 DOI: 10.1186/s12859-017-1933-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 08/08/2017] [Accepted: 11/08/2017] [Indexed: 11/10/2022]
Abstract
Background Classification of biological samples from gene expression data is a basic building block in solving several problems in bioinformatics, such as diagnosing cancer and other diseases and making a proper treatment plan. One big challenge in sample classification is handling high-dimensional and redundant gene expression data. To reduce the complexity of handling such data, gene/feature selection plays a major role. Results The current paper explores the use of biological knowledge acquired from the Gene Ontology database in selecting a proper subset of genes that can then be used in clustering samples. The proposed feature selection technique is unsupervised, as it does not use any class label information in the process of gene selection. Finally, a multi-objective clustering approach is deployed to cluster the available set of samples in the reduced gene space. Conclusions Reported results show that considering biological knowledge in the gene selection technique not only reduces the feature space dimensionality to a great extent but also improves the accuracy of sample classification. The obtained reduced gene space is validated using strong biological significance tests. To demonstrate the superiority of our proposed gene selection-based sample clustering technique, a thorough comparative analysis has also been performed with state-of-the-art techniques.
Affiliation(s)
- Sudipta Acharya
- IIT Patna, Department of Computer Science and engineering, Patna, India.
- Sriparna Saha
- IIT Patna, Department of Computer Science and engineering, Patna, India
- N Nikhil
- IIT Ropar, Department of Computer Science and engineering, Punjab, India
|
4
|
Delgado E, Boisen MM, Laskey R, Chen R, Song C, Sallit J, Yochum ZA, Andersen CL, Sikora MJ, Wagner J, Safe S, Elishaev E, Lee A, Edwards RP, Haluska P, Tseng G, Schurdak M, Oesterreich S. High expression of orphan nuclear receptor NR4A1 in a subset of ovarian tumors with worse outcome. Gynecol Oncol 2016; 141:348-356. [PMID: 26946093 DOI: 10.1016/j.ygyno.2016.02.030] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/04/2015] [Revised: 02/18/2016] [Accepted: 02/22/2016] [Indexed: 12/21/2022]
Abstract
OBJECTIVE Nuclear receptors (NRs) play a vital role in the development and progression of several cancers including breast and prostate. Using TCGA data, we sought to identify critical nuclear receptors in high grade serous ovarian cancers (HGSOC) and to confirm these findings using in vitro approaches. METHODS In silico analysis of TCGA data was performed to identify relevant NRs in HGSOC. Ovarian cancer cell lines were screened for NR expression and functional studies were performed to determine the significance of these NRs in ovarian cancers. NR expression was analyzed in ovarian cancer tissue samples using immunohistochemistry to identify correlations with histology and stage of disease. RESULTS The NR4A family of NRs was identified as a potential driver of ovarian cancer pathogenesis. Overexpression of NR4A1 in particular correlated with worse progression free survival. Endogenous expression of NR4A1 in normal ovarian samples was relatively high compared to that of other tissue types, suggesting a unique role for this orphan receptor in the ovary. Expression of NR4A1 in HGSOC cell lines as well as in patient samples was variable. NR4A1 primarily localized to the nucleus in normal ovarian tissue while co-localization within the cytoplasm and nucleus was noted in ovarian cancer cell lines and patient tissues. CONCLUSIONS NR4A1 is highly expressed in a subset of HGSOC samples from patients that have a worse progression free survival. Studies to target NR4A1 for therapeutic intervention should include HGSOC.
MESH Headings
- Animals
- Carcinoma, Ovarian Epithelial
- Cell Line, Tumor
- Female
- Genome
- Heterografts
- Humans
- Immunohistochemistry
- Mice
- Mice, SCID
- Neoplasms, Glandular and Epithelial/genetics
- Neoplasms, Glandular and Epithelial/metabolism
- Nuclear Receptor Subfamily 4, Group A, Member 1/biosynthesis
- Nuclear Receptor Subfamily 4, Group A, Member 1/genetics
- Ovarian Neoplasms/genetics
- Ovarian Neoplasms/metabolism
- Prognosis
- RNA, Messenger/genetics
- RNA, Messenger/metabolism
Affiliation(s)
- Evan Delgado
- University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
- Michelle M Boisen
- Division of Gynecologic Oncology, Magee-Womens Hospital of the University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Robin Laskey
- Division of Gynecologic Oncology, Magee-Womens Hospital of the University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Rui Chen
- Department of Biostatistics and Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
- Chi Song
- Department of Biostatistics and Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
- Zachary A Yochum
- Department of Medicine, Division of Hematology Oncology, University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
- Courtney L Andersen
- Department of Pharmacology and Chemical Biology, Womens Cancer Research Center, Magee-Womens Research Institute, and University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA; Molecular Pharmacology Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA
- Matthew J Sikora
- Department of Pharmacology and Chemical Biology, Womens Cancer Research Center, Magee-Womens Research Institute, and University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
- Jacob Wagner
- University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
- Stephen Safe
- Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX, USA
- Esther Elishaev
- Department of Pathology, Magee-Womens Hospital of the University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Adrian Lee
- Department of Pharmacology and Chemical Biology, Womens Cancer Research Center, Magee-Womens Research Institute, and University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
- Robert P Edwards
- Division of Gynecologic Oncology, Magee-Womens Hospital of the University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Paul Haluska
- Department of Oncology and Pharmacology, Mayo Clinic, Rochester, MN, USA
- George Tseng
- Department of Biostatistics and Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
- Mark Schurdak
- University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
- Steffi Oesterreich
- Department of Pharmacology and Chemical Biology, Womens Cancer Research Center, Magee-Womens Research Institute, and University of Pittsburgh Cancer Institute, Pittsburgh, PA, USA
|
5
|
Acharya S, Saha S, Thadisina Y. Multiobjective Simulated Annealing-Based Clustering of Tissue Samples for Cancer Diagnosis. IEEE J Biomed Health Inform 2016; 20:691-8. [DOI: 10.1109/jbhi.2015.2404971] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/09/2022]
|
6
|
Acharya S, Saha S. Importance of proximity measures in clustering of cancer and miRNA datasets: proposal of an automated framework. Mol Biosyst 2016; 12:3478-3501. [DOI: 10.1039/c6mb00609d] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/29/2023]
Abstract
Distance plays an important role in the clustering process for allocating data points to different clusters.
Affiliation(s)
- Sudipta Acharya
- Department of Computer Science and Engineering
- Indian Institute of Technology Patna
- India
- Sriparna Saha
- Department of Computer Science and Engineering
- Indian Institute of Technology Patna
- India
|
7
|
Diaz-Cano SJ. Pathological bases for a robust application of cancer molecular classification. Int J Mol Sci 2015; 16:8655-75. [PMID: 25898411 PMCID: PMC4425102 DOI: 10.3390/ijms16048655] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 01/28/2015] [Accepted: 04/07/2015] [Indexed: 12/12/2022]
Abstract
Any robust classification system depends on its purpose and must refer to accepted standards; its strength relies on predictive values and a careful consideration of known factors that can affect its reliability. In this context, a molecular classification of human cancer must refer to the current gold standard (histological classification) and try to improve it with key prognosticators for metastatic potential, staging and grading. Although organ-specific examples have been published based on proteomics, transcriptomics and genomics evaluations, the most popular approach uses gene expression analysis as a direct correlate of cellular differentiation, which represents the key feature of the histological classification. RNA, however, is a labile molecule: it varies significantly with the preservation protocol, its transcription reflects the adaptation of the tumor cells to the microenvironment, it can be passed through mechanisms of intercellular transference of genetic information (exosomes), and it is exposed to epigenetic modifications. More robust classifications should be based on stable molecules, represented at the genetic level by DNA, to improve reliability, and their analysis must deal with the concept of intratumoral heterogeneity, which is at the origin of tumor progression and is the byproduct of the selection process during the clonal expansion and progression of neoplasms. The simultaneous analysis of multiple DNA targets and next generation sequencing offer the best practical approach for an analytical genomic classification of tumors.
Affiliation(s)
- Salvador J Diaz-Cano
- King's Health Partners, Cancer Studies, King's College Hospital-Viapath, Denmark Hill, London SE5-9RS, UK.
|
8
|
Wang HW, Sun HJ, Chang TY, Lo HH, Cheng WC, Tseng GC, Lin CT, Chang SJ, Pal N, Chung IF. Discovering monotonic stemness marker genes from time-series stem cell microarray data. BMC Genomics 2015; 16 Suppl 2:S2. [PMID: 25708300 PMCID: PMC4331716 DOI: 10.1186/1471-2164-16-s2-s2] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
Background Identification of genes with ascending or descending monotonic expression patterns over time or stages of stem cells is an important issue in time-series microarray data analysis. We propose a method named Monotonic Feature Selector (MFSelector) based on a concept of total discriminating error (DEtotal) to identify monotonic genes. MFSelector considers various time stages in stage order (i.e., Stage One vs. other stages, Stages One and Two vs. remaining stages and so on) and computes DEtotal of each gene. MFSelector can successfully identify genes with monotonic characteristics. Results We have demonstrated the effectiveness of MFSelector on two synthetic data sets and two stem cell differentiation data sets: embryonic stem cell neurogenesis (ESCN) and embryonic stem cell vasculogenesis (ESCV) data sets. We have also performed extensive quantitative comparisons of the three monotonic gene selection approaches. Some of the monotonic marker genes such as OCT4, NANOG, BLBP, discovered from the ESCN dataset exhibit consistent behavior with that reported in other studies. The role of monotonic genes found by MFSelector in either stemness or differentiation is validated using information obtained from Gene Ontology analysis and other literature. We justify and demonstrate that descending genes are involved in the proliferation or self-renewal activity of stem cells, while ascending genes are involved in differentiation of stem cells into variant cell lineages. Conclusions We have developed a novel system, easy to use even with no pre-existing knowledge, to identify gene sets with monotonic expression patterns in multi-stage as well as in time-series genomics matrices. The case studies on ESCN and ESCV have helped to get a better understanding of stemness and differentiation. The novel monotonic marker genes discovered from a data set are found to exhibit consistent behavior in another independent data set, demonstrating the utility of the proposed method. 
The MFSelector R function and data sets can be downloaded from: http://microarray.ym.edu.tw/tools/MFSelector/.
|
10
|
Rajapakse JC, Mundra PA. Multiclass gene selection using Pareto-fronts. IEEE/ACM Trans Comput Biol Bioinform 2013; 10:87-97. [PMID: 23702546 DOI: 10.1109/tcbb.2013.1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 06/02/2023]
Abstract
Filter methods are often used for selection of genes in multiclass sample classification by using microarray data. Such techniques usually tend to bias toward a few classes that are easily distinguishable from other classes due to imbalances of strong features and sample sizes of different classes. It could therefore lead to selection of redundant genes while missing the relevant genes, leading to poor classification of tissue samples. In this manuscript, we propose to decompose multiclass ranking statistics into class-specific statistics and then use Pareto-front analysis for selection of genes. This alleviates the bias induced by class intrinsic characteristics of dominating classes. The use of Pareto-front analysis is demonstrated on two filter criteria commonly used for gene selection: F-score and KW-score. A significant improvement in classification performance and reduction in redundancy among top-ranked genes were achieved in experiments with both synthetic and real-benchmark data sets.
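The Pareto-front idea in this abstract can be shown with a small sketch. It assumes each gene already has one score per class (the class-specific decomposition of an F-score or KW-score is not reproduced here) and that larger scores are better. A gene is kept when no other gene is at least as good in every class and strictly better in at least one. The function names and score values are hypothetical.

```python
def dominates(u, v):
    """True if score vector u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(class_scores):
    """Return the genes whose class-specific score vectors are non-dominated.

    class_scores: dict mapping gene name -> tuple of per-class scores.
    """
    front = []
    for gene, score in class_scores.items():
        dominated = any(
            dominates(other_score, score)
            for other, other_score in class_scores.items()
            if other != gene
        )
        if not dominated:
            front.append(gene)
    return front


# Made-up per-class scores for four genes and two classes.
class_scores = {
    "g1": (0.9, 0.1),  # strong for class 1 only
    "g2": (0.2, 0.8),  # strong for class 2 only
    "g3": (0.1, 0.1),  # weak everywhere, dominated by g1
    "g4": (0.5, 0.5),  # balanced, not dominated
}

front = pareto_front(class_scores)
```

A single aggregated ranking could favour only `g1`; the front also keeps `g2` and `g4`, which matches the abstract's goal of reducing bias toward easily separated classes.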
|
11
|
Wu MY, Dai DQ, Shi Y, Yan H, Zhang XF. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:1649-1662. [PMID: 22868679 DOI: 10.1109/tcbb.2012.105] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 06/01/2023]
Abstract
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, the gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, rather than the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, so its optimal value can be found simply by evaluating it at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
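The breakpoint argument in this abstract can be illustrated with a simplified one-dimensional stand-in, which is not the LNB-MS objective itself. Assume centred data, so that shrinking a class mean to zero removes the feature, and consider f(m) = sum_i |x_i - m| + lam * |m|. This function is convex and piecewise linear in m, so a minimum is attained at one of its breakpoints: a data value x_i or zero. The function names, data, and penalty values below are hypothetical.

```python
def objective(m, xs, lam):
    """Absolute-deviation loss plus an L1 penalty on the location m."""
    return sum(abs(x - m) for x in xs) + lam * abs(m)


def shrunken_center(xs, lam):
    """Minimise the piecewise-linear objective by checking its breakpoints only.

    The breakpoints are the data values and zero (the kink of the penalty).
    """
    candidates = list(xs) + [0.0]
    return min(candidates, key=lambda m: objective(m, xs, lam))


xs = [1.0, 2.0, 3.0]                 # made-up centred expression values
mild = shrunken_center(xs, 0.5)      # weak penalty: stays at the median
strong = shrunken_center(xs, 5.0)    # strong penalty: shrinks to zero
```

With the weak penalty the estimate stays at the median; with the strong penalty it collapses to zero, which in this simplified setting corresponds to the feature being dropped.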
Affiliation(s)
- Meng-Yun Wu
- Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
|
12
|
Tsai YS, Aguan K, Pal NR, Chung IF. Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index. PLoS One 2011; 6:e24259. [PMID: 21909426 PMCID: PMC3164723 DOI: 10.1371/journal.pone.0024259] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/18/2011] [Accepted: 08/06/2011] [Indexed: 01/06/2023]
Abstract
Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low computational complexity, and efficient in identifying both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot easily identify multiple-class specific signature genes. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small as well as when the class sizes are significantly different. To compare effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference to published studies. We find that the identified genes participate in the pathways directly involved in cancer development in leukemia data.
Our method gives a promising way to find genes that may be involved in pathways of multiple diseases, and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases.
Affiliation(s)
- Yu-Shuen Tsai
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Kripamoy Aguan
- Department of Biotechnology & Bioinformatics, North Eastern Hill University, Shillong, India
- Nikhil R. Pal
- Electronics & Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
- * E-mail: (I-FC); (NRP)
- I-Fang Chung
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei, Taiwan
- * E-mail: (I-FC); (NRP)
|