1. R. S, B.R. N, Radhakrishnan R, P. S. Computational intelligence for early detection of infertility in women. Engineering Applications of Artificial Intelligence 2024; 127:107400. [DOI: 10.1016/j.engappai.2023.107400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2025]
2. Li C, Wang X, Du G, Chen H, Brown G, Lewis MM, Yao T, Li R, Huang X. Folded concave penalized learning of high-dimensional MRI data in Parkinson's disease. J Neurosci Methods 2021; 357:109157. [PMID: 33781789 PMCID: PMC10871067 DOI: 10.1016/j.jneumeth.2021.109157] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/01/2020] [Revised: 03/17/2021] [Accepted: 03/22/2021] [Indexed: 12/21/2022]
Abstract
BACKGROUND Brain MRI is a promising technique for Parkinson's disease (PD) biomarker development. Its analysis, however, is hindered by the high-dimensional nature of the data, particularly when the sample size is relatively small. NEW METHOD This study introduces a folded concave penalized machine learning scheme with spatial coupling fused penalty (fused FCP) to build biomarkers for PD directly from whole-brain voxel-wise MRI data. The penalized maximum likelihood estimation problem of the model is solved by local linear approximation. RESULTS The proposed approach is evaluated on synthetic and Parkinson's Progression Marker Initiative (PPMI) data. It achieves good AUC scores, accuracy in classification, and biomarker identification with a relatively small sample size, and the results are robust for different tuning parameter choices. On the PPMI data, the proposed method discovers over 80 % of large regions of interest (ROIs) identified by the voxel-wise method, as well as potential new ROIs. COMPARISON WITH EXISTING METHODS The fused FCP approach is compared with L1, fused-L1, and FCP method using three popular machine learning algorithms, logistic regression, support vector machine, and linear discriminant analysis, as well as the voxel-wise method, on both synthetic and PPMI datasets. The fused FCP method demonstrated better accuracy in separating PD from controls than L1 and fused-L1 methods, and similar performance when compared with FCP method. In addition, the fused FCP method showed better ROI identification. CONCLUSIONS The fused FCP method can be an effective approach for MRI biomarker discovery in PD and other studies using high dimensionality data/low sample sizes.
Affiliation(s)
- Changcheng Li: Department of Statistics, Penn State University, University Park, PA, United States
- Xue Wang: Alibaba DAMO Academy, Seattle, WA, United States
- Guangwei Du: Department of Neurology, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Radiology, Penn State Hershey Medical Center, Hershey, PA, United States
- Hairong Chen: Department of Neurology, Penn State Hershey Medical Center, Hershey, PA, United States
- Gregory Brown: Department of Neurology, Penn State Hershey Medical Center, Hershey, PA, United States
- Mechelle M Lewis: Department of Neurology, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Pharmacology, Penn State Hershey Medical Center, Hershey, PA, United States
- Tao Yao: Alibaba DAMO Academy, Seattle, WA, United States
- Runze Li: Department of Statistics, Penn State University, University Park, PA, United States
- Xuemei Huang: Department of Neurology, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Pharmacology, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Radiology, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Neurosurgery, Penn State Hershey Medical Center, Hershey, PA, United States; Department of Kinesiology, Penn State Hershey Medical Center, Hershey, PA, United States
3. EKNN: Ensemble classifier incorporating connectivity and density into kNN with application to cancer diagnosis. Artif Intell Med 2020; 111:101985. [PMID: 33461685 DOI: 10.1016/j.artmed.2020.101985] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/14/2019] [Revised: 11/02/2020] [Accepted: 11/02/2020] [Indexed: 11/20/2022]
Abstract
In the microarray-based approach to automated cancer diagnosis, the traditional k-nearest neighbors (kNN) algorithm suffers from several difficulties: the large number of genes (a high-dimensional feature space), many of them irrelevant (noise), relative to the small number of available samples, and the imbalance in sample size between the target classes. This research provides an ensemble classifier based on decision models derived from kNN that is applicable to problems characterized by small, imbalanced datasets. The proposed classification method is an ensemble of the traditional kNN algorithm and four novel classification models derived from it. The proposed models exploit the increase in density and connectivity using a K1-nearest neighbors table (KNN-table) created during the training phase. In the density model, an unseen sample u is classified as belonging to a class t if adding u to that class produces the highest increase in density, i.e., the unseen sample can replace more neighbors in the KNN-table for samples of class t than for other classes. In the other three connectivity models, the mean and standard deviation of the distribution of the average, minimum, and maximum distance to the K neighbors of the members of each class are computed in the training phase. The class t whose distribution u most plausibly belongs to is chosen, i.e., the class for which adding u produces the least change to the distribution of the corresponding decision model. Combining the predicted results of the four individual models with traditional kNN makes the decision space more discriminative. With the help of the KNN-table, which can be updated online in the training phase, improved performance has been achieved compared to the traditional kNN algorithm, with a slight increase in classification time.
The proposed ensemble method achieves significant increase in accuracy compared to the accuracy achieved using any of its base classifiers on Kentridge, GDS3257, Notterman, Leukemia and CNS datasets. The method is also compared to several existing ensemble methods and state of the art techniques using different dimensionality reduction techniques on several standard datasets. The results prove clear superiority of EKNN over several individual and ensemble classifiers regardless of the choice of the gene selection strategy.
4. Kang D, Ahn H, Lee S, Lee CJ, Hur J, Jung W, Kim S. StressGenePred: a twin prediction model architecture for classifying the stress types of samples and discovering stress-related genes in arabidopsis. BMC Genomics 2019; 20:949. [PMID: 31856731 PMCID: PMC6923958 DOI: 10.1186/s12864-019-6283-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Background Recently, a number of studies have investigated how plants respond to stress at the cellular and molecular level by measuring gene expression profiles over time. As a result, a set of time-series gene expression data for the stress response is available in databases. With these data, an integrated analysis of multiple stresses is possible, which identifies stress-responsive genes with higher specificity because considering multiple stresses can capture the effect of interference between stresses. Analyzing such data requires a machine learning model. Results In this study, we developed StressGenePred, a neural network-based machine learning method, to integrate time-series transcriptome data of multiple stress types. StressGenePred is designed to detect single stress-specific biomarker genes by using a simple feature embedding method, a twin neural network model, and Confident Multiple Choice Learning (CMCL) loss. The twin neural network model consists of a biomarker gene discovery model and a stress type prediction model that share the same logical layer to reduce training complexity. The CMCL loss is used to make the twin model select biomarker genes that respond specifically to a single stress. In experiments using Arabidopsis gene expression data for four major environmental stresses (heat, cold, salt, and drought), StressGenePred classified the types of stress more accurately than the limma feature embedding method and the support vector machine and random forest classification methods. In addition, StressGenePred discovered known stress-related genes with higher specificity than the Fisher method. Conclusions StressGenePred is a machine learning method for identifying stress-related genes and predicting stress types for an integrated analysis of multiple stress time-series transcriptome data. This method can be applied to other phenotype-gene association studies.
Affiliation(s)
- Dongwon Kang: Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Hongryul Ahn: Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Sangseon Lee: Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
- Chai-Jin Lee: Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Jihye Hur: Department of Crop Science, Konkuk University, Seoul, Republic of Korea
- Woosuk Jung: Department of Crop Science, Konkuk University, Seoul, Republic of Korea
- Sun Kim: Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea; Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea
5. An elastic-net logistic regression approach to generate classifiers and gene signatures for types of immune cells and T helper cell subsets. BMC Bioinformatics 2019; 20:433. [PMID: 31438843 PMCID: PMC6704630 DOI: 10.1186/s12859-019-2994-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 05/09/2019] [Accepted: 07/15/2019] [Indexed: 02/08/2023] Open
Abstract
Background Host immune response is coordinated by a variety of specialized cell types that vary in time and location. While host immune response can be studied using conventional low-dimensional approaches, advances in transcriptomics analysis may provide a less biased view. Yet, leveraging transcriptomics data to identify immune cell subtypes presents challenges for extracting informative gene signatures hidden within a high-dimensional transcriptomics space characterized by low sample numbers with noisy and missing values. To address these challenges, we explore using machine learning methods to select gene subsets and estimate gene coefficients simultaneously. Results Elastic-net logistic regression, a type of machine learning, was used to construct separate classifiers for ten different types of immune cell and for five T helper cell subsets. The resulting classifiers were then used to develop gene signatures that best discriminate among immune cell types and T helper cell subsets using RNA-seq datasets. We validated the approach using single-cell RNA-seq (scRNA-seq) datasets, which gave consistent results. In addition, we classified cell types that were previously unannotated. Finally, we benchmarked the proposed gene signatures against other existing gene signatures. Conclusions The developed classifiers can be used as priors in predicting the extent and functional orientation of the host immune response in diseases, such as cancer, where transcriptomic profiling of bulk tissue samples and single cells is routinely employed; this information can provide insight into the mechanistic basis of disease and therapeutic response. The source code and documentation are available through GitHub: https://github.com/KlinkeLab/ImmClass2019.
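For orientation, the way an elastic-net penalty selects gene subsets and estimates coefficients at the same time can be seen from a commonly used form of the penalized logistic regression objective. This is the standard textbook parametrization for a binary classifier, not necessarily the exact formulation used in the article; the symbols $N$ (number of samples), $x_i$ (expression vector), $y_i \in \{0,1\}$ (class label), $\beta_0$, $\beta$, $\lambda$ and $\alpha$ are notation assumed here.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
+ \lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
  + \alpha\,\lVert\beta\rVert_1\Bigr]
```

With $\alpha = 1$ the penalty reduces to the lasso and with $\alpha = 0$ to ridge regression. For $0 < \alpha \le 1$ the $\ell_1$ term drives some coefficients exactly to zero, so the genes with nonzero $\beta_j$ form the selected signature.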
6.
Abstract
The automatic classification of DNA microarray data is one of the hot topics in the field of bioinformatics, since it is an effective tool for the diagnosis of diseases in patients. The aim of this chapter is to present the most relevant aspects related to the classification of microarrays. We carried out an analysis of the strategies used for the classification of microarray data and a review of the main methods used in the literature. In addition, other related aspects are addressed, such as dimensionality reduction, which tries to eliminate redundant information in genes, and the treatment of imbalanced and missing data. To conclude, we present an exhaustive review of the main scientific works in journals to show the most successful techniques applied in this discipline as well as the datasets most often used to verify their effectiveness.
7. Integration of 24 Feature Types to Accurately Detect and Predict Seizures Using Scalp EEG Signals. Sensors 2018; 18:s18051372. [PMID: 29710763 PMCID: PMC5982573 DOI: 10.3390/s18051372] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 04/04/2018] [Revised: 04/23/2018] [Accepted: 04/26/2018] [Indexed: 01/22/2023]
Abstract
The neurological disorder epilepsy causes substantial problems for patients, including uncontrolled seizures or even sudden death. Accurate detection and prediction of epileptic seizures will significantly improve the quality of life of epileptic patients. Various feature extraction algorithms have been proposed to describe EEG signals in the frequency or time domain. Both invasive intracranial and non-invasive scalp EEG signals have been screened for epileptic seizure patterns. This study extracted a comprehensive list of 24 feature types from scalp EEG signals and selected 170 of the 2794 features for accurate classification of epileptic seizures. An accuracy (Acc) of 99.40% was achieved for detecting epileptic seizures from the scalp EEG signals. A balanced accuracy (bAcc) was calculated as the average of sensitivity and specificity, and our seizure detection model achieved 99.61% in bAcc. The same experimental procedure was applied to predict epileptic seizures in advance, and the model achieved Acc = 99.17% for predicting epileptic seizures 10 s before onset.
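The two metrics named in this abstract can be illustrated with a minimal Python sketch. The function names and the toy labels are hypothetical and chosen only for illustration; they are not taken from the article. The sketch assumes binary labels with 1 for seizure and 0 for non-seizure, and that both classes occur in the data.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity, as defined in the abstract."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Toy, made-up labels: 2 positive windows, 8 negative windows, one positive missed.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

On imbalanced data such as this toy example, plain accuracy is dominated by the majority class, which is the usual reason for reporting balanced accuracy alongside it.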
8. Wang A, An N, Chen G, Liu L, Alterovitz G. Subtype dependent biomarker identification and tumor classification from gene expression profiles. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.01.025] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 10/18/2022]
9. Jain I, Jain VK, Jain R. Correlation feature selection based improved-Binary Particle Swarm Optimization for gene selection and cancer classification. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.09.038] [Citation(s) in RCA: 224] [Impact Index Per Article: 32.0] [Indexed: 02/02/2023]
10. Urda D, Luque-Baena RM, Franco L, Jerez JM, Sanchez-Marono N. Machine learning models to search relevant genetic signatures in clinical context. 2017 International Joint Conference on Neural Networks (IJCNN) 2017:1649-1656. [DOI: 10.1109/ijcnn.2017.7966049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/04/2025]
11. Banerjee S, Anura A, Chakrabarty J, Sengupta S, Chatterjee J. Identification and functional assessment of novel gene sets towards better understanding of dysplasia associated oral carcinogenesis. Gene Reports 2016. [DOI: 10.1016/j.genrep.2016.04.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/11/2022]
12. Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382 DOI: 10.1016/j.ymeth.2016.08.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 12.2] [Received: 05/15/2016] [Revised: 08/25/2016] [Accepted: 08/30/2016] [Indexed: 11/26/2022] Open
Abstract
This paper surveys main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures.
Affiliation(s)
- Lipo Wang: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
- Yaoli Wang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China
- Qing Chang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China
13. Lovato P, Bicego M, Kesa M, Jojic N, Murino V, Perina A. Traveling on discrete embeddings of gene expression. Artif Intell Med 2016; 70:1-11. [PMID: 27431033 DOI: 10.1016/j.artmed.2016.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/24/2015] [Revised: 05/20/2016] [Accepted: 05/21/2016] [Indexed: 12/24/2022]
Abstract
OBJECTIVE High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches could be extremely useful to distill information and derive compact interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach to extract an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG). METHOD Using the CG model, gene expression values are arranged on a discrete grid, learned in such a way that "similar" co-expression patterns are placed in close proximity, resulting in an intuitive visualization of the dataset. In addition, the model makes it possible to identify the genes that distinguish between classes (e.g. different types of cancer). Finally, each sample can be characterized with a discriminative signature, extracted from the model, that can be effectively employed for classification. RESULTS A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from a twofold perspective: numerically, we reached state-of-the-art classification accuracies on 5 datasets out of 7, and similar results when the approach is tested in a gene selection setting (with a stability always above 0.87); clinically, we confirmed that many of the genes highlighted by the model as significant also play a key role in cancer biology. CONCLUSION The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.
Affiliation(s)
- Pietro Lovato: Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
- Manuele Bicego: Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
- Maria Kesa: Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia
- Nebojsa Jojic: Microsoft Research, One Microsoft Way, 98052 Redmond, WA, USA
- Vittorio Murino: Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Via Morego 30, 16163 Genova, Italy
14. Chatterjee P, Pal NR. Construction of synergy networks from gene expression data related to disease. Gene 2016; 590:250-62. [PMID: 27222483 DOI: 10.1016/j.gene.2016.05.029] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 10/15/2015] [Revised: 03/11/2016] [Accepted: 05/17/2016] [Indexed: 02/07/2023]
Abstract
A few methods have been developed to determine whether genes collaborate with each other in relation to a particular disease using an information-theoretic measure of synergy. Here, we propose an alternative definition of synergy and argue that our definition improves upon the existing measures of synergy in the context of gene interactions. We apply this definition to a prostate cancer data set consisting of gene expression levels in both cancerous and non-cancerous samples and identify pairs of genes that are unable to discriminate between cancerous and non-cancerous samples individually but can do so jointly when their synergistic property is taken into account. We also propose a very simple yet effective technique for computing conditional entropy at a very low cost. The worst-case complexity of our method is O(n), while the best-case complexity of a state-of-the-art method is O(n^2). Furthermore, our method can also be extended to find synergistic relations among triplets or even larger numbers of genes. Finally, we validate our results by demonstrating that these findings cannot be due to pure chance and discuss the relevance of the synergistic pairs in cancer biology.
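The idea of a gene pair that is uninformative individually but informative jointly can be made concrete with a small Python sketch. It uses the widely cited existing measure, synergy = I(G1,G2;C) - I(G1;C) - I(G2;C), computed from empirical frequencies of discretized values. This is an assumption made for illustration: it is not the alternative definition proposed in the article, nor the authors' low-cost conditional entropy technique, and all function and variable names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def synergy(g1, g2, c):
    """Information the pair carries about c beyond the sum of the individual parts."""
    pair = list(zip(g1, g2))
    return (mutual_information(pair, c)
            - mutual_information(g1, c)
            - mutual_information(g2, c))


# Toy, made-up data: the class label is the XOR of two binarized gene levels,
# so neither gene alone says anything about the class, but the pair determines it.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
label = [0, 1, 1, 0]
print(mutual_information(gene_a, label), synergy(gene_a, gene_b, label))
```

In this XOR example each gene has zero mutual information with the label while the pair has one full bit, so the synergy is one bit. Real expression values would first need to be discretized, which is a separate design choice not covered here.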
Affiliation(s)
- Prantik Chatterjee: Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
- Nikhil Ranjan Pal: Electronics and Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
15. Yao T, Wang Q, Zhang W, Bian A, Zhang J. Identification of genes associated with renal cell carcinoma using gene expression profiling analysis. Oncol Lett 2016; 12:73-78. [PMID: 27347102 PMCID: PMC4906613 DOI: 10.3892/ol.2016.4573] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/29/2015] [Accepted: 04/22/2016] [Indexed: 02/06/2023] Open
Abstract
Renal cell carcinoma (RCC) is the most common type of kidney cancer in adults and accounts for ~80% of all kidney cancer cases. However, the pathogenesis of RCC has not yet been fully elucidated. To interpret the pathogenesis of RCC at the molecular level, gene expression data and bioinformatics methods were used to identify RCC-associated genes. Gene expression data were downloaded from the Gene Expression Omnibus (GEO) database, and differentially coexpressed genes (DCGs) and dysfunctional pathways were identified in RCC patients compared with controls. In addition, a regulatory network was constructed using the known regulatory data between transcription factors (TFs) and target genes in the University of California Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu), and the regulatory impact factor of each TF was calculated. A total of 2,580,427 pairs of DCGs were identified. The regulatory network contained 1,525 pairs of regulatory associations between 126 TFs and 1,259 target genes, and these genes were mainly enriched in cancer pathways, ErbB and MAPK. In the regulatory network, the 10 most strongly associated TFs were FOXC1, GATA3, ESR1, FOXL1, PATZ1, MYB, STAT5A, EGR2, EGR3 and PELP1. GATA3, ERG and MYB serve important roles in RCC, while FOXC1, ESR1, FOXL1, PATZ1, STAT5A and PELP1 may be potential genes associated with RCC. In conclusion, the present study constructed a regulatory network and screened out several TFs that may be used as molecular biomarkers of RCC. However, future studies are needed to confirm the findings of the present study.
Affiliation(s)
- Ting Yao: Physical Examination Center, Laiwu, Shandong 271100, P.R. China
- Qinfu Wang: Department of Chronic Non-Communicable Diseases Control and Prevention, Laiwu Center for Disease Control and Prevention, Laiwu, Shandong 271100, P.R. China
- Wenyong Zhang: Department of Health Education, Laiwu Center for Disease Control and Prevention, Laiwu, Shandong 271100, P.R. China
- Aihong Bian: Department of Health Inspection, Laiwu Center for Disease Control and Prevention, Laiwu, Shandong 271100, P.R. China
- Jinping Zhang: Department of Communicable Diseases Control and Prevention, Laiwu Center for Disease Control and Prevention, Laiwu, Shandong 271100, P.R. China
16. Liu H, Du G, Zhang L, Lewis MM, Wang X, Yao T, Li R, Huang X. Folded concave penalized learning in identifying multimodal MRI marker for Parkinson's disease. J Neurosci Methods 2016; 268:1-6. [PMID: 27102045 DOI: 10.1016/j.jneumeth.2016.04.016] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 11/23/2015] [Revised: 03/08/2016] [Accepted: 04/16/2016] [Indexed: 12/28/2022]
Abstract
BACKGROUND Brain MRI holds promise to gauge different aspects of Parkinson's disease (PD)-related pathological changes. Its analysis, however, is hindered by the high-dimensional nature of the data. NEW METHOD This study introduces folded concave penalized (FCP) sparse logistic regression to identify biomarkers for PD from a large number of potential factors. The proposed statistical procedures target the challenges of high dimensionality with a limited number of acquired samples. The maximization problem associated with the sparse logistic regression model is solved by local linear approximation. The proposed procedures are then applied to the empirical analysis of multimodal MRI data. RESULTS From 45 features, the proposed approach identified 15 MRI markers and the UPSIT, which are known to be clinically relevant to PD. By combining the MRI and clinical markers, we can substantially enhance the specificity and sensitivity of the model, as indicated by the ROC curves. COMPARISON TO EXISTING METHODS We compare the folded concave penalized learning scheme with both the Lasso penalized scheme and principal component analysis (PCA)-based feature selection in the Parkinson's biomarker identification problem, taking into account both the clinical features and MRI markers. The folded concave penalty method demonstrates substantially better clinical potential than both the Lasso and PCA in terms of specificity and sensitivity. CONCLUSIONS For the first time, we applied the FCP learning method to MRI biomarker discovery in PD. The proposed approach successfully identified MRI markers that are clinically relevant. Combining these biomarkers with clinical features can substantially enhance performance.
Affiliation(s)
- Hongcheng Liu: Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Guangwei Du: Department of Neurology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States
- Lijun Zhang: Department of Biochemistry and Molecular Biology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States
- Mechelle M Lewis: Department of Neurology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States; Department of Pharmacology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States
- Xue Wang: Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Tao Yao: Department of Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, United States
- Runze Li: Department of Statistics, The Pennsylvania State University, University Park, PA 16802, United States
- Xuemei Huang: Department of Neurology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States; Department of Pharmacology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States; Department of Radiology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States; Department of Neurosurgery, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States; Department of Kinesiology, Milton S. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, United States
17. Nguyen T, Khosravi A, Creighton D, Nahavandi S. Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification. PLoS One 2015; 10:e0120364. [PMID: 25823003 PMCID: PMC4378968 DOI: 10.1371/journal.pone.0120364] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 11/20/2014] [Accepted: 02/08/2015] [Indexed: 11/19/2022] Open
Abstract
This paper introduces a novel approach to gene selection based on a substantial modification of analytic hierarchy process (AHP). The modified AHP systematically integrates outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods including t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon and signal to noise ratio are employed to rank genes. These ranked genes are then considered as inputs for the modified AHP. Additionally, a method that uses fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. Genetic algorithm (GA) is incorporated in-between unsupervised and supervised training to optimize the number of fuzzy rules. The integration of GA enables FSAM to deal with the high-dimensional-low-sample nature of microarray data and thus enhance the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results demonstrate the performance dominance of the AHP-based gene selection against the single ranking methods. Furthermore, the combination of AHP-FSAM shows a great accuracy in microarray data classification compared to various competing classifiers. The proposed approach therefore is useful for medical practitioners and clinicians as a decision support system that can be implemented in the real medical practice.
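One of the five filter criteria named in this abstract, the signal-to-noise ratio, can be sketched in Python as follows. The abstract does not state the formula, so the sketch assumes the common definition (difference of class means divided by the sum of class standard deviations). The gene names, the toy expression values, and the function names are hypothetical, and the AHP aggregation step itself is not shown.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """SNR of one gene: (mean_a - mean_b) / (sd_a + sd_b), assumed definition."""
    denom = stdev(class_a) + stdev(class_b)
    if denom == 0:
        return 0.0  # assumption: treat a gene with no spread in either class as uninformative
    return (mean(class_a) - mean(class_b)) / denom


def rank_genes(expression):
    """Rank gene names by absolute SNR, most discriminative first.

    `expression` maps a gene name to a pair (values in class A, values in class B).
    """
    scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Toy, made-up expression values for two hypothetical genes.
expression = {
    "gene_1": ([2.0, 4.0, 6.0], [1.0, 1.0, 1.0]),  # separated class means
    "gene_2": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # identical in both classes
}
print(rank_genes(expression))
```

Each of the other filter methods would produce its own ranking in the same way, and those per-method rankings are what an aggregation scheme such as the modified AHP would take as input.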
Affiliation(s)
- Thanh Nguyen (corresponding author)
  - Centre for Intelligent Systems Research (CISR), Deakin University, Geelong Waurn Ponds Campus, Victoria, 3216, Australia
- Abbas Khosravi
  - Centre for Intelligent Systems Research (CISR), Deakin University, Geelong Waurn Ponds Campus, Victoria, 3216, Australia
- Douglas Creighton
  - Centre for Intelligent Systems Research (CISR), Deakin University, Geelong Waurn Ponds Campus, Victoria, 3216, Australia
- Saeid Nahavandi
  - Centre for Intelligent Systems Research (CISR), Deakin University, Geelong Waurn Ponds Campus, Victoria, 3216, Australia
|
18
|
Rathore S, Hussain M, Khan A. GECC: Gene Expression Based Ensemble Classification of Colon Samples. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:1131-1145. [PMID: 26357050 DOI: 10.1109/tcbb.2014.2344655] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
Gene expression deviates from its normal pattern when a patient has cancer. This variation can be used as an effective tool for detecting cancer. In this study, we propose a novel gene-expression-based colon classification scheme (GECC) that exploits variations in gene expression to classify colon gene samples into normal and malignant classes. GECC is novel in two complementary ways. First, to cope with the very large size of gene-based data sets, various feature extraction strategies, such as chi-square, F-Score, principal component analysis (PCA), and minimum redundancy and maximum relevancy (mRMR), are employed to select discriminative genes from the full set of genes. Second, a majority-voting ensemble of support vector machines (SVMs) is proposed to classify the gene-based samples. Individual SVM models have previously been used for colon classification, but their performance is limited. We therefore propose a new SVM-ensemble approach for gene-based colon classification in which the individual SVM models are built with different kernels: linear, polynomial, radial basis function (RBF), and sigmoid. The predictions of the individual models are combined through majority voting, which makes the combined decision space more discriminative. The proposed technique has been tested on four colon data sets and several other binary-class gene expression data sets, and it achieves improved performance compared with previously reported gene-based colon cancer detection techniques. Training and testing on a 208 × 5,851 data set took 591.01 s and 0.019 s, respectively.
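The majority-voting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-kernel predictions are hard-coded hypothetical labels rather than outputs of trained SVMs, and the tie-breaking rule (smallest label wins) is an assumption, since the abstract does not say how ties among four voters are resolved.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample by majority vote."""
    combined = []
    # zip(*...) regroups the labels so each tuple holds all votes for one sample.
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Assumed tie-break: pick the smallest label among those tied for first.
        combined.append(min(label for label, n in counts.items() if n == top))
    return combined


# Hypothetical predictions (0 = normal, 1 = malignant) from four SVM kernels.
kernel_predictions = {
    "linear":     [1, 0, 1, 1, 0],
    "polynomial": [1, 0, 0, 1, 0],
    "rbf":        [0, 0, 1, 1, 1],
    "sigmoid":    [1, 1, 1, 0, 0],
}

ensemble = majority_vote(list(kernel_predictions.values()))
print(ensemble)
```

Each sample here has a 3-to-1 split, so the ensemble label follows the three agreeing kernels. With an even number of voters a 2-to-2 split is possible, which is why an explicit tie-breaking rule is needed.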
|
19
|
Bermejo P, Gámez JA, Puerta JM. Speeding up incremental wrapper feature subset selection with Naive Bayes classifier. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2013.10.016] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.1] [Indexed: 10/26/2022]
|
20
|
Wu MY, Dai DQ, Zhang XF, Zhu Y. Cancer subtype discovery and biomarker identification via a new robust network clustering algorithm. PLoS One 2013; 8:e66256. [PMID: 23799085 PMCID: PMC3684607 DOI: 10.1371/journal.pone.0066256] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 02/08/2013] [Accepted: 05/02/2013] [Indexed: 11/29/2022]
Abstract
In cancer biology, it is very important to understand the phenotypic changes of patients and to discover new cancer subtypes. Recently, microarray-based technologies have shed light on this problem through gene expression profiles, which may contain outliers due to either chemical or electrical causes. These undiscovered subtypes may be heterogeneous with respect to underlying networks or pathways and may be related to only a few interdependent biomarkers. This motivates the need for robust gene-expression-based methods capable of discovering such subtypes, elucidating the corresponding network structures, and identifying cancer-related biomarkers. This study proposes a penalized model-based Student's t clustering with unconstrained covariance (PMT-UC) to discover cancer subtypes with cluster-specific networks, taking gene dependencies into account and providing robustness against outliers. Biomarker identification and network reconstruction are achieved by imposing an adaptive penalty on the means and the inverse scale matrices. The model is fitted via the expectation maximization algorithm using the graphical lasso. A network-based gene selection criterion is applied that identifies biomarkers not as individual genes but as subnetworks. This makes it possible to implicate weakly discriminative biomarkers that play a central role in a subnetwork by interconnecting many differentially expressed genes, or that have cluster-specific underlying network structures. Experimental results on simulated datasets and one available cancer dataset attest to the effectiveness and robustness of PMT-UC in cancer subtype discovery. Moreover, PMT-UC is able to select cancer-related biomarkers that have been verified in biochemical or biomedical research and to learn biologically significant correlations among genes.
Affiliation(s)
- Meng-Yun Wu
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Dao-Qing Dai (corresponding author)
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Xiao-Fei Zhang
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
- Yuan Zhu
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou, China
  - Department of Mathematics, Guangdong University of Business Studies, Guangzhou, China
|