1
Yu K, Huang M, Chen S, Feng C, Li W. GSEnet: feature extraction of gene expression data and its application to Leukemia classification. Mathematical Biosciences and Engineering: MBE 2022; 19:4881-4891. [PMID: 35430845 DOI: 10.3934/mbe.2022228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Gene expression data is highly dimensional. As disease-related genes account for only a tiny fraction, a deep learning model, namely GSEnet, is proposed to extract instructive features from gene expression data. This model consists of three modules, namely the pre-conv module, the SE-Resnet module, and the SE-conv module. Effectiveness of the proposed model on the performance improvement of 9 representative classifiers is evaluated. Seven evaluation metrics are used for this assessment on the GSE99095 dataset. Robustness and advantages of the proposed model compared with representative feature selection methods are also discussed. Results show superiority of the proposed model on the improvement of the classification precision and accuracy.
Affiliation(s)
- Kun Yu
  - College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning 110819, China
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
- Mingxu Huang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Shuaizheng Chen
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Chaolu Feng
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Shenyang, Liaoning 110819, China
  - School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110819, China
2
A Genetic Programming Strategy to Induce Logical Rules for Clinical Data Analysis. Processes (Basel) 2020. [DOI: 10.3390/pr8121565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
This paper proposes a machine learning approach that uses genetic programming to build classifiers through logical rule induction. In this context, we define and test a set of mutation operators across different clinical datasets to improve the performance of the proposal for each dataset. The use of genetic programming for rule induction has generated interesting results in machine learning problems. Hence, genetic programming represents a flexible and powerful evolutionary technique for automatic generation of classifiers. Since logical rules disclose knowledge from the analyzed data, we use such knowledge to interpret the results and filter the most important features from clinical data as a process of knowledge discovery. The ultimate goal of this proposal is to provide the experts in the data domain with prior knowledge (as a guide) about the structure of the data and the rules found for each class, especially to track dichotomies and inequality. The results reached by our proposal on the involved datasets have been very promising when used in classification tasks and compared with other methods.
4
Maniruzzaman M, Jahanur Rahman M, Ahammed B, Abedin MM, Suri HS, Biswas M, El-Baz A, Bangeas P, Tsoulfas G, Suri JS. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Computer Methods and Programs in Biomedicine 2019; 176:173-193. [PMID: 31200905 DOI: 10.1016/j.cmpb.2019.04.008] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 12/09/2018] [Revised: 02/28/2019] [Accepted: 04/08/2019] [Indexed: 02/08/2023]
Abstract
OBJECTIVE: Colon microarray data are a repository of thousands of gene expressions with different strengths for each cancer cell. It is necessary to detect which genes are responsible for cancer growth. This study presents an exhaustive comparative study of different machine learning (ML) systems, which serves two major purposes: (a) identification of high-risk differential genes using statistical tests and (b) development of an ML strategy for predicting cancer genes.
METHODS: Four statistical tests, namely Wilcoxon sign rank sum (WCSRS), t test, Kruskal-Wallis (KW), and F-test, were adapted for cancerous gene identification using their p-values. The extracted gene set was used to classify cancer patients using ten classifiers, namely linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naïve Bayes (NB), Gaussian process classification (GPC), support vector machine (SVM), artificial neural network (ANN), logistic regression (LR), decision tree (DT), Adaboost (AB), and random forest (RF). Performance was then evaluated using cross-validation protocols and standardized metrics, viz. accuracy (ACC) and area under the curve (AUC).
RESULTS: The colon cancer dataset consists of 2000 genes from 62 patients (40 cancer vs. 22 control). The overall mean ACC of our ML system using all four statistical tests and all ten classifiers was 90.50%. The ML system showed an ACC of 99.81% using a combination of the WCSRS test and an RF-based classifier. This is an improvement of 8% over previously published values in the literature.
CONCLUSIONS: The RF-based model with statistical tests for detection of high-risk genes showed the best performance for accurate cancer classification in multi-center clinical trials.
Affiliation(s)
- Md Maniruzzaman
  - Statistics Discipline, Khulna University, Khulna, Bangladesh
  - Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh
- Md Jahanur Rahman
  - Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh
- Benojir Ahammed
  - Statistics Discipline, Khulna University, Khulna, Bangladesh
- Mainak Biswas
  - Advanced Knowledge Engineering Centre, Global Biomedical Technologies, Inc., Roseville, CA, USA
- Ayman El-Baz
  - Department of Bioengineering, University of Louisville, Louisville, Kentucky, USA
- Petros Bangeas
  - Department of Surgery, Papageorgiou Hospital, Aristotle University Thessaloniki, Greece
- Georgios Tsoulfas
  - Department of Surgery, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Jasjit S Suri
  - Advanced Knowledge Engineering Centre, Global Biomedical Technologies, Inc., Roseville, CA, USA
  - AtheroPoint, Roseville, CA, USA
5
Wu P, Wang D. Classification of a DNA Microarray for Diagnosing Cancer Using a Complex Network Based Method. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:801-808. [PMID: 30183642 DOI: 10.1109/tcbb.2018.2868341] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/08/2023]
Abstract
Applications that classify DNA microarray expression data are helpful for diagnosing cancer. Many attempts have been made to analyze these data; however, new methods are needed to obtain better results. In this study, a Complex Network (CN) classifier was exploited to implement the classification task. An algorithm was used to initialize the structure, which allowed input variables to be selected over layered connections and different activation functions for different nodes. Then, a hybrid method integrating the Genetic Programming and Particle Swarm Optimization algorithms was used to identify an optimal structure with the parameters encoded in the classifier. The single CN classifier and an ensemble of CN classifiers were tested on four benchmark data sets. To ensure diversity of the ensemble classifiers, we constructed a base classifier using different feature sets, i.e., Pearson's correlation, Spearman's correlation, Euclidean distance, Cosine coefficient and the Fisher-ratio. The experimental results suggest that a single classifier can be used to obtain state-of-the-art results and that the ensemble yielded better results.
6
Veiga RV, Barbosa HJC, Bernardino HS, Freitas JM, Feitosa CA, Matos SMA, Alcântara-Neves NM, Barreto ML. Multiobjective grammar-based genetic programming applied to the study of asthma and allergy epidemiology. BMC Bioinformatics 2018; 19:245. [PMID: 29940834 PMCID: PMC6047363 DOI: 10.1186/s12859-018-2233-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 01/16/2018] [Accepted: 06/04/2018] [Indexed: 12/22/2022] Open
Abstract
Background: The prevalence of asthma and allergies has increased in recent decades, making them a serious global health problem. They are complex diseases with strong contextual influence, so advanced machine learning tools such as genetic programming could be important for understanding the causal mechanisms that explain those conditions. Here, we applied multiobjective grammar-based genetic programming (MGGP) to a dataset composed of 1047 subjects. The dataset contains information on the environmental, psychosocial, socioeconomic, nutritional and infectious factors collected from participating children. The objective of this work is to generate models that explain the occurrence of asthma and two markers of allergy: presence of IgE antibody against common allergens, and skin prick test positivity for common allergens (SPT).
Results: The average accuracy of the models for asthma was higher in MGGP than in C4.5. For IgE, accuracies were higher in MGGP than in both logistic regression and C4.5. MGGP had levels of accuracy similar to RF but, unlike RF, MGGP was able to generate models that were easy to interpret.
Conclusions: MGGP has shown that infections and psychosocial, nutritional, hygiene, and socioeconomic factors may be related in such an intricate way that they could hardly be detected using traditional regression-based epidemiological techniques. The MGGP algorithm was implemented in C++ and is available in the repository http://bitbucket.org/ciml-ufjf/ciml-lib.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2233-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Rafael V Veiga
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Muniz, Fundação Oswaldo Cruz, Salvador, Brazil
  - Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
- Helio J C Barbosa
  - Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
  - Laboratório Nacional de Computação Científica, Petrópolis, Rio de Janeiro, Brazil
- Heder S Bernardino
  - Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
- João M Freitas
  - Universidade Federal de Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil
- Caroline A Feitosa
  - Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Sheila M A Matos
  - Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Maurício L Barreto
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Muniz, Fundação Oswaldo Cruz, Salvador, Brazil
  - Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil
7
Tan MS, Tan JW, Chang SW, Yap HJ, Abdul Kareem S, Zain RB. A genetic programming approach to oral cancer prognosis. PeerJ 2016; 4:e2482. [PMID: 27688975 PMCID: PMC5036111 DOI: 10.7717/peerj.2482] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 06/11/2016] [Accepted: 08/24/2016] [Indexed: 11/20/2022] Open
Abstract
Background: The potential of genetic programming (GP) in various fields has been realized in recent years. In the biomedical field, much GP research has focused on the recognition of cancerous cells and on gene expression profiling data. In this research, the aim is to study the performance of GP on survival prediction for a small-sample oral cancer prognosis dataset; this is the first such study in the field of oral cancer prognosis.
Method: GP is applied to an oral cancer dataset that contains 31 cases collected from the Malaysia Oral Cancer Database and Tissue Bank System (MOCDTBS). The feature subsets that were automatically selected through GP were noted, and the influence of these subsets on the results of GP was recorded. In addition, the GP performance is compared with that of the Support Vector Machine (SVM) and logistic regression (LR) in order to verify the predictive capabilities of GP.
Result: GP performed best (average accuracy of 83.87% and average AUROC of 0.8341) when the features selected are smoking, drinking, chewing, histological differentiation of SCC, and oncogene p63. In addition, based on the comparison results, we found that GP outperformed the SVM and LR in oral cancer prognosis.
Discussion: Some of the features in the dataset are found to be statistically correlated, because the accuracy of the GP prediction drops when one of the features in the best feature subset is excluded. Thus, GP provides an automatic feature selection function, which chooses features that are highly correlated with the prognosis of oral cancer. This makes GP an ideal prediction model for cancer clinical and genomic data that can be used to aid physicians in the decision-making stage of diagnosis or prognosis.
Affiliation(s)
- Mei Sze Tan
  - Bioinformatics Program, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Jing Wei Tan
  - Bioinformatics Program, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Siow-Wee Chang
  - Bioinformatics Program, Institute of Biological Sciences, Faculty of Science, University of Malaya, Kuala Lumpur, Malaysia
- Hwa Jen Yap
  - Department of Mechanical Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur, Malaysia
- Sameem Abdul Kareem
  - Department of Artificial Intelligence, Faculty of Computer Science & Information Technology, University of Malaya, Kuala Lumpur, Malaysia
- Rosnah Binti Zain
  - Oral Cancer Research & Coordinating Centre (OCRCC), Faculty of Dentistry, University of Malaya, Kuala Lumpur, Malaysia
8
Sardana M, Agrawal R, Kaur B. A hybrid of clustering and quantum genetic algorithm for relevant genes selection for cancer microarray data. International Journal of Knowledge-Based and Intelligent Engineering Systems 2016. [DOI: 10.3233/kes-160341] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Affiliation(s)
- R.K. Agrawal
  - School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi, India
- Baljeet Kaur
  - Hansraj College, University of Delhi, Delhi, India
9
Genetic programming based ensemble system for microarray data classification. Computational and Mathematical Methods in Medicine 2015; 2015:193406. [PMID: 25810748 PMCID: PMC4355811 DOI: 10.1155/2015/193406] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 10/22/2014] [Revised: 01/01/2015] [Accepted: 01/19/2015] [Indexed: 11/18/2022]
Abstract
Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a genetic programming (GP) based new ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and they become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.
10
An analysis of accuracy-diversity trade-off for hybrid combined system with multiobjective predictor selection. Appl Intell 2014. [DOI: 10.1007/s10489-013-0507-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]
12
A comparison of machine learning techniques for survival prediction in breast cancer. BioData Min 2011; 4:12. [PMID: 21569330 PMCID: PMC3108919 DOI: 10.1186/1756-0381-4-12] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Received: 06/03/2010] [Accepted: 05/11/2011] [Indexed: 11/17/2022] Open
Abstract
Background: The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one of such signatures, the well established 70-gene signature.
Results: We show that Genetic Programming performs significantly better than Support Vector Machines, Multilayered Perceptrons and Random Forests in classifying patients from the NKI breast cancer dataset, and comparably to the scoring-based method originally proposed by the authors of the 70-gene signature. Furthermore, Genetic Programming is able to perform an automatic feature selection.
Conclusions: Since the performance of Genetic Programming is likely to be improvable compared to the out-of-the-box approach used here, and given the biological insight potentially provided by the Genetic Programming solutions, we conclude that Genetic Programming methods are worth further investigation as a tool for cancer patient classification based on gene expression data.
13
Benso A, Di Carlo S, Politano G. A cDNA microarray gene expression data classifier for clinical diagnostics based on graph theory. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:577-591. [PMID: 20855919 DOI: 10.1109/tcbb.2010.90] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/29/2023]
Abstract
Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarrays' data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performances, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.
Affiliation(s)
- Alfredo Benso
  - Control and Computer Engineering Department, Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129, Torino, Italy
14
Gustafsson MG, Wallman M, Wickenberg Bolin U, Göransson H, Fryknäs M, Andersson CR, Isaksson A. Improving Bayesian credibility intervals for classifier error rates using maximum entropy empirical priors. Artif Intell Med 2010; 49:93-104. [DOI: 10.1016/j.artmed.2010.02.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/18/2009] [Revised: 12/07/2009] [Accepted: 02/16/2010] [Indexed: 10/19/2022]
15
Espejo P, Ventura S, Herrera F. A Survey on the Application of Genetic Programming to Classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 2010. [DOI: 10.1109/tsmcc.2009.2033566] [Citation(s) in RCA: 379] [Impact Index Per Article: 27.1] [Indexed: 11/08/2022]
16
Liu KH, Li B, Wu QQ, Zhang J, Du JX, Liu GY. Microarray data classification based on ensemble independent component selection. Comput Biol Med 2009; 39:953-60. [PMID: 19716554 DOI: 10.1016/j.compbiomed.2009.07.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 02/19/2008] [Revised: 01/06/2009] [Accepted: 07/14/2009] [Indexed: 11/26/2022]
Abstract
Independent component analysis (ICA) has been widely deployed to the analysis of microarray datasets. Although it was pointed out that after ICA transformation, different independent components (ICs) are of different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, GA is applied to select a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined with majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray data sets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfying classification results when compared with several existing methods.
Affiliation(s)
- Kun-Hong Liu
  - Software School of Xiamen University, Xiamen, Fujian, 361005, China
17
Okun O, Priisalu H. Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artif Intell Med 2009; 45:151-62. [DOI: 10.1016/j.artmed.2008.08.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 11/12/2007] [Revised: 08/05/2008] [Accepted: 08/06/2008] [Indexed: 10/21/2022]
18
Liu KH, Xu CG. A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 2008; 25:331-7. [PMID: 19088122 DOI: 10.1093/bioinformatics/btn644] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Indexed: 12/20/2022]
Abstract
MOTIVATION: Feature selection approaches have been widely applied to deal with the small sample size problem in the analysis of micro-array datasets. For the multiclass problem, the proposed methods are based on the idea of selecting a gene subset to distinguish all classes. However, it will be more effective to solve a multiclass problem by splitting it into a set of two-class problems and solving each problem with a respective classification system.
RESULTS: We propose a genetic programming (GP)-based approach to analyze multiclass microarray datasets. Unlike the traditional GP, the individual proposed in this article consists of a set of small-scale ensembles, named as sub-ensemble (denoted by SE). Each SE consists of a set of trees. In application, a multiclass problem is divided into a set of two-class problems, each of which is tackled by a SE first. The SEs tackling the respective two-class problems are combined to construct a GP individual, so each individual can deal with a multiclass problem directly. Effective methods are proposed to solve the problems arising in the fusion of SEs, and a greedy algorithm is designed to keep high diversity in SEs. This GP is tested in five datasets. The results show that the proposed method effectively implements the feature selection and classification tasks.
Affiliation(s)
- Kun-Hong Liu
  - School of Software, Xiamen University, Xiamen, Fujian, 361005, China
20
Statistical data processing in clinical proteomics. J Chromatogr B Analyt Technol Biomed Life Sci 2008; 866:77-88. [DOI: 10.1016/j.jchromb.2007.10.042] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Received: 07/20/2007] [Revised: 10/17/2007] [Accepted: 10/18/2007] [Indexed: 01/12/2023]
21
Genetic Programming: An Introduction and Tutorial, with a Survey of Techniques and Applications. Studies in Computational Intelligence 2008. [DOI: 10.1007/978-3-540-78293-3_22] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Indexed: 01/09/2023]
22
An integrated algorithm for gene selection and classification applied to microarray data of ovarian cancer. Artif Intell Med 2008; 42:81-93. [DOI: 10.1016/j.artmed.2007.09.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Received: 01/29/2007] [Revised: 09/27/2007] [Accepted: 09/27/2007] [Indexed: 01/02/2023]
23
Chen Z, Li J, Wei L. A multiple kernel support vector machine scheme for feature selection and rule extraction from gene expression data of cancer tissue. Artif Intell Med 2007; 41:161-75. [PMID: 17851055 DOI: 10.1016/j.artmed.2007.07.008] [Citation(s) in RCA: 52] [Impact Index Per Article: 3.1] [Received: 11/30/2006] [Revised: 07/31/2007] [Accepted: 07/31/2007] [Indexed: 10/22/2022]
Abstract
OBJECTIVE: Recently, gene expression profiling using microarray techniques has been shown to be a promising tool for improving the diagnosis and treatment of cancer. Gene expression data contain a high level of noise and an overwhelming number of genes relative to the number of available samples, which poses a great challenge for machine learning and statistical techniques. Support vector machine (SVM) has been successfully used to classify gene expression data of cancer tissue. In the medical field, it is crucial to deliver a transparent decision process to the user. How to explain the computed solutions and present the extracted knowledge becomes a main obstacle for SVM.
MATERIAL AND METHODS: A multiple kernel support vector machine (MK-SVM) scheme, consisting of feature selection, rule extraction and prediction modeling, is proposed to improve the explanation capacity of SVM. In this scheme, we show that the feature selection problem can be translated into an ordinary multiple-parameter learning problem, and a shrinkage approach, 1-norm based linear programming, is proposed to obtain the sparse parameters and the corresponding selected features. We propose a novel rule extraction approach using the information provided by the separating hyperplane and support vectors to improve the generalization capacity and comprehensibility of rules and reduce the computational complexity.
RESULTS AND CONCLUSION: Two public gene expression datasets, the leukemia dataset and the colon tumor dataset, are used to demonstrate the performance of this approach. Using a small number of selected genes, MK-SVM achieves encouraging classification accuracy: more than 90% for both datasets. Moreover, very simple rules with linguistic labels are extracted. The rule sets have high diagnostic power because of their good classification performance.
Affiliation(s)
- Zhenyu Chen
  - Institute of Policy & Management, Chinese Academy of Sciences, Beijing 100080, China
25
Yu J, Yu J, Almal AA, Dhanasekaran SM, Ghosh D, Worzel WP, Chinnaiyan AM. Feature selection and molecular classification of cancer using genetic programming. Neoplasia 2007; 9:292-303. [PMID: 17460773 PMCID: PMC1854845 DOI: 10.1593/neo.07121] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.5] [Received: 01/10/2007] [Revised: 02/20/2007] [Accepted: 02/22/2007] [Indexed: 11/18/2022]
Abstract
Despite important advances in microarray-based molecular classification of tumors, its application in clinical settings remains formidable. This is in part due to the limitation of current analysis programs in discovering robust biomarkers and developing classifiers with a practical set of genes. Genetic programming (GP) is a type of machine learning technique that uses an evolutionary algorithm to simulate natural selection as well as population dynamics, hence leading to simple and comprehensible classifiers. Here we applied GP to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. Analysis of thousands of GP classifiers generated for a prostate cancer data set revealed repetitive use of a set of highly discriminative feature genes, many of which are known to be disease associated. GP classifiers often comprise five or fewer genes and successfully predict cancer types and subtypes. More importantly, GP classifiers generated in one study are able to predict samples from an independent study, which may have used different microarray platforms. In addition, GP yielded classification accuracy better than or similar to conventional classification methods. Furthermore, the mathematical expression of GP classifiers provides insights into relationships between classifier genes. Taken together, our results demonstrate that GP may be valuable for generating effective classifiers containing a practical set of genes for diagnostic/prognostic cancer classification.
Affiliation(s)
- Jianjun Yu
  - Bioinformatics Program, University of Michigan Medical School, Ann Arbor, MI 48109, USA