1
Mendonca-Neto R, Li Z, Fenyo D, Silva CT, Nakamura FG, Nakamura EF. A Gene Selection Method Based on Outliers for Breast Cancer Subtype Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2547-2559. [PMID: 34860652 DOI: 10.1109/tcbb.2021.3132339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Breast cancer is the second most common cancer type and is the leading cause of cancer-related deaths worldwide. Since it is a heterogeneous disease, subtyping breast cancer plays an important role in choosing a specific treatment. Gene expression data are a viable basis for cancer subtype classification, as they represent the state of a cell at the molecular level, but such datasets generally have a small number of samples relative to a large number of genes. Gene selection is a promising approach that addresses this uneven, high-dimensional matrix of genes versus samples and plays an important role in the development of efficient cancer subtype classification. In this work, an innovative outlier-based gene selection (OGS) method is proposed to select relevant genes for efficiently and effectively classifying breast cancer subtypes. Experiments show that our strategy achieves an F1 score of 1.0 for basal and 0.86 for HER2, the two subtypes with the worst prognoses. Compared to other methods, our proposed method achieves a higher F1 score using 80% fewer genes. In general, our method selects only a few highly relevant genes, speeding up the classification and significantly improving the classifier's performance.
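For readers unfamiliar with the metric reported above, the per-class F1 score is the harmonic mean of precision and recall. The sketch below is illustrative only: the function name and the example counts are assumptions, not values taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one subtype: every sample predicted correctly.
perfect = f1_score(tp=10, fp=0, fn=0)
# Hypothetical counts with one false positive and one false negative.
imperfect = f1_score(tp=6, fp=1, fn=1)
```

Because F1 is computed per class, a subtype with few samples is not masked by good performance on the larger subtypes.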
2
Bose S, Das C, Banerjee A, Ghosh K, Chattopadhyay M, Chattopadhyay S, Barik A. An ensemble machine learning model based on multiple filtering and supervised attribute clustering algorithm for classifying cancer samples. PeerJ Comput Sci 2021; 7:e671. [PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/18/2020] [Accepted: 07/20/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Due to this capability, machine learning techniques are widely used in medical applications, especially where large-scale genomic and proteomic data are used. Cancer classification based on bio-molecular profiling data is a very important topic for medical applications since it improves the diagnostic accuracy of cancer and supports successful cancer treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis. METHODS In this article, a new ensemble machine learning classification model named Multiple Filtering and Supervised Attribute Clustering algorithm based Ensemble Classification model (MFSAC-EC) is proposed, which can handle the class imbalance problem and the high dimensionality of microarray datasets. This model first generates a number of bootstrapped datasets from the original training data, where an oversampling procedure is applied to handle the class imbalance problem. The proposed MFSAC method is then applied to each of these bootstrapped datasets to generate sub-datasets, each of which contains a subset of the most relevant/informative attributes of the original dataset. The MFSAC method is a feature selection technique combining multiple filters with a new supervised attribute clustering algorithm. Then, for every sub-dataset, a base classifier is constructed separately, and finally, the predictions of these base classifiers are combined using the majority voting technique, forming the MFSAC-based ensemble classifier. Also, a number of the most informative attributes are selected as important features based on their frequency of occurrence in these sub-datasets. RESULTS To assess the performance of the proposed MFSAC-EC model, it is applied to different high-dimensional microarray gene expression datasets for cancer sample classification. The proposed model is compared with well-known existing models to establish its effectiveness. From the experimental results, it has been found that the generalization performance/testing accuracy of the proposed classifier is significantly better than that of other well-known existing models. Apart from that, it has also been found that the proposed model can identify many important attributes/biomarker genes.
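The pipeline described in this abstract (bootstrap, oversample, select features, train a base classifier, majority vote) can be sketched in a few lines. This is a minimal illustration under loud assumptions: the mean-gap filter stands in for the MFSAC feature selection, a nearest-centroid rule stands in for the base classifier, and the toy data and all function names are hypothetical. It is not the authors' implementation.

```python
import random
from collections import Counter


def oversample(rows, rng):
    """Duplicate random minority-class rows until all classes are equally frequent."""
    by_class = {}
    for x, y in rows:
        by_class.setdefault(y, []).append((x, y))
    target = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out


def top_features(rows, k):
    """Stand-in filter (NOT MFSAC): rank features by the gap between two class means."""
    classes = sorted({y for _, y in rows})

    def class_mean(j, c):
        vals = [x[j] for x, y in rows if y == c]
        return sum(vals) / len(vals)

    n_feat = len(rows[0][0])
    gaps = [abs(class_mean(j, classes[0]) - class_mean(j, classes[1])) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -gaps[j])[:k]


def fit_centroids(rows, feats):
    """Stand-in base classifier: one centroid per class on the selected features."""
    cents = {}
    for c in {y for _, y in rows}:
        members = [x for x, y in rows if y == c]
        cents[c] = [sum(x[j] for x in members) / len(members) for j in feats]
    return cents


def train_ensemble(rows, n_models=5, k=1, seed=0):
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        sample = [rng.choice(rows) for _ in rows]  # bootstrap with replacement
        if len({y for _, y in sample}) < 2:
            continue  # redraw if a class is missing from the bootstrap
        sample = oversample(sample, rng)
        feats = top_features(sample, k)
        models.append((feats, fit_centroids(sample, feats)))
    return models


def ensemble_predict(models, x):
    """Majority vote over the base classifiers' predicted labels."""
    votes = []
    for feats, cents in models:
        votes.append(min(cents, key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(feats, cents[c]))))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
toy = [([0.0, 0.5], "normal"), ([0.2, 0.4], "normal"), ([0.1, 0.6], "normal"),
       ([0.3, 0.5], "normal"), ([2.0, 0.5], "tumor"), ([2.2, 0.45], "tumor")]
models = train_ensemble(toy)
```

Counting how often each feature index appears across `models` would mirror the abstract's frequency-based selection of important attributes.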
Affiliation(s)
- Shilpi Bose: Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Chandra Das: Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Abhik Banerjee: Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
- Kuntal Ghosh: Machine Intelligence Unit & Center for Soft Computing Research, Indian Statistical Institute, Kolkata, West Bengal, India
- Samiran Chattopadhyay: Department of Information Technology, Jadavpur University, Kolkata, West Bengal, India
- Aishwarya Barik: Department of Computer Science and Engineering, Netaji Subhash Engineering College, Kolkata, West Bengal, India
3
Das J, Barman Mandal S. Identification of Homo sapiens cancer classes based on fusion of hidden gene features. J Biomed Inform 2020; 110:103555. [PMID: 32916304 DOI: 10.1016/j.jbi.2020.103555] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2020] [Revised: 07/08/2020] [Accepted: 09/02/2020] [Indexed: 10/23/2022]
Abstract
Classification of Homo sapiens cancer genes at the molecular level is a challenging research issue, as they are extremely pseudo-random in nature. Signature gene features need to be exposed to distinctly identify the gene class. A tree-structured filter bank is chosen to perform feature extraction and dimension reduction of the genes. The extracted gene features are fused using a Gaussian mixture probability distribution function, and different cancer classes are identified depending on the amount of correlation and by exploiting the maximum likelihood function. The algorithm is tested on 161 sample gene data of 7 different cancer classes. Sensitivity, specificity, accuracy, precision and F-score are used as metrics to judge the performance of the system, and the ROC is plotted in comparison with an existing electrical network model based classifier. The proposed classifier can identify more than the stated number of cancer classes, which is a major limitation of the existing electrical network based method. The proposed algorithm is validated by comparing the results with seven other existing image processing based methods.
Affiliation(s)
- Joyshri Das: Institute of Radio Physics & Electronics, University of Calcutta, India.
4
A mapping study of ensemble classification methods in lung cancer decision support systems. Med Biol Eng Comput 2020; 58:2177-2193. [PMID: 32621068 DOI: 10.1007/s11517-020-02223-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/05/2020] [Accepted: 06/25/2020] [Indexed: 10/23/2022]
Abstract
Achieving a high level of classification accuracy in medical datasets is a capital need for researchers to provide effective decision systems to assist doctors in work. In many domains of artificial intelligence, ensemble classification methods are able to improve the performance of single classifiers. This paper reports the state of the art of ensemble classification methods in lung cancer detection. We have performed a systematic mapping study to identify the most interesting papers concerning this topic. A total of 65 papers published between 2000 and 2018 were selected after an automatic search in four digital libraries and a careful selection process. As a result, it was observed that diagnosis was the task most commonly studied; homogeneous ensembles and decision trees were the most frequently adopted for constructing ensembles; and the majority voting rule was the predominant combination rule. Few studies considered the parameter tuning of the techniques used. These findings open several perspectives for researchers to enhance lung cancer research by addressing the identified gaps, such as investigating different classification methods, proposing other heterogeneous ensemble methods, and using new combination rules. Graphical abstract Main features of the mapping study performed in ensemble classification methods applied on lung cancer decision support systems.
5
A concise peephole model based transfer learning method for small sample temporal feature-based data-driven quality analysis. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105665] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/24/2022]
6
Menaga D, Revathi S. AN EMPIRICAL STUDY OF CANCER CLASSIFICATION TECHNIQUES BASED ON THE NEURAL NETWORKS. BIOMEDICAL ENGINEERING: APPLICATIONS, BASIS AND COMMUNICATIONS 2020. [DOI: 10.4015/s1016237220500131] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/09/2022]
Abstract
Cancer is one of the most common and dreaded diseases worldwide, and patients with cancer can be saved only when the cancer is detected at a very early stage. Early detection is essential because, by the fourth stage, the chance of survival is limited. The symptoms of cancer are severe, and therefore all the symptoms should be studied properly before diagnosis. Thus, an automatic prediction system is necessary for classifying a tumor as malignant or benign. Over the past few years, work on cancer classification has increased rapidly, but there is no general technique to find novel cancer classes (class discovery) or to assign tumors to known classes. Accordingly, this survey analyzes distinct cancer classification techniques. This review article provides a detailed review of 50 research papers presenting cancer classification techniques, such as deep learning-based techniques, neural network-based techniques, and hybrid techniques. Moreover, an elaborate analysis and discussion are presented based on the year of publication, utilized datasets, accuracy range, evaluation metrics, implementation tool, and adopted classification methods. Finally, the research gaps and issues of various cancer classification schemes are presented to point researchers toward future work.
Affiliation(s)
- D. Menaga: B.S. Abdur Rahman Crescent Institute of Science and Technology, Seethakathi Estate G.S.T Main Road Vandalur, Chennai, Tamil Nadu 600048, India
- S. Revathi: B.S. Abdur Rahman Crescent Institute of Science and Technology, Seethakathi Estate G.S.T Main Road Vandalur, Chennai, Tamil Nadu 600048, India
7
Hosni M, Carrillo-de-Gea JM, Idri A, Fernandez-Aleman JL, Garcia-Berna JA. Using ensemble classification methods in lung cancer disease. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020; 2019:1367-1370. [PMID: 31946147 DOI: 10.1109/embc.2019.8857435] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/06/2022]
Abstract
This paper presents an overview of the use of ensemble classification methods in the lung cancer disease. An analysis is carried out according to seven aspects: publication trends, channels and venues; medical tasks tackled; ensemble types proposed; single techniques used to construct the ensemble methods; rules used to draw the output of the ensemble; datasets used to build and evaluate the ensemble methods; and tools used. The application of ensemble methods in lung cancer disease started in 2003. The diagnosis task was the most tackled one by researchers. Furthermore, the homogeneous ensembles were the most frequent in the literature, and decision tree techniques were the most adopted ones for constructing ensembles. Several datasets related to the lung cancer disease were used to build and assess the ensemble methods. The most used tool was Weka. To conclude, some recommendations for future research are: tackle the medical tasks not investigated in the literature by means of ensemble methods; investigate other classification methods; propose other heterogeneous ensemble methods; and use other combination rules.
8
Wang W, Xie G, Ren Z, Xie T, Li J. Gene Selection for the Discrimination of Colorectal Cancer. Curr Mol Med 2019; 20:415-428. [PMID: 31746296 DOI: 10.2174/1566524019666191119105209] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/20/2019] [Revised: 10/29/2019] [Accepted: 10/31/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND Colorectal cancer (CRC) is the third most common cancer worldwide. Cancer discrimination is a typical application of gene expression analysis using a microarray technique. However, microarray data suffer from the curse of dimensionality and a typically imbalanced class distribution between the majority (tumor samples) and minority (normal samples) classes. Feature gene selection is necessary and important for cancer discrimination. OBJECTIVES To select feature genes for the discrimination of CRC. METHODS We improve DEFSw, a feature selection algorithm based on differential evolution, by using the RUSBoost classifier and weighted accuracy instead of the common classifier and evaluation measure, for selecting feature genes from imbalanced data. We first extract differentially expressed genes (DEGs) from the CRC dataset of the TCGA and then select the feature genes from the DEGs using the improved DEFSw algorithm. Finally, we validate the selected feature gene sets using independent datasets and retrieve the cancer-related information for these genes through text mining in the Coremine Medical online database. RESULTS We select 16 single-gene feature sets for colorectal cancer discrimination and 19 single-gene feature sets only for colon cancer discrimination. CONCLUSIONS In summary, we find a series of high-potential candidate biomarkers or signatures, which can discriminate either or both of colon cancer and rectal cancer with high sensitivity and specificity.
Affiliation(s)
- Wenhui Wang: Network Information Center, The Sixth Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China; National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China; Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Guanglei Xie: Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Zhonglu Ren: College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Tingyan Xie: Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Jinming Li: Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
9
Hosni M, Abnane I, Idri A, Carrillo de Gea JM, Fernández Alemán JL. Reviewing ensemble classification methods in breast cancer. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 177:89-112. [PMID: 31319964 DOI: 10.1016/j.cmpb.2019.05.019] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 02/07/2019] [Revised: 05/16/2019] [Accepted: 05/18/2019] [Indexed: 05/09/2023]
Abstract
CONTEXT Ensemble methods consist of combining more than one single technique to solve the same task. This approach was designed to overcome the weaknesses of single techniques and consolidate their strengths. Ensemble methods are now widely used to carry out prediction tasks (e.g. classification and regression) in several fields, including that of bioinformatics. Researchers have particularly begun to employ ensemble techniques to improve research into breast cancer, as this is the most frequent type of cancer and accounts for most of the deaths among women. OBJECTIVE AND METHOD The goal of this study is to analyse the state of the art in ensemble classification methods when applied to breast cancer as regards 9 aspects: publication venues, medical tasks tackled, empirical and research types adopted, types of ensembles proposed, single techniques used to construct the ensembles, validation framework adopted to evaluate the proposed ensembles, tools used to build the ensembles, and optimization methods used for the single techniques. This paper was undertaken as a systematic mapping study. RESULTS A total of 193 papers that were published from the year 2000 onwards, were selected from four online databases: IEEE Xplore, ACM digital library, Scopus and PubMed. This study found that of the six medical tasks that exist, the diagnosis medical task was that most frequently researched, and that the experiment-based empirical type and evaluation-based research type were the most dominant approaches adopted in the selected studies. The homogeneous type was that most widely used to perform the classification task. With regard to single techniques, this mapping study found that decision trees, support vector machines and artificial neural networks were those most frequently adopted to build ensemble classifiers. 
In the case of the evaluation framework, the Wisconsin Breast Cancer dataset was the most frequently used by researchers to perform their experiments, while the most noticeable validation method was k-fold cross-validation. Several tools are available to perform experiments related to ensemble classification methods, such as Weka and R Software. Few researchers took into account the optimisation of the single technique of which their proposed ensemble was composed, while the grid search method was that most frequently adopted to tune the parameter settings of a single classifier. CONCLUSION This paper reports an in-depth study of the application of ensemble methods as regards breast cancer. Our results show that there are several gaps and issues and we, therefore, provide researchers in the field of breast cancer research with recommendations. Moreover, after analysing the papers found in this systematic mapping study, we discovered that the majority report positive results concerning the accuracy of ensemble classifiers when compared to the single classifiers. In order to aggregate the evidence reported in literature, it will, therefore, be necessary to perform a systematic literature review and meta-analysis in which an in-depth analysis could be conducted so as to confirm the superiority of ensemble classifiers over the classical techniques.
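The k-fold cross-validation named above as the most common validation method can be illustrated with a small index splitter. This is a generic sketch, not code from any of the reviewed studies; the function name is an assumption, and real experiments would normally shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if not (start <= i < start + size)]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
```

A grid search, as mentioned in the abstract, would repeat this loop once per candidate parameter setting and keep the setting with the best average test-fold score.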
Affiliation(s)
- Mohamed Hosni: Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ibtissam Abnane: Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Ali Idri: Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- Juan M Carrillo de Gea: Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
10
11
Li J, Dong W, Meng D. Grouped Gene Selection of Cancer via Adaptive Sparse Group Lasso Based on Conditional Mutual Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:2028-2038. [PMID: 29028206 DOI: 10.1109/tcbb.2017.2761871] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 06/07/2023]
Abstract
This paper deals with the problems of cancer classification and grouped gene selection. The weighted gene co-expression network on cancer microarray data is employed to identify modules corresponding to biological pathways, based on which a strategy of dividing genes into groups is presented. Using the conditional mutual information within each divided group, an integrated criterion is proposed and the data-driven weights are constructed. They are shown with the ability to evaluate both the individual gene significance and the influence to improve correlation of all the other pairwise genes in each group. Furthermore, an adaptive sparse group lasso is proposed, by which an improved blockwise descent algorithm is developed. The results on four cancer data sets demonstrate that the proposed adaptive sparse group lasso can effectively perform classification and grouped gene selection.
12
Fronto-parietal numerical networks in relation with early numeracy in young children. Brain Struct Funct 2018; 224:263-275. [PMID: 30315414 DOI: 10.1007/s00429-018-1774-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/29/2018] [Accepted: 10/05/2018] [Indexed: 10/28/2022]
Abstract
Early numeracy provides the foundation of acquiring mathematical skills that is essential for future academic success. This study examined numerical functional networks in relation to counting and number relational skills in preschoolers at 4 and 6 years of age. The counting and number relational skills were assessed using school readiness test (SRT). Resting-state fMRI (rs-fMRI) was acquired in 123 4-year-olds and 146 6-year-olds. Among them, 61 were scanned twice over the course of 2 years. Meta-analysis on existing task-based numeracy fMRI studies identified the left parietal-dominant network for both counting and number relational skills and the right parietal-dominant network only for number relational skills in adults. We showed that the fronto-parietal numerical networks, observed in adults, already exist in 4-year and 6-year-olds. The counting skills were associated with the bilateral fronto-parietal network in 4-year-olds and with the right parietal-dominant network in 6-year-olds. Moreover, the number relational skills were related to the bilateral fronto-parietal and right parietal-dominant networks in 4-year-olds and had a trend of the significant relationship with the right parietal-dominant network in 6-year-olds. Our findings suggested that neural fine-tuning of the fronto-parietal numerical networks may subserve the maturation of numeracy in early childhood.
13
Ye Q, Zhao H, Li Z, Yang X, Gao S, Yin T, Ye N. L1-Norm Distance Minimization-Based Fast Robust Twin Support Vector $k$-Plane Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4494-4503. [PMID: 28981431 DOI: 10.1109/tnnls.2017.2749428] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 06/07/2023]
Abstract
Twin support vector clustering (TWSVC) is a recently proposed powerful k-plane clustering method. It, however, is prone to outliers due to the utilization of squared L2-norm distance. Besides, TWSVC is computationally expensive, attributing to the need of solving a series of constrained quadratic programming problems (CQPPs) in learning each clustering plane. To address these problems, this brief first develops a new k-plane clustering method called L1-norm distance minimization-based robust TWSVC by using robust L1-norm distance. To achieve this objective, we propose a novel iterative algorithm. In each iteration of the algorithm, one CQPP is solved. To speed up the computation of TWSVC and simultaneously inherit the merit of robustness, we further propose Fast RTWSVC and design an effective iterative algorithm to optimize it. Only a system of linear equations needs to be computed in each iteration. These characteristics make our methods more powerful and efficient than TWSVC. We also conduct some insightful analysis on the existence of local minimum and the convergence of the proposed algorithms. Theoretical insights and effectiveness of our methods are further supported by promising experimental results.
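The abstract's claim that squared L2-norm distance is prone to outliers while L1-norm distance is robust can be seen in one dimension: the point minimizing the sum of squared distances is the mean, and the point minimizing the sum of absolute distances is the median. The sketch below shows only this general fact on hypothetical data; it is not the k-plane clustering algorithm from the paper.

```python
from statistics import mean, median

# Hypothetical 1-D sample: four points near 1.0 and a single outlier.
points = [1.0, 1.1, 0.9, 1.0, 50.0]

l2_center = mean(points)    # minimizer of the sum of squared distances
l1_center = median(points)  # minimizer of the sum of absolute distances


def l1_cost(c):
    return sum(abs(p - c) for p in points)


def l2_cost(c):
    return sum((p - c) ** 2 for p in points)
```

The single outlier drags the L2 center far from the bulk of the data, while the L1 center stays at 1.0.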
14
Piątek Ł, Grzymała-Busse JW. LEMRG: Decision Rule Generation Algorithm for Mining MicroRNA Expression Data. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2017; 1028:105-137. [DOI: 10.1007/978-981-10-6041-0_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/19/2022]
15
Random Subspace Aggregation for Cancer Prediction with Gene Expression Profiles. BIOMED RESEARCH INTERNATIONAL 2016; 2016:4596326. [PMID: 27999797 PMCID: PMC5143691 DOI: 10.1155/2016/4596326] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 07/03/2016] [Revised: 10/08/2016] [Accepted: 10/20/2016] [Indexed: 12/23/2022]
Abstract
Background. Precisely predicting cancer is crucial for cancer treatment. Gene expression profiles make it possible to analyze patterns between genes and cancers on the genome-wide scale. Gene expression data analysis, however, faces enormous challenges because of characteristics such as high dimensionality, small sample size, and low signal-to-noise ratio. Results. This paper proposes a method, termed RS_SVM, to predict cancer from gene expression profiles by aggregating SVMs trained on random subspaces. After choosing gene features through statistical analysis, RS_SVM randomly selects feature subsets to yield random subspaces, trains SVM classifiers accordingly, and then aggregates the SVM classifiers to capture the advantage of ensemble learning. Experiments on eight real gene expression datasets are performed to validate the RS_SVM method. Experimental results show that RS_SVM achieved better classification accuracy and generalization performance than a single SVM, K-nearest neighbor, decision tree, Bagging, AdaBoost, and the state-of-the-art methods. Experiments also explored the effect of subspace size on prediction performance. Conclusions. The proposed RS_SVM method yielded superior performance in analyzing gene expression profiles, which demonstrates that RS_SVM provides a good channel for such biological data.
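The random subspace idea in this abstract can be sketched as follows: each base learner sees only a random subset of the features, and the final label is a majority vote. To keep the example dependency-free, a 1-nearest-neighbour rule is swapped in for the SVM base classifier, so this is a sketch of the aggregation scheme only; the toy data and function names are hypothetical.

```python
import random
from collections import Counter


def random_subspaces(n_features, size, n_models, seed=0):
    """Draw n_models random feature-index subsets of the given size."""
    rng = random.Random(seed)
    return [rng.sample(range(n_features), size) for _ in range(n_models)]


def nearest_label(train, feats, x):
    """Stand-in base learner (1-NN, not an SVM) restricted to the features in feats."""
    def dist(row):
        return sum((row[0][j] - x[j]) ** 2 for j in feats)
    return min(train, key=dist)[1]


def predict(train, subspaces, x):
    """Aggregate the per-subspace predictions by majority vote."""
    votes = [nearest_label(train, feats, x) for feats in subspaces]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy profiles with four features each.
train = [([0.0, 0.1, 0.0, 0.2], "A"), ([0.1, 0.0, 0.2, 0.1], "A"),
         ([1.0, 0.9, 1.1, 1.0], "B"), ([0.9, 1.0, 1.0, 1.1], "B")]
subspaces = random_subspaces(n_features=4, size=2, n_models=7)
```

The `size` argument corresponds to the subspace size whose effect on prediction performance the abstract says was explored.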
16
Ang JC, Mirzal A, Haron H, Hamed HNA. Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:971-989. [PMID: 26390495 DOI: 10.1109/tcbb.2015.2478454] [Citation(s) in RCA: 185] [Impact Index Per Article: 23.1] [Indexed: 05/04/2023]
Abstract
Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprise up to hundreds of thousands of features with a relatively small sample size. Because learning algorithms usually do not work well with this kind of data, the challenge of reducing the data dimensionality arises. A large number of gene selection methods are applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection, and also reviews the state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. The comparison of experimental results on the top 5 representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with supervised feature selection.
17
Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382 DOI: 10.1016/j.ymeth.2016.08.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 13.8] [Received: 05/15/2016] [Revised: 08/25/2016] [Accepted: 08/30/2016] [Indexed: 11/26/2022]
Abstract
This paper surveys main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures.
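The contrast between exhaustive and heuristic search drawn in this abstract can be made concrete with a small example. The sketch below compares scoring every subset against greedy forward selection, one common heuristic; the scoring table is hypothetical and chosen so that the greedy search misses the best pair.

```python
from itertools import combinations


def exhaustive_search(n_features, k, score):
    """Score every size-k subset: optimal, but needs C(n, k) evaluations."""
    return max(combinations(range(n_features), k), key=score)


def forward_search(n_features, k, score):
    """Greedy heuristic: add the single best remaining feature at each step."""
    chosen = []
    for _ in range(k):
        rest = [j for j in range(n_features) if j not in chosen]
        best = max(rest, key=lambda j: score(tuple(chosen + [j])))
        chosen.append(best)
    return tuple(sorted(chosen))


# Hypothetical subset scores: feature 0 is best alone, but features 1 and 2
# are only strong when used together.
TOY_SCORES = {
    frozenset({0}): 0.6, frozenset({1}): 0.5, frozenset({2}): 0.5,
    frozenset({0, 1}): 0.7, frozenset({0, 2}): 0.7, frozenset({1, 2}): 0.9,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]


best_exhaustive = exhaustive_search(3, 2, toy_score)
best_greedy = forward_search(3, 2, toy_score)
```

With thousands of genes the exhaustive count grows combinatorially, which is why heuristic and hybrid searches dominate in practice despite cases like this one.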
Affiliation(s)
- Lipo Wang: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
- Yaoli Wang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
- Qing Chang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
18
Garro BA, Rodríguez K, Vázquez RA. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2015.10.002] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Indexed: 10/22/2022]
19
Yu Z, Li L, Liu J, Han G. Hybrid adaptive classifier ensemble. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:177-190. [PMID: 24860045 DOI: 10.1109/tcyb.2014.2322195] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Indexed: 06/03/2023]
Abstract
Traditional random subspace-based classifier ensemble approaches (RSCE) have several limitations, such as viewing the same importance for the base classifiers trained in different subspaces, not considering how to find the optimal random subspace set. In this paper, we design a general hybrid adaptive ensemble learning framework (HAEL), and apply it to address the limitations of RSCE. As compared with RSCE, HAEL consists of two adaptive processes, i.e., base classifier competition and classifier ensemble interaction, so as to adjust the weights of the base classifiers in each ensemble and to explore the optimal random subspace set simultaneously. The experiments on the real-world datasets from the KEEL dataset repository for the classification task and the cancer gene expression profiles show that: 1) HAEL works well on both the real-world KEEL datasets and the cancer gene expression profiles and 2) it outperforms most of the state-of-the-art classifier ensemble approaches on 28 out of 36 KEEL datasets and 6 out of 6 cancer datasets.
|
20
|
Majid A, Ali S, Iqbal M, Kausar N. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2014; 113:792-808. [PMID: 24472367 DOI: 10.1016/j.cmpb.2014.01.001] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 06/06/2013] [Revised: 12/29/2013] [Accepted: 01/03/2014] [Indexed: 06/03/2023]
Abstract
This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the samples of the minority class, thereby balancing the dataset. In the predictor stage, the machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. The MTD-SVM model provided the best values of accuracy, G-mean and Matthews correlation coefficient of 96.71%, 96.70% and 71.98% for the cancer/non-cancer dataset, the breast/non-breast cancer dataset and the colon/non-colon cancer dataset, respectively. We found that hybrid MTD-SVM is the best with respect to prediction performance and computational cost. The MTD-KNN model achieved moderately better prediction than the hybrid MTD-NB (Naïve Bayes) model but at a higher computing cost. The MTD-KNN model is faster than MTD-RF (random forest), but its prediction is not better than that of MTD-RF. To the best of our knowledge, the reported results are the best results, so far, for these datasets. The proposed scheme indicates that the developed models can be used as a tool for the prediction of cancer. This scheme may be useful for the study of any sequential information, such as protein sequences or nucleic acid sequences.
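Two of the metrics named in this abstract, G-mean and the Matthews correlation coefficient, are commonly used for imbalanced data because plain accuracy can look high even when the minority class is poorly recognized. A minimal sketch of their standard binary definitions follows; the function names and the example counts are illustrative and are not taken from the paper.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient from a 2x2 confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical imbalanced result: 10 positives, 990 negatives.
tp, fn, tn, fp = 8, 2, 950, 40
accuracy = (tp + tn) / (tp + fn + tn + fp)
print(f"accuracy={accuracy:.3f}")
print(f"g_mean={g_mean(tp, fn, tn, fp):.3f}")
print(f"mcc={matthews_corrcoef(tp, fn, tn, fp):.3f}")
```

Unlike accuracy, both metrics use all four cells of the confusion matrix, so errors on the minority class are not hidden by a large number of correctly classified majority samples.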
Affiliation(s)
- Abdul Majid
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Safdar Ali
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Mubashar Iqbal
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Nabeela Kausar
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| |
|
21
|
Recognition of multiple imbalanced cancer types based on DNA microarray data using ensemble classifiers. BIOMED RESEARCH INTERNATIONAL 2013; 2013:239628. [PMID: 24078908 PMCID: PMC3770038 DOI: 10.1155/2013/239628] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 04/07/2013] [Revised: 07/08/2013] [Accepted: 07/17/2013] [Indexed: 11/24/2022]
Abstract
DNA microarray technology can measure the activities of tens of thousands of genes simultaneously, which provides an efficient way to diagnose cancer at the molecular level. Although this strategy has attracted significant research attention, most studies neglect an important problem, namely, that most DNA microarray datasets are skewed, which causes traditional learning algorithms to produce inaccurate results. Some studies have considered this problem, yet they focus only on the binary-class problem. In this paper, we address the multiclass imbalanced classification problem, as encountered in cancer DNA microarray data, by using ensemble learning. We used a one-against-all coding strategy to transform the multiclass problem into multiple binary problems, each of which applies feature subspace, an evolving version of random subspace that generates multiple diverse training subsets. Next, we introduced one of two different correction techniques, namely, decision threshold adjustment or random undersampling, into each training subset to alleviate the damage of class imbalance. Specifically, a support vector machine was used as the base classifier, and a novel voting rule called counter voting was presented for making the final decision. Experimental results on eight skewed multiclass cancer microarray datasets indicate that, unlike many traditional classification approaches, our methods are insensitive to class imbalance.
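The first two steps described here, one-against-all coding followed by random undersampling, can be sketched as below. This is a minimal illustration under stated assumptions: the helper names and the toy labels are hypothetical, and the feature subspace step, the SVM base classifier and the counter voting rule are omitted because the abstract does not define them in enough detail.

```python
import random
from collections import Counter

def one_vs_rest_labels(labels, positive_class):
    """Recode a multiclass label list as 1 (target class) vs 0 (all others)."""
    return [1 if y == positive_class else 0 for y in labels]

def random_undersample(binary_labels, rng):
    """Return sorted sample indices with the larger class cut to the smaller one's size."""
    pos = [i for i, y in enumerate(binary_labels) if y == 1]
    neg = [i for i, y in enumerate(binary_labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)

# Hypothetical skewed three-class label vector.
labels = ["A"] * 10 + ["B"] * 3 + ["C"] * 2
rng = random.Random(0)

balanced = {}
for cls in sorted(set(labels)):
    binary = one_vs_rest_labels(labels, cls)
    idx = random_undersample(binary, rng)
    # Class counts in the balanced binary training subset for this class.
    balanced[cls] = Counter(binary[i] for i in idx)
```

Each entry of `balanced` then describes one binary training subset with equal numbers of positive and negative samples, on which a base classifier could be trained. Decision threshold adjustment, the alternative correction named in the abstract, would instead keep all samples and shift the classifier's decision boundary.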
|
22
|
Sarkar A, Maulik U. Cancer Gene Expression Data Analysis Using Rough Based Symmetrical Clustering. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Identification of cancer subtypes is the central goal of cancer gene expression data analysis. Modified symmetry-based clustering is an unsupervised learning technique for detecting symmetrical convex or non-convex shaped clusters. To enable fast automatic clustering of cancer tissues (samples), in this chapter, the authors propose a rough set based hybrid approach for the modified symmetry-based clustering algorithm. A natural basis for analyzing gene expression data using the symmetry-based algorithm is to group together genes with similar symmetrical patterns of microarray expressions. Rough-set theory helps in faster convergence and initial automatic optimal classification, thereby solving the problem of not knowing the number of clusters in gene expression measurement data in advance. For rough-set-theoretic decision rule generation, each cluster is classified using heuristically searched optimal reducts to overcome the overlapping cluster problem. The rough modified symmetry-based clustering algorithm is compared with another newly implemented rough-improved symmetry-based clustering algorithm and the existing K-Means algorithm over five benchmark cancer gene expression data sets to demonstrate its superiority in terms of validity. Statistical analyses are also performed to establish the significance of this rough modified symmetry-based clustering approach.
Affiliation(s)
- Anasua Sarkar
- Government College of Engineering and Leather Technology, India
| | | |
|
23
|
Wang N, Su L, Tang J, Ye A. Informative gene selection using the Algebraic Connectivity Strength of Point and Scoring Criteria. CHINESE SCIENCE BULLETIN-CHINESE 2013. [DOI: 10.1007/s11434-012-5421-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
24
|
Shao YH, Deng NY, Yang ZM, Chen WJ, Wang Z. Probabilistic outputs for twin support vector machines. Knowl Based Syst 2012. [DOI: 10.1016/j.knosys.2012.04.006] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Indexed: 10/28/2022]
|
25
|
Nanni L, Brahnam S, Lumini A. Combining multiple approaches for gene microarray classification. Bioinformatics 2012; 28:1151-7. [DOI: 10.1093/bioinformatics/bts108] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Indexed: 11/13/2022]
|
26
|
|
27
|
Ghorai S, Mukherjee A, Dutta PK. Discriminant Analysis for Fast Multiclass Data Classification Through Regularized Kernel Function Approximation. IEEE TRANSACTIONS ON NEURAL NETWORKS 2010; 21:1020-9. [DOI: 10.1109/tnn.2010.2046646] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]
|