1
|
Gokhale M, Mohanty SK, Ojha A. A stacked autoencoder based gene selection and cancer classification framework. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
2
|
Mendonca-Neto R, Li Z, Fenyo D, Silva CT, Nakamura FG, Nakamura EF. A Gene Selection Method Based on Outliers for Breast Cancer Subtype Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2547-2559. [PMID: 34860652 DOI: 10.1109/tcbb.2021.3132339] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Breast cancer is the second most common cancer type and is the leading cause of cancer-related deaths worldwide. Since it is a heterogeneous disease, subtyping breast cancer plays an important role in selecting a specific treatment. Gene expression data are a viable alternative for cancer subtype classification, as they represent the state of a cell at the molecular level, but they generally have a relatively small number of samples compared to a large number of genes. Gene selection is a promising approach that addresses this uneven high-dimensional matrix of genes versus samples and plays an important role in the development of efficient cancer subtype classification. In this work, an innovative outlier-based gene selection (OGS) method is proposed to select relevant genes to efficiently and effectively classify breast cancer subtypes. Experiments show that our strategy achieves an F1 score of 1.0 for basal and 0.86 for HER2, the two subtypes with the worst prognoses. Compared to other methods, our proposed method achieves a higher F1 score using 80% fewer genes. In general, our method selects only a few highly relevant genes, speeding up the classification and significantly improving the classifier's performance.
|
3
|
Urban Sustainability: Integrating Socioeconomic and Environmental Data for Multi-Objective Assessment. SUSTAINABILITY 2022. [DOI: 10.3390/su14159142] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
The large concentration of the world’s population in cities, along with rapid urbanization, has brought numerous environmental and socioeconomic challenges to sustainable urban systems (SUS). However, current SUS studies focus heavily on ecological aspects, rely on SUS indicators that are not supported by available data, lack comprehensive analytical frameworks, and neglect SUS regional differences. This paper develops a novel approach to assessing urban sustainability from regional perspectives using commonly enumerated socioeconomic statistics. It integrates land use and land cover change data and ecosystem service values, applies data mining analytics to derive SUS indicators, and evaluates SUS states as trade-offs among relevant SUS indicators. This synthetic approach is called the integrated socioeconomic and land-use data mining–based multi-objective assessment (ISL-DM-MOA). The paper presents a case study of urban sustainability development in cities and counties in Inner Mongolia, China, which face many environmental and sustainable development problems. The case study identifies two SUS types: (1) several large cities that boast well-developed economies, diversified industrial sectors, vital transportation locations, good living conditions, and cleaner environments; and (2) a few small counties that have a small population, small urban construction areas, extensive natural grasslands, and primary grazing economies. The ISL-DM-MOA framework innovatively synthesizes currently available socioeconomic statistics and environmental data as a unified dataset to assess urban sustainability as a total socio-environmental system. ISL-DM-MOA deviates from the current indicator approach and advocates the notion of a data-mining-driven approach to derive urban sustainability dimensions. Furthermore, ISL-DM-MOA diverges from the concept of a composite score for determining urban sustainability. Instead, it promotes the concept of Pareto Front as a choice set of sustainability candidates, because sustainability varies among nations, regions, and locations and differs between political, economic, environmental, and cultural systems.
|
4
|
Mahfouz MA, Nepomuceno JA. Graph coloring for extracting discriminative genes in cancer data. Ann Hum Genet 2019; 83:141-159. [PMID: 30644085 DOI: 10.1111/ahg.12297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2018] [Revised: 10/12/2018] [Accepted: 11/15/2018] [Indexed: 11/29/2022]
Abstract
BACKGROUND AND OBJECTIVE The major difficulty of the analysis of the input gene expression data in a microarray-based approach for an automated diagnosis of cancer is the large number of genes (high dimensionality) with many irrelevant genes (noise) compared to the very small number of samples. This research study tackles the dimensionality reduction challenge in this area. METHODS This research study introduces a dimension-reduction technique termed graph coloring approach (GCA) for microarray data-based cancer classification based on analyzing the absolute correlation between gene-gene pairs and partitioning genes into several hubs using graph coloring. GCA starts by a gene-selection step in which top relevant genes are selected using a biserial correlation. Each time, a gene from an ordered list of top relevant genes is selected as the hub gene (representative) and redundant genes are added to its group; the process is repeated recursively for the remaining genes. A gene is considered redundant if its absolute correlation with the hub gene is greater than a controlling threshold. A suitable range for the threshold is estimated by computing a percentage graph for the absolute correlation between gene-gene pairs. Each value in the estimated range for the threshold can efficiently produce a new feature subset. RESULTS GCA achieved significant improvement over several existing techniques in terms of higher accuracy and a smaller number of features. Also, genes selected by this method are relevant genes according to the information stored in scientific repositories. CONCLUSIONS The proposed dimension-reduction technique can help biologists accurately predict cancer in several areas of the body.
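The hub-and-redundancy grouping described in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `pearson` and `group_by_hub`, the toy expression values, and the threshold of 0.9 are hypothetical, Pearson correlation is assumed for the gene-gene measure, and the genes are assumed to arrive already ordered by relevance (the biserial-correlation ranking and the threshold estimation are not shown).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def group_by_hub(ranked_genes, expr, threshold):
    """Greedy grouping: the top remaining gene becomes the hub, genes whose
    absolute correlation with it exceeds `threshold` join its group, and the
    process repeats on whatever is left."""
    remaining = list(ranked_genes)
    groups = {}
    while remaining:
        hub = remaining.pop(0)
        redundant = [g for g in remaining
                     if abs(pearson(expr[hub], expr[g])) > threshold]
        groups[hub] = redundant
        remaining = [g for g in remaining if g not in redundant]
    return groups

# Hypothetical toy data: four genes measured on four samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],      # nearly proportional to g1
    "g3": [4.0, 1.0, 3.0, 2.0],      # weakly related to g1
    "g4": [-1.0, -2.0, -3.1, -4.0],  # strongly anti-correlated with g1
}
groups = group_by_hub(["g1", "g2", "g3", "g4"], expr, threshold=0.9)
print(groups)
```

The hub genes (the dictionary keys) would then serve as the reduced feature subset; varying the threshold yields different subsets, as the abstract notes.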
Affiliation(s)
- Mohamed A Mahfouz
  - Department of Computer and Systems Engineering, Faculty of Engineering, Alexandria University, Alexandria, Egypt
- Juan A Nepomuceno
  - Departamento de Lenguajes y Sistemas Informáticos, Higher Technical School of Computer Engineering, University of Seville, Seville, Spain
|
5
|
Prediction of Periventricular Leukomalacia in Neonates after Cardiac Surgery Using Machine Learning Algorithms. J Med Syst 2018; 42:177. [DOI: 10.1007/s10916-018-1029-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/24/2018] [Accepted: 08/02/2018] [Indexed: 10/28/2022]
|
6
|
Ang JC, Mirzal A, Haron H, Hamed HNA. Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:971-989. [PMID: 26390495 DOI: 10.1109/tcbb.2015.2478454] [Citation(s) in RCA: 186] [Impact Index Per Article: 23.3] [Indexed: 05/04/2023]
Abstract
Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprise up to hundreds of thousands of features with a relatively small sample size. Because learning algorithms usually do not work well with this kind of data, the challenge of reducing the data dimensionality arises. A large number of gene selection methods are applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection and also reviews the state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. The comparison of experimental results on the top 5 representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with supervised feature selection.
|
7
|
Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382 DOI: 10.1016/j.ymeth.2016.08.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 13.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2016] [Revised: 08/25/2016] [Accepted: 08/30/2016] [Indexed: 11/26/2022] Open
Abstract
This paper surveys main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures.
Affiliation(s)
- Lipo Wang
  - School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
- Yaoli Wang
  - College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
- Qing Chang
  - College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
|
8
|
Yang J, Zhou J, Zhu Z, Ma X, Ji Z. Iterative ensemble feature selection for multiclass classification of imbalanced microarray data. J Biol Res (Thessalon) 2016; 23:13. [PMID: 27437198 PMCID: PMC4943507 DOI: 10.1186/s40709-016-0045-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND Microarray technology allows biologists to monitor expression levels of thousands of genes among various tumor tissues. Identifying relevant genes for sample classification of various tumor types is beneficial to clinical studies. One of the most widely used classification strategies for multiclass classification data is the One-Versus-All (OVA) schema, which divides the original problem into multiple binary classifications of one class against the rest. Nevertheless, multiclass microarray data tend to suffer from imbalanced class distribution between majority and minority classes, which inevitably deteriorates the performance of the OVA classification. RESULTS In this study, we propose a novel iterative ensemble feature selection (IEFS) framework for multiclass classification of imbalanced microarray data. In particular, filter feature selection and balanced sampling are performed iteratively and alternately to boost the performance of each binary classification in the OVA schema. The proposed framework is tested and compared with other representative state-of-the-art filter feature selection methods using six benchmark multiclass microarray data sets. The experimental results show that the IEFS framework provides superior or comparable performance to the other methods in terms of both classification accuracy and area under the receiver operating characteristic curve. The more classes the data have, the better the IEFS framework performs. CONCLUSIONS Balanced sampling and feature selection together work well in improving the performance of multiclass classification of imbalanced microarray data. The IEFS framework is readily applicable to other biological data analysis tasks facing the same problem.
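The One-Versus-All decomposition and the balanced sampling mentioned in the abstract above can be illustrated with a small sketch. It covers only those two ingredients, not the iterative filter feature selection or the ensemble itself, and it is not the authors' code: the helper names `one_versus_all` and `balanced_indices`, the toy labels, the use of random undersampling as the balancing strategy, and the fixed seed are all assumptions made for illustration.

```python
import random

def one_versus_all(labels, positive):
    """Binary relabelling: 1 for the chosen class, 0 for every other class."""
    return [1 if y == positive else 0 for y in labels]

def balanced_indices(binary, rng):
    """Randomly undersample the larger side so both sides have equal size."""
    pos = [i for i, y in enumerate(binary) if y == 1]
    neg = [i for i, y in enumerate(binary) if y == 0]
    k = min(len(pos), len(neg))
    return sorted(rng.sample(pos, k) + rng.sample(neg, k))

# Hypothetical imbalanced three-class label vector.
labels = ["A", "A", "A", "A", "A", "B", "B", "C"]
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

tasks = {}
for cls in sorted(set(labels)):
    binary = one_versus_all(labels, cls)
    tasks[cls] = (binary, balanced_indices(binary, rng))

for cls, (binary, idx) in tasks.items():
    print(cls, binary, idx)
```

Each entry in `tasks` is one binary problem with a class-balanced subset of sample indices; a filter feature selector and a classifier would then be fitted per problem.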
Affiliation(s)
- Junshan Yang
  - College of Engineering and Information, Shenzhen University, Shenzhen, People's Republic of China
- Jiarui Zhou
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, People's Republic of China
- Zexuan Zhu
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, People's Republic of China
- Xiaoliang Ma
  - College of Engineering and Information, Shenzhen University, Shenzhen, People's Republic of China
- Zhen Ji
  - College of Engineering and Information, Shenzhen University, Shenzhen, People's Republic of China
|
9
|
Chinnaswamy A, Srinivasan R. Hybrid Feature Selection Using Correlation Coefficient and Particle Swarm Optimization on Microarray Gene Expression Data. ADVANCES IN INTELLIGENT SYSTEMS AND COMPUTING 2016. [DOI: 10.1007/978-3-319-28031-8_20] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Indexed: 01/11/2023]
|
10
|
Camacho-Cáceres KI, Acevedo-Díaz JC, Pérez-Marty LM, Ortiz M, Irizarry J, Cabrera-Ríos M, Isaza CE. Multiple criteria optimization joint analyses of microarray experiments in lung cancer: from existing microarray data to new knowledge. Cancer Med 2015; 4:1884-900. [PMID: 26471143 PMCID: PMC4940807 DOI: 10.1002/cam4.540] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2015] [Revised: 07/30/2015] [Accepted: 07/14/2015] [Indexed: 12/14/2022] Open
Abstract
Microarrays can provide large amounts of data on relative gene expression in illnesses of interest, such as cancer, in a short time. These data, however, are stored and oftentimes abandoned when new experimental technologies arrive. This work reexamines lung cancer microarray data with a novel multiple criteria optimization-based strategy aiming to detect highly differentially expressed genes. This strategy does not require any adjustment of parameters by the user and is capable of handling multiple and incommensurate units across microarrays. In the analysis, groups of samples from patients with distinct smoking habits (never smoker, current smoker) and different genders are contrasted to elicit sets of highly differentially expressed genes, several of which are already associated with lung cancer and other types of cancer. The list of genes is provided with a discussion of their role in cancer, as well as the possible research directions for each of them.
Affiliation(s)
- Katia I Camacho-Cáceres
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Juan C Acevedo-Díaz
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Lynn M Pérez-Marty
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Michael Ortiz
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Juan Irizarry
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Mauricio Cabrera-Ríos
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
- Clara E Isaza
  - Bio IE Lab, The Applied Optimization Group, Industrial Engineering Department, University of Puerto Rico, Mayaguez, Puerto Rico
  - Public Health Program, Ponce Health Sciences University, Ponce, Puerto Rico
|
11
|
Nguyen T, Khosravi A, Creighton D, Nahavandi S. Hidden Markov models for cancer classification using gene expression profiles. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2015.04.012] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Indexed: 11/27/2022]
|
12
|
Saule C, Giegerich R. Pareto optimization in algebraic dynamic programming. Algorithms Mol Biol 2015; 10:22. [PMID: 26150892 PMCID: PMC4491898 DOI: 10.1186/s13015-015-0051-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2014] [Accepted: 05/07/2015] [Indexed: 11/10/2022] Open
Abstract
Pareto optimization combines independent objectives by computing the Pareto front of its search space, defined as the set of all solutions for which no other candidate solution scores better under all objectives. This gives, in a precise sense, better information than an artificial amalgamation of different scores into a single objective, but is more costly to compute. Pareto optimization naturally occurs with genetic algorithms, albeit in a heuristic fashion. Non-heuristic Pareto optimization so far has been used only with a few applications in bioinformatics. We study exact Pareto optimization for two objectives in a dynamic programming framework. We define a binary Pareto product operator [Formula: see text] on arbitrary scoring schemes. Independent of a particular algorithm, we prove that for two scoring schemes A and B used in dynamic programming, the scoring scheme [Formula: see text] correctly performs Pareto optimization over the same search space. We study different implementations of the Pareto operator with respect to their asymptotic and empirical efficiency. Without artificial amalgamation of objectives, and with no heuristics involved, Pareto optimization is faster than computing the same number of answers separately for each objective. For RNA structure prediction under the minimum free energy versus the maximum expected accuracy model, we show that the empirical size of the Pareto front remains within reasonable bounds. Pareto optimization lends itself to the comparative investigation of the behavior of two alternative scoring schemes for the same purpose. For the above scoring schemes, we observe that the Pareto front can be seen as a composition of a few macrostates, each consisting of several microstates that differ in the same limited way. We also study the relationship between abstract shape analysis and the Pareto front, and find that they extract information of a different nature from the folding space and can be meaningfully combined.
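As a concrete illustration of the Pareto front over two objectives, the following minimal sketch filters a list of candidate score pairs. It is a brute-force illustration, not the dynamic programming operator studied in the paper. The names `dominates` and `pareto_front` and the toy scores are hypothetical, both objectives are assumed to be maximized, and the common dominance definition is assumed (at least as good on every objective and strictly better on one), which is slightly stricter than the wording in the abstract.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (both objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical (objective A, objective B) scores for six candidate solutions.
scores = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
front = pareto_front(scores)
print(front)  # [(1, 5), (2, 4), (3, 3)]
```

The quadratic filter is fine for small candidate sets; the point of the paper is that an exact front can instead be computed inside the dynamic programming recurrences.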
|
13
|
Klammer M, Dybowski JN, Hoffmann D, Schaab C. Pareto Optimization Identifies Diverse Set of Phosphorylation Signatures Predicting Response to Treatment with Dasatinib. PLoS One 2015; 10:e0128542. [PMID: 26083411 PMCID: PMC4470654 DOI: 10.1371/journal.pone.0128542] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2015] [Accepted: 04/26/2015] [Indexed: 01/17/2023] Open
Abstract
Multivariate biomarkers that can predict the effectiveness of targeted therapy in individual patients are highly desired. Previous biomarker discovery studies have largely focused on the identification of single biomarker signatures, aimed at maximizing prediction accuracy. Here, we present a different approach that identifies multiple biomarkers by simultaneously optimizing their predictive power, number of features, and proximity to the drug target in a protein-protein interaction network. To this end, we incorporated NSGA-II, a fast and elitist multi-objective optimization algorithm that is based on the principle of Pareto optimality, into the biomarker discovery workflow. The method was applied to quantitative phosphoproteome data of 19 non-small cell lung cancer (NSCLC) cell lines from a previous biomarker study. The algorithm successfully identified a total of 77 candidate biomarker signatures predicting response to treatment with dasatinib. Through filtering and similarity clustering, this set was trimmed to four final biomarker signatures, which then were validated on an independent set of breast cancer cell lines. All four candidates reached the same good prediction accuracy (83%) as the originally published biomarker. Although the newly discovered signatures were diverse in their composition and in their size, the central protein of the originally published signature — integrin β4 (ITGB4) — was also present in all four Pareto signatures, confirming its pivotal role in predicting dasatinib response in NSCLC cell lines. In summary, the method presented here allows for a robust and simultaneous identification of multiple multivariate biomarkers that are optimized for prediction performance, size, and relevance.
Affiliation(s)
- Martin Klammer
  - Evotec (München) GmbH, Dept. of Bioinformatics, Am Klopferspitz 19a, 82152 Martinsried, Germany
- J. Nikolaj Dybowski
  - Evotec (München) GmbH, Dept. of Bioinformatics, Am Klopferspitz 19a, 82152 Martinsried, Germany
- Daniel Hoffmann
  - Center for Medical Biotechnology, University of Duisburg-Essen, Universitätsstrasse 1-4, 45141 Essen, Germany
- Christoph Schaab
  - Evotec (München) GmbH, Dept. of Bioinformatics, Am Klopferspitz 19a, 82152 Martinsried, Germany
  - Max-Planck Institute for Biochemistry, Am Klopferspitz 18, 82152 Martinsried, Germany
|
14
|
Chakraborty D, Maulik U. Identifying Cancer Biomarkers From Microarray Data Using Feature Selection and Semisupervised Learning. IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE-JTEHM 2014; 2:4300211. [PMID: 27170887 PMCID: PMC4848046 DOI: 10.1109/jtehm.2014.2375820] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 06/08/2014] [Revised: 09/20/2014] [Accepted: 11/22/2014] [Indexed: 11/07/2022]
Abstract
Microarrays have now gone from obscurity to being almost ubiquitous in biological research. At the same time, the statistical methodology for microarray analysis has progressed from simple visual assessments of results to novel algorithms for analyzing changes in expression profiles. In a micro-RNA (miRNA) or gene-expression profiling experiment, the expression levels of thousands of genes/miRNAs are simultaneously monitored to study the effects of certain treatments, diseases, and developmental stages on their expression. Microarray-based gene expression profiling can be used to identify genes whose expression changes in response to pathogens or other organisms by comparing gene expression in infected cells or tissues with that in uninfected ones. Recent studies have revealed that patterns of altered microarray expression profiles in cancer can serve as molecular biomarkers for tumor diagnosis, prognosis of disease-specific outcomes, and prediction of therapeutic responses. Microarray data sets containing expression profiles of a number of miRNAs or genes are used to identify biomarkers, which have dysregulation in normal and malignant tissues. However, small sample size remains a bottleneck in designing successful classification methods. On the other hand, an adequate number of microarray data that do not have clinical knowledge can be employed as an additional source of information. In this paper, a combination of kernelized fuzzy rough set (KFRS) and semisupervised support vector machine (S³VM) is proposed for predicting cancer biomarkers from one miRNA and three gene expression data sets. Biomarkers are discovered employing three feature selection methods, including KFRS. The effectiveness of the proposed KFRS and S³VM combination on the microarray data sets is demonstrated, and the cancer biomarkers identified from miRNA data are reported. Furthermore, biological significance tests are conducted for miRNA cancer biomarkers.
|
15
|
Gu JL, Lu Y, Liu C, Lu H. Multiclass classification of sarcomas using pathway based feature selection method. J Theor Biol 2014; 362:3-8. [DOI: 10.1016/j.jtbi.2014.06.038] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/01/2014] [Revised: 06/03/2014] [Accepted: 06/28/2014] [Indexed: 12/17/2022]
|
16
|
Rathore S, Hussain M, Khan A. GECC: Gene Expression Based Ensemble Classification of Colon Samples. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:1131-1145. [PMID: 26357050 DOI: 10.1109/tcbb.2014.2344655] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 06/05/2023]
Abstract
Gene expression deviates from its normal composition when a patient has cancer. This variation can be used as an effective tool to detect cancer. In this study, we propose a novel gene expression based colon classification scheme (GECC) that exploits the variations in gene expression to classify colon gene samples into normal and malignant classes. The novelty of GECC is twofold. First, to cope with the overwhelmingly large size of gene based data sets, various feature extraction strategies, such as chi-square, F-Score, principal component analysis (PCA), and minimum redundancy and maximum relevancy (mRMR), have been employed to select discriminative genes from a set of genes. Second, a majority voting based ensemble of support vector machines (SVMs) has been proposed to classify the given gene based samples. Previously, individual SVM models have been used for colon classification; however, their performance is limited. In this research study, we propose a new SVM-ensemble based approach for gene based classification of colon samples, wherein the individual SVM models are constructed through the learning of different SVM kernels, such as linear, polynomial, radial basis function (RBF), and sigmoid. The predicted results of individual models are combined through majority voting. In this way, the combined decision space becomes more discriminative. The proposed technique has been tested on four colon and several other binary-class gene expression data sets, and improved performance has been achieved compared to previously reported gene based colon cancer detection techniques. The computational time required for training and testing on the 208 × 5,851 data set was 591.01 and 0.019 s, respectively.
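The majority-voting step described in the abstract above can be sketched in a few lines. This is an illustration only, not the GECC implementation: the per-kernel prediction lists are made-up stand-ins for the outputs of four already trained SVM models, the function name `majority_vote` is hypothetical, and ties (possible with an even number of voters) are assumed to be resolved in favor of the label seen first.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by per-sample majority.
    On a tie, the label encountered first among the models wins (assumption)."""
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(model[i] for model in predictions_per_model)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Hypothetical predictions of four SVMs (one per kernel) on four samples.
linear_svm  = ["normal", "malignant", "malignant", "normal"]
poly_svm    = ["normal", "malignant", "normal",    "normal"]
rbf_svm     = ["malignant", "malignant", "malignant", "normal"]
sigmoid_svm = ["normal", "normal",    "malignant", "normal"]

ensemble = majority_vote([linear_svm, poly_svm, rbf_svm, sigmoid_svm])
print(ensemble)  # ['normal', 'malignant', 'malignant', 'normal']
```

Using an odd number of voters, or weighting votes by validation accuracy, would avoid relying on the tie-breaking assumption.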
|
17
|
Yang D, Parrish RS, Brock GN. Empirical evaluation of consistency and accuracy of methods to detect differentially expressed genes based on microarray data. Comput Biol Med 2013; 46:1-10. [PMID: 24529200 DOI: 10.1016/j.compbiomed.2013.12.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 06/03/2013] [Revised: 12/02/2013] [Accepted: 12/03/2013] [Indexed: 11/16/2022]
Abstract
BACKGROUND In this study, we empirically evaluated the consistency and accuracy of five different methods to detect differentially expressed genes (DEGs) based on microarray data. METHODS Five different methods were compared, including the t-test, significance analysis of microarrays (SAM), the empirical Bayes t-test (eBayes), t-tests relative to a threshold (TREAT), and assumption adequacy averaging (AAA). The percentage of overlapping genes (POG) and the percentage of overlapping genes related (POGR) scores were used to rank the different methods on their ability to maintain a consistent list of DEGs both within the same data set and across two different data sets concerning the same disease. The power of each method was evaluated based on a simulation approach which mimics the multivariate distribution of the original microarray data. RESULTS For smaller sample sizes (6 or less per group), moderated versions of the t-test (SAM, eBayes, and TREAT) were superior in terms of both power and consistency relative to the t-test and AAA, with TREAT having the highest consistency in each scenario. Differences in consistency were most pronounced for comparisons between two different data sets for the same disease. For larger sample sizes AAA had the highest power for detecting small effect sizes, while TREAT had the lowest. DISCUSSION For smaller sample sizes moderated versions of the t-test can generally be recommended, while for larger sample sizes selection of a method to detect DEGs may involve a compromise between consistency and power.
Affiliation(s)
- Dake Yang
  - Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40202, United States.
- Rudolph S Parrish
  - Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40202, United States.
- Guy N Brock
  - Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40202, United States.
|