1
Drozdz A, McInerney CE, Prise KM, Spence VJ, Sousa J. Signature Genes Selection and Functional Analysis of Astrocytoma Phenotypes: A Comparative Study. Cancers (Basel) 2024; 16:3263. [PMID: 39409884 PMCID: PMC11476064 DOI: 10.3390/cancers16193263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 09/06/2024] [Accepted: 09/12/2024] [Indexed: 10/20/2024]
Abstract
Novel cancer biomarker discoveries are driven by the application of omics technologies. The vast quantity of highly dimensional data necessitates the implementation of feature selection. The mathematical basis of different selection methods varies considerably, which may influence subsequent inferences. In this study, feature selection and classification methods were employed to identify six signature gene sets of grade 2 and 3 astrocytoma samples from the Rembrandt repository. Subsequently, the impact of these variables on classification and further discovery of biological patterns was analysed. Principal component analysis (PCA), uniform manifold approximation and projection (UMAP), and hierarchical clustering revealed that the data set (10,096 genes) exhibited a high degree of noise, feature redundancy, and lack of distinct patterns. The application of feature selection methods reduced the number of genes to between 28 and 128. Notably, no single gene was selected by all of the methods tested. Selection led to an increase in classification accuracy and noise reduction. Significant differences in the Gene Ontology terms were discovered, with only 13 terms overlapping. One selection method did not result in any enriched terms. KEGG pathway analysis revealed only one pathway in common (cell cycle), while two methods did not yield any enriched pathways. The results demonstrated a significant difference in outcomes when classification-type algorithms were utilised in comparison to mixed types (selection and classification). This may result in the inadvertent omission of biological phenomena, while simultaneously achieving enhanced classification outcomes.
Affiliation(s)
- Anna Drozdz
  - Sano—Centre for Computational Personalised Medicine-International Research Foundation, Czarnowiejska 36, 30-054 Kraków, Poland
- Caitriona E. McInerney
  - Patrick G. Johnson Centre for Cancer Research, Queen’s University Belfast, BT9 7AE Belfast, Ireland
- Kevin M. Prise
  - Patrick G. Johnson Centre for Cancer Research, Queen’s University Belfast, BT9 7AE Belfast, Ireland
- Veronica J. Spence
  - Patrick G. Johnson Centre for Cancer Research, Queen’s University Belfast, BT9 7AE Belfast, Ireland
- Jose Sousa
  - Sano—Centre for Computational Personalised Medicine-International Research Foundation, Czarnowiejska 36, 30-054 Kraków, Poland
2
Epstein E, Nallapareddy N, Ray S. On the Relationship between Feature Selection Metrics and Accuracy. ENTROPY (BASEL, SWITZERLAND) 2023; 25:1646. [PMID: 38136526 PMCID: PMC10742436 DOI: 10.3390/e25121646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 12/02/2023] [Accepted: 12/08/2023] [Indexed: 12/24/2023]
Abstract
Feature selection metrics are commonly used in the machine learning pipeline to rank and select features before creating a predictive model. While many different metrics have been proposed for feature selection, final models are often evaluated by accuracy. In this paper, we consider the relationship between common feature selection metrics and accuracy. In particular, we focus on misorderings: cases where a feature selection metric may rank features differently than accuracy would. We analytically investigate the frequency of misordering for a variety of feature selection metrics as a function of parameters that represent how a feature partitions the data. Our analysis reveals that different metrics have systematic differences in how likely they are to misorder features which can happen over a wide range of partition parameters. We then perform an empirical evaluation with different feature selection metrics on several real-world datasets to measure misordering. Our empirical results generally match our analytical results, illustrating that misordering features happens in practice and can provide some insight into the performance of feature selection metrics.
Affiliation(s)
- Soumya Ray
  - Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
3
Dang T, Fermin ASR, Machizawa MG. oFVSD: a Python package of optimized forward variable selection decoder for high-dimensional neuroimaging data. Front Neuroinform 2023; 17:1266713. [PMID: 37829329 PMCID: PMC10566623 DOI: 10.3389/fninf.2023.1266713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 09/08/2023] [Indexed: 10/14/2023]
Abstract
The complexity and high dimensionality of neuroimaging data pose problems for decoding information with machine learning (ML) models because the number of features is often much larger than the number of observations. Feature selection is one of the crucial steps for determining meaningful target features in decoding; however, optimizing the feature selection from such high-dimensional neuroimaging data has been challenging using conventional ML models. Here, we introduce an efficient and high-performance decoding package incorporating a forward variable selection (FVS) algorithm and hyper-parameter optimization that automatically identifies the best feature pairs for both classification and regression models, where a total of 18 ML models are implemented by default. First, the FVS algorithm evaluates the goodness-of-fit across different models using the k-fold cross-validation step that identifies the best subset of features based on a predefined criterion for each model. Next, the hyperparameters of each ML model are optimized at each forward iteration. Final outputs highlight an optimized number of selected features (brain regions of interest) for each model with its accuracy. Furthermore, the toolbox can be executed in a parallel environment for efficient computation on a typical personal computer. With the optimized forward variable selection decoder (oFVSD) pipeline, we verified the effectiveness of decoding sex classification and age range regression on 1,113 structural magnetic resonance imaging (MRI) datasets. 
Compared with ML models without the FVS algorithm, and with models using the Boruta algorithm as a variable selection counterpart, oFVSD performed significantly better across all of the ML models: relative to the models without FVS, it gave an increase of approximately 0.20 in the correlation coefficient, r, for regression models and of 8% for classification models on average; relative to the Boruta variable selection algorithm, it gave an improvement of approximately 0.07 in regression and 4% in classification models. Furthermore, we confirmed that the use of parallel computation considerably reduced the computational burden for the high-dimensional MRI data. Altogether, the oFVSD toolbox efficiently and effectively improves the performance of both classification and regression ML models, providing a use case example on MRI datasets. With its flexibility, oFVSD has the potential for many other modalities in neuroimaging. Being open-source and freely available, this Python package is a valuable toolbox for research communities seeking improved decoding accuracy.
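The forward variable selection loop described in this abstract can be sketched roughly as follows. This is a minimal illustration only, not the oFVSD implementation: the nearest-centroid classifier, the interleaved fold assignment, the stopping rule, and all function names are assumptions made for the example, and the per-iteration hyperparameter optimization mentioned in the abstract is omitted.

```python
from statistics import mean


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Accuracy of a nearest-centroid classifier restricted to `features`."""
    centroids = {}
    for label in sorted(set(y_train)):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [mean(r[f] for r in rows) for f in features]

    def predict(x):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((x[f] - m) ** 2 for f, m in zip(features, centroids[c])),
        )

    hits = sum(predict(x) == y for x, y in zip(X_test, y_test))
    return hits / len(y_test)


def cv_score(X, y, features, k=3):
    """Mean accuracy over k interleaved folds (sample i goes to fold i % k)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        scores.append(nearest_centroid_accuracy(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test],
            features,
        ))
    return mean(scores)


def forward_selection(X, y, k=3):
    """Greedily add the feature that most improves the CV score; stop when none does."""
    remaining = list(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        score, feature = max(
            ((cv_score(X, y, selected + [f], k), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if score <= best:  # no candidate improves on the current subset
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0], [1.0, 4.9], [1.1, 1.1], [0.9, 3.1]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y)
print(selected, score)
```

On the toy data the loop keeps only the informative feature, because adding the noisy one does not raise the cross-validated accuracy.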
Affiliation(s)
- Tung Dang
  - Center for Brain, Mind, and KANSEI Sciences Research, Hiroshima University, Hiroshima, Japan
  - Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
- Alan S. R. Fermin
  - Center for Brain, Mind, and KANSEI Sciences Research, Hiroshima University, Hiroshima, Japan
- Maro G. Machizawa
  - Center for Brain, Mind, and KANSEI Sciences Research, Hiroshima University, Hiroshima, Japan
4
Zhang F, Zhang R, Wei M, Li G. A machine learning based approach for quantitative evaluation of cell migration in Transwell assays based on deformation characteristics. Analyst 2023; 148:1371-1382. [PMID: 36857714 DOI: 10.1039/d2an01882a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2023]
Abstract
Many pathological and physiological processes, including embryonic development, immune response and cancer metastasis, involve cell migration, yet existing detection methods struggle to satisfy the requirements for rapid and quantitative evaluation and analysis. In view of the shortcomings in simultaneously quantifying the number of migrated cells and non-migrated cells using Transwell assays, we propose a novel approach for the evaluation of cell migration that distinguishes whether cells have migrated based on the regularity of the cell morphology changes. Traditionally, the status of living cells and dead cells is detected and analyzed by machine learning using some common morphological characteristics, e.g., area and perimeter of the cells. However, the accuracy of detecting whether cells have migrated or not using these common characteristics is not high, and the characteristics are not appropriate for our studies. Therefore, from the point of view of mechanism analysis for the migration behavior, we examined the regularity of different morphology changes of migrated cells and non-migrated cells, and thus discovered the distinguishable morphological characteristics. Two deformation characteristics, the deformation index and the taper index, are proposed, and a machine learning based algorithm that can identify migrated cells according to these deformation characteristics was devised. In addition, images of migrated cells and non-migrated cells were obtained from the Transwell assays. This algorithm was trained, and was able to successfully identify migrated cells with an accuracy of 84% using the proposed morphological characteristics. This method greatly improves the identification accuracy compared with identification using traditional characteristics, for which the accuracy was about 54.7%. 
This machine learning based method might be employed as a potential tool for cell counting and evaluation of cell migration with the aim of reducing time and improving automation compared with the traditional method. This method is effective and rapid, and incorporates advances in artificial intelligence which could be used to adapt current evaluation methods.
Affiliation(s)
- Fei Zhang
  - School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
- Rongbiao Zhang
  - School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
- Mingji Wei
  - School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
- Guoxiao Li
  - School of Information Engineering, Jiangsu Vocational College of Agriculture and Forestry, Jurong, Jiangsu 212400, China
5
Satake H, Osugi T, Shiraishi A. Impact of Machine Learning-Associated Research Strategies on the Identification of Peptide-Receptor Interactions in the Post-Omics Era. Neuroendocrinology 2023; 113:251-261. [PMID: 34348315 DOI: 10.1159/000518572] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/23/2021] [Accepted: 07/19/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND Elucidation of peptide-receptor pairs is a prerequisite for many studies in the neuroendocrine, endocrine, and neuroscience fields. Recent omics analyses have provided vast amounts of peptide and G protein-coupled receptor (GPCR) sequence data. GPCRs for homologous peptides are easily characterized based on homology searching, and the relevant peptide-GPCR interactions are also detected by typical signaling assays. In contrast, conventional evaluation or prediction methods, including high-throughput reverse-pharmacological assays and tertiary structure-based computational analyses, are not useful for identifying interactions between novel and omics-derived peptides and GPCRs. SUMMARY Recently, an approach combining machine learning-based prediction of novel peptide-GPCR pairs and experimental validation of the predicted pairs has been shown to break through this bottleneck. A machine learning method, logistic regression for human class A GPCRs, together with multiple subsequent signaling assays, led to the deorphanization of human class A orphan GPCRs, namely, the identification of 18 peptide-GPCR pairs. Furthermore, using another machine learning algorithm, the support vector machine (SVM), the peptide descriptor-incorporated SVM was originally developed and employed to predict GPCRs for novel peptides characterized from the closest relative of vertebrates, Ciona intestinalis Type A (Ciona robusta). Experimental validation of the predicted pairs eventually led to the identification of 11 novel peptide-GPCR pairs. Of particular interest is that these newly identified GPCRs displayed neither significant sequence similarity nor molecular phylogenetic relatedness to known GPCRs for peptides. KEY MESSAGES These recent studies highlight the usefulness and versatility of machine learning for enabling the efficient, reliable, and systematic identification of novel peptide-GPCR interactions.
Affiliation(s)
- Honoo Satake
  - Division of Integrative Biomolecular Function, Bioorganic Research Institute, Suntory Foundation for Life Sciences, Kyoto, Japan
- Tomohiro Osugi
  - Division of Integrative Biomolecular Function, Bioorganic Research Institute, Suntory Foundation for Life Sciences, Kyoto, Japan
- Akira Shiraishi
  - Division of Integrative Biomolecular Function, Bioorganic Research Institute, Suntory Foundation for Life Sciences, Kyoto, Japan
6
Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. These datasets are characterized by huge space due to a high number of features, out of which only a few are significant for analysis. Thus, significant feature extraction is crucial. There are various techniques available for feature selection; among them, the filter techniques are significant in this community, as they can be used with any type of learning algorithm and drastically lower the running time of optimization algorithms and improve the performance of the model. Furthermore, the application of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. Thus, to avoid these issues in this research, a combination of feature reduction (CFR) is considered designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced in different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used for ranking all reduction combinations and evaluating the superior filter combination among all.
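The TOPSIS ranking step mentioned at the end of this abstract can be illustrated with a small sketch. The criteria (accuracy and number of retained features), the equal weights, and the three example pipelines are hypothetical values chosen for illustration and are not taken from the paper; the function follows the standard TOPSIS recipe of vector normalization, weighting, and relative closeness to the ideal solution.

```python
from math import sqrt


def topsis(matrix, weights, benefit):
    """Return the TOPSIS closeness score (0 = worst, 1 = best) for each row.

    matrix  : one row per alternative, one column per criterion
    weights : importance of each criterion
    benefit : True if larger is better for that criterion, False if smaller is better
    Assumes no column is all zeros and the alternatives are not all identical.
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]

    # Ideal and anti-ideal points, taken column by column.
    cols = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]

    scores = []
    for row in weighted:
        d_pos = sqrt(sum((v - i) ** 2 for v, i in zip(row, ideal)))
        d_neg = sqrt(sum((v - a) ** 2 for v, a in zip(row, anti)))
        scores.append(d_neg / (d_pos + d_neg))
    return scores


# Hypothetical filter pipelines scored by [accuracy, number of features kept].
pipelines = [[0.90, 50], [0.95, 20], [0.80, 100]]
scores = topsis(pipelines, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking)
```

In this toy example the second pipeline is best on both criteria, so it coincides with the ideal point and ranks first.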
7
Application of Metaheuristic Approaches for the Variable Selection Problem. INTERNATIONAL JOURNAL OF APPLIED METAHEURISTIC COMPUTING 2022. [DOI: 10.4018/ijamc.298309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Variable selection is an old topic from regression models. Besides many conventional approaches, some metaheuristic approaches from the realm of optimization such as GA (Genetic Algorithm) or simulated annealing have been suggested to date. These methods have a considerable advantage to deal with many problems over the classical methods, but they must control relevant fine-tuning parameters associated with cross-over or mutation, which can be difficult and time-consuming. In this paper, Jaya, one of several parameter-free approaches will be suggested and explored. Several metaheuristic methods will be compared using results from a real-world dataset and a simulated dataset. The impact of using local search will be analyzed.
8
Feltes BC, Poloni JDF, Dorn M. Benchmarking and Testing Machine Learning Approaches with BARRA:CuRDa, a Curated RNA-Seq Database for Cancer Research. J Comput Biol 2021; 28:931-944. [PMID: 34264745 DOI: 10.1089/cmb.2020.0463] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/13/2022]
Abstract
RNA-seq is gradually becoming the dominant technique employed to assess global gene expression in biological samples, allowing more flexible protocols and robust analysis. However, the nature of RNA-seq results imposes new data-handling challenges when it comes to computational analysis. With the increasing employment of machine learning (ML) techniques in biomedical sciences, databases that could provide curated data sets treated with state-of-the-art approaches already adapted to ML protocols become essential for testing new algorithms. In this study, we present the Benchmarking of ARtificial intelligence Research: Curated RNA-seq Database (BARRA:CuRDa). BARRA:CuRDa was built exclusively for cancer research and is composed of 17 handpicked RNA-seq data sets for Homo sapiens that were gathered from the Gene Expression Omnibus, using rigorous filtering criteria. All data sets were individually submitted to sample quality analysis, removal of low-quality bases and artifacts from the experimental process, removal of ribosomal RNA, and estimation of transcript-level abundance. Moreover, all data sets were tested using standard approaches in the field, which allows them to be used as a benchmark for new ML approaches. A feature selection analysis was also performed on each data set to investigate the biological accuracy of basic techniques. Results include genes already related to their specific tumoral tissue and a large number of long noncoding RNAs and pseudogenes. BARRA:CuRDa is available at http://sbcb.inf.ufrgs.br/barracurda.
Affiliation(s)
- Bruno César Feltes
  - Institute of Informatics, Department of Theoretical Computer Science, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  - Institute of Biosciences, Department of Biophysics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
- Joice De Faria Poloni
  - Institute of Informatics, Department of Theoretical Computer Science, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  - EMBRAPA Agroenergy, Distrito Federal, Brasília, Brazil
- Márcio Dorn
  - Institute of Informatics, Department of Theoretical Computer Science, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  - Center of Biotechnology, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
  - National Institute of Science and Technology, Forensic Science, Porto Alegre, Brazil
9
Hameed SS, Hassan R, Hassan WH, Muhammadsharif FF, Latiff LA. HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets. PLoS One 2021; 16:e0246039. [PMID: 33507983 PMCID: PMC7842997 DOI: 10.1371/journal.pone.0246039] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/18/2020] [Accepted: 01/12/2021] [Indexed: 11/24/2022]
Abstract
The selection and classification of genes is essential for the identification of related genes to a specific disease. Developing a user-friendly application with combined statistical rigor and machine learning functionality to help the biomedical researchers and end users is of great importance. In this work, a novel stand-alone application, which is based on graphical user interface (GUI), is developed to perform the full functionality of gene selection and classification in high dimensional datasets. The so-called HDG-select application is validated on eleven high dimensional datasets of the format CSV and GEO soft. The proposed tool uses the efficient algorithm of combined filter-GBPSO-SVM and it was made freely available to users. It was found that the proposed HDG-select outperformed other tools reported in literature and presented a competitive performance, accessibility, and functionality.
Affiliation(s)
- Shilan S. Hameed
  - Computer Systems and Networks (CSN), Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
  - Directorate of Information Technology, Koya University, Koya, Kurdistan Region-F.R., Iraq
- Rohayanti Hassan
  - School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Johor Bahru, Johor, Malaysia
- Wan Haslina Hassan
  - Computer Systems and Networks (CSN), Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
- Fahmi F. Muhammadsharif
  - Department of Physics, Faculty of Science and Health, Koya University, Koya, Kurdistan Region-F.R., Iraq
- Liza Abdul Latiff
  - U-BAN Research Group, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
10
Zhong Y, Chalise P, He J. Nested cross-validation with ensemble feature selection and classification model for high-dimensional biological data. COMMUN STAT-SIMUL C 2020. [DOI: 10.1080/03610918.2020.1850790] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Yi Zhong
  - Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- Prabhakar Chalise
  - Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
- Jianghua He
  - Department of Biostatistics and Data Science, University of Kansas Medical Center, Kansas City, KS, USA
11
Tripto NI, Kabir M, Bayzid MS, Rahman A. Evaluation of classification and forecasting methods on time series gene expression data. PLoS One 2020; 15:e0241686. [PMID: 33156855 PMCID: PMC7647064 DOI: 10.1371/journal.pone.0241686] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/24/2020] [Accepted: 10/20/2020] [Indexed: 11/18/2022]
Abstract
Time series gene expression data is widely used to study different dynamic biological processes. Although gene expression datasets share many of the characteristics of time series data from other domains, most of the analyses in this field do not fully leverage the time-ordered nature of the data and focus on clustering the genes based on their expression values. Other domains, such as financial stock and weather prediction, utilize time series data for forecasting purposes. Moreover, many studies have been conducted to classify generic time series data based on trend, seasonality, and other patterns. Therefore, an assessment of these approaches on gene expression data would be of great interest to evaluate their adequacy in this domain. Here, we perform a comprehensive evaluation of different traditional unsupervised and supervised machine learning approaches as well as deep learning based techniques for time series gene expression classification and forecasting on five real datasets. In addition, we propose deep learning based methods for both classification and forecasting, and compare their performances with the state-of-the-art methods. We find that deep learning based methods generally outperform traditional approaches for time series classification. Experiments also suggest that supervised classification on gene expression is more effective than clustering when labels are available. In time series gene expression forecasting, we observe that an autoregressive statistical approach has the best performance for short term forecasting, whereas deep learning based methods are better suited for long term forecasting.
Affiliation(s)
- Nafis Irtiza Tripto
  - Department of Computer Science and Engineering, Bangladesh University of Engineering & Technology, Dhaka, Bangladesh
- Mohimenul Kabir
  - Department of Computer Science and Engineering, Bangladesh University of Engineering & Technology, Dhaka, Bangladesh
- Md. Shamsuzzoha Bayzid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering & Technology, Dhaka, Bangladesh
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering & Technology, Dhaka, Bangladesh
12
13
Pasupa K, Rathasamuth W, Tongsima S. Discovery of significant porcine SNPs for swine breed identification by a hybrid of information gain, genetic algorithm, and frequency feature selection technique. BMC Bioinformatics 2020; 21:216. [PMID: 32456608 PMCID: PMC7251909 DOI: 10.1186/s12859-020-3471-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/05/2019] [Accepted: 03/25/2020] [Indexed: 11/21/2022]
Abstract
Background The number of porcine Single Nucleotide Polymorphisms (SNPs) used in genetic association studies is very large, suitable for statistical testing. However, in the breed classification problem, one needs a much smaller set of porcine-classifying SNPs (PCSNPs) that can accurately classify pigs into different breeds. This study attempted to find such PCSNPs by using several combinations of feature selection and classification methods. We experimented with different combinations of feature selection methods, including information gain, conventional as well as modified genetic algorithms, and our developed frequency feature selection method, in combination with a common classification method, Support Vector Machine, to evaluate each method’s performance. Experiments were conducted on a comprehensive data set containing SNPs from native pigs from America, Europe, Africa, and Asia including Chinese breeds, Vietnamese breeds, and hybrid breeds from Thailand. Results The best combination of feature selection methods (a hybrid of information gain, modified genetic algorithm, and frequency feature selection) was able to reduce the number of possible PCSNPs to only 1.62% (164 PCSNPs) of the total number of SNPs (10,210 SNPs) while maintaining a high classification accuracy (95.12%). Moreover, the performance of this PCSNP set was nearly identical to that of bigger data sets and even of the entire data set. In addition, most PCSNPs were well-matched to a set of 94 genes in the PANTHER pathway, conforming to a suggestion by the Porcine Genomic Sequencing Initiative. Conclusions The best hybrid method provided a sufficiently small number of porcine SNPs that accurately classified swine breeds.
Affiliation(s)
- Kitsuchart Pasupa
  - Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand
- Wanthanee Rathasamuth
  - Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand
- Sissades Tongsima
  - National Biobank of Thailand, National Science and Technology Development Agency, Khong Luang, 12120, Thailand
14
Fu GH, Wu YJ, Zong MJ, Pan J. Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data. BMC Bioinformatics 2020; 21:121. [PMID: 32293252 PMCID: PMC7092448 DOI: 10.1186/s12859-020-3411-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/05/2019] [Accepted: 02/12/2020] [Indexed: 11/11/2022]
Abstract
Background Feature selection in class-imbalance learning has gained increasing attention in recent years due to the massive growth of high-dimensional class-imbalanced data across many scientific fields. In addition to reducing model complexity and discovering key biomarkers, feature selection is also an effective method of combating the class overlap which may arise in such data and become a crucial factor in determining classification performance. However, ordinary feature selection techniques for classification cannot simply be applied to class-imbalanced data without any adjustment. Thus, more efficient feature selection techniques must be developed for complicated class-imbalanced data, especially in the context of high dimensionality. Results We proposed an algorithm called sssHD to achieve stable sparse feature selection and applied it to complicated class-imbalanced data. sssHD is based on the Hellinger distance (HD) coupled with sparse regularization techniques. We stated that the Hellinger distance is not only class-insensitive but also translation-invariant. Simulation results indicate that the HD-based selection algorithm is effective in recognizing key features and controlling false discoveries for class-imbalance learning. Five gene expression datasets are also employed to test the performance of the sssHD algorithm, and a comparison with several existing selection procedures is performed. The results show that sssHD is highly competitive in terms of five assessment metrics. In addition, sssHD presents limited differences between performing and not performing re-balance preprocessing. Conclusions sssHD is a practical feature selection method for high-dimensional class-imbalanced data, which is simple and can be an alternative for performing feature selection in class-imbalanced data. sssHD can be easily extended by connecting it with different re-balance preprocessing, different sparse regularization structures as well as different classifiers. 
As such, the algorithm is extremely general and has a wide range of applicability.
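The class-insensitivity claim for the Hellinger distance can be illustrated with a short sketch. This is not the sssHD algorithm (the sparse regularization is omitted entirely); it only computes the standard Hellinger distance between two discrete class-conditional distributions of a single binned feature, and the helper names and toy samples are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    return sqrt(sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(p, q))) / sqrt(2)


def class_conditional(values, bins):
    """Relative frequency of each bin among the samples of one class."""
    counts = Counter(values)
    return [counts[b] / len(values) for b in bins]


bins = [0, 1, 2]
pos = [0, 0, 1, 2]   # binned feature values in the minority class (toy data)
neg = [1, 2, 2, 2]   # binned feature values in the majority class (toy data)

d_balanced = hellinger(class_conditional(pos, bins), class_conditional(neg, bins))
# Make the majority class 25 times larger without changing its distribution.
d_imbalanced = hellinger(class_conditional(pos, bins), class_conditional(neg * 25, bins))
print(d_balanced, d_imbalanced)
```

Because the distance depends only on the within-class distributions and not on the class sizes, inflating the majority class leaves the score of the feature unchanged.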
Affiliation(s)
- Guang-Hui Fu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China.
| | - Yuan-Jiao Wu
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Min-Jie Zong
- School of Science, Kunming University of Science and Technology, Kunming, 650500, People's Republic of China
| | - Jianxin Pan
- School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK
| |
|
15
|
Yang Q, Chen Q, Zhang M, Cai Y, Yang F, Zhang J, Deng G, Ye T, Deng Q, Li G, Zhang H, Yi Y, Huang RP, Chen X. Identification of eight-protein biosignature for diagnosis of tuberculosis. Thorax 2020; 75:576-583. [PMID: 32201389 PMCID: PMC7361018 DOI: 10.1136/thoraxjnl-2018-213021] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2018] [Revised: 02/18/2020] [Accepted: 02/23/2020] [Indexed: 02/07/2023]
Abstract
Background Biomarker-based tests for diagnosing TB currently rely on detecting Mycobacterium tuberculosis (Mtb) antigen-specific cellular responses. While this approach can detect Mtb infection, it is not efficient in diagnosing TB, especially for patients who lack aetiological evidence of the disease. Methods We prospectively enrolled three cohorts for our study for a total of 630 subjects, including 160 individuals to screen protein biomarkers of TB, 368 individuals to establish and test the predictive model and 102 individuals for biomarker validation. Whole blood cultures were stimulated with pooled Mtb-peptides or mitogen, and 640 proteins within the culture supernatant were analysed simultaneously using an antibody-based array. Sixteen candidate biomarkers of TB identified during screening were then developed into a custom multiplexed antibody array for biomarker validation. Results A two-round screening strategy identified eight-protein biomarkers of TB: I-TAC, I-309, MIG, Granulysin, FAP, MEP1B, Furin and LYVE-1. The sensitivity and specificity of the eight-protein biosignature in diagnosing TB were determined for the training (n=276), test (n=92) and prediction (n=102) cohorts. The training cohort had a 100% specificity (95% CI 98% to 100%) and 100% sensitivity (95% CI 96% to 100%) using a random forest algorithm approach by cross-validation. In the test cohort, the specificity and sensitivity were 83% (95% CI 71% to 91%) and 76% (95% CI 56% to 90%), respectively. In the prediction cohort, the specificity was 84% (95% CI 74% to 92%) and the sensitivity was 75% (95% CI 57% to 89%). Conclusions An eight-protein biosignature to diagnose TB in a high-burden TB clinical setting was identified.
Affiliation(s)
- Qianting Yang
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Qi Chen
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Mingxia Zhang
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Yi Cai
- Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Department of Pathogen Biology, Shenzhen University School of Medicine, Shenzhen, China
| | - Fan Yang
- Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Department of Pathogen Biology, Shenzhen University School of Medicine, Shenzhen, China
| | - Jieyun Zhang
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Guofang Deng
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Taosheng Ye
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Qunyi Deng
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Guobao Li
- National Clinical Research Center for Infectious Diseases, Guangdong Key Laboratory for Diagnosis & Treatment of Emerging Infectious Diseases, Shenzhen Third People's Hospital, Shenzhen, China
| | - Huihua Zhang
- South China Biochip Research Center, RayBiotech, Guangzhou, China.,Raybiotech Center, RayBiotech, Norcross, Georgia, USA
| | - Yuhua Yi
- South China Biochip Research Center, RayBiotech, Guangzhou, China.,Raybiotech Center, RayBiotech, Norcross, Georgia, USA
| | - Ruo-Pan Huang
- South China Biochip Research Center, RayBiotech, Guangzhou, China .,Raybiotech Center, RayBiotech, Norcross, Georgia, USA
| | - Xinchun Chen
- Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Department of Pathogen Biology, Shenzhen University School of Medicine, Shenzhen, China
| |
|
16
|
A Wrapper Feature Subset Selection Method Based on Randomized Search and Multilayer Structure. BIOMED RESEARCH INTERNATIONAL 2019; 2019:9864213. [PMID: 31828154 PMCID: PMC6885241 DOI: 10.1155/2019/9864213] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/08/2019] [Revised: 08/10/2019] [Accepted: 08/27/2019] [Indexed: 12/11/2022]
Abstract
The identification of discriminative features from information-rich data with the goal of clinical diagnosis is crucial in the field of biomedical science. In this context, many machine-learning techniques have been widely applied and achieved remarkable results. However, disease, especially cancer, is often caused by a group of features with complex interactions. Unlike traditional feature selection methods, which focus only on finding single discriminative features, a multilayer feature subset selection method (MLFSSM), which employs randomized search and a multilayer structure to select a discriminative subset, is proposed herein. In each layer of this method, many feature subsets are generated to assure the diversity of the combinations, and the weights of features are evaluated on the performances of the subsets. The weight of a feature increases if the feature is selected into more subsets with better performances compared with other features on the current layer. In this manner, the values of feature weights are revised layer by layer; the precision of feature weights is constantly improved; and better subsets are repeatedly constructed from the features with higher weights. Finally, the topmost feature subset of the last layer is returned. The experimental results based on five public gene datasets showed that the subsets selected by MLFSSM were more discriminative than the results of traditional feature selection methods including LVW (a feature subset method that uses the Las Vegas method as its randomized search strategy), GAANN (a feature subset selection method based on a genetic algorithm (GA)), and support vector machine recursive feature elimination (SVM-RFE). Furthermore, MLFSSM showed higher classification performance than some state-of-the-art methods that select feature pairs or groups, including top scoring pair (TSP), k-top scoring pairs (K-TSP), and relative simplicity-based direct classifier (RS-DC).
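The layer-by-layer weighting described above can be sketched roughly as follows. This is an illustrative reading of the abstract, not the published MLFSSM: the weight-update rule (reward features that appear in the top quarter of subsets), the parameter defaults, and the toy scoring function are all assumptions, and in practice `score` would be a cross-validated classifier accuracy.

```python
import random


def weighted_sample(rng, weights, k):
    """Draw k distinct feature indices with probability proportional to weight."""
    pool = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(pool, weights=[weights[i] for i in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return tuple(sorted(chosen))


def layered_subset_search(n_features, score, subset_size=3, n_subsets=200,
                          n_layers=4, seed=0):
    """Randomized subset search whose feature weights are revised per layer."""
    rng = random.Random(seed)
    weights = [1.0] * n_features
    best = None
    for _layer in range(n_layers):
        scored = []
        for _draw in range(n_subsets):
            subset = weighted_sample(rng, weights, subset_size)
            scored.append((score(subset), subset))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Assumed update rule: reward features found in the best quarter of subsets.
        top = scored[: max(1, n_subsets // 4)]
        for _, subset in top:
            for feature in subset:
                weights[feature] += 1.0 / len(top)
        best = scored[0][1]  # topmost subset of the current (finally, the last) layer
    return best, weights


# Hypothetical scoring function: fraction of a subset that is truly informative.
INFORMATIVE = {1, 4, 7}


def toy_score(subset):
    return len(INFORMATIVE.intersection(subset)) / len(subset)


best, weights = layered_subset_search(10, toy_score)
print(best, [round(w, 2) for w in weights])
```

Because later layers sample in proportion to the revised weights, informative features are drawn more often and good combinations become easier to find.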
|
17
|
Brankovic A, Hosseini M, Piroddi L. A Distributed Feature Selection Algorithm Based on Distance Correlation with an Application to Microarrays. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1802-1815. [PMID: 29993889 DOI: 10.1109/tcbb.2018.2833482] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
DNA microarray datasets are characterized by a large number of features with very few samples, which is a typical cause of overfitting and poor generalization in the classification task. Here, we introduce a novel feature selection (FS) approach which employs the distance correlation (dCor) as a criterion for evaluating the dependence of the class on a given feature subset. The dCor index provides a reliable dependence measure among random vectors of arbitrary dimension, without any assumption on their distribution. Moreover, it is sensitive to the presence of redundant terms. The proposed FS method is based on a probabilistic representation of the feature subset model, which is progressively refined by a repeated process of model extraction and evaluation. A key element of the approach is a distributed optimization scheme based on a vertical partitioning of the dataset, which alleviates the negative effects of its unbalanced dimensions. The proposed method has been tested on several microarray datasets, resulting in quite compact and accurate models obtained at a reasonable computational cost.
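As a rough illustration of the dependence measure named in this abstract, the sample distance correlation of two one-dimensional variables can be computed as below. The sketch covers only the dCor statistic for scalar samples; the probabilistic subset model and the distributed optimization scheme are not reproduced, and the function names and toy data are assumptions.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in dist]  # equals column means (symmetric)
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long 1-D samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    return 0.0 if denom == 0 else math.sqrt(max(0.0, dcov2 / denom))


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * v + 1 for v in x]   # perfectly linear dependence
quadratic = [v * v for v in x]    # Pearson correlation with x is zero
print(distance_correlation(x, linear), distance_correlation(x, quadratic))
```

The quadratic case shows why a distribution-free dependence measure is attractive for feature selection: the Pearson correlation is zero, yet the distance correlation is clearly positive.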
|
18
|
Chen M, Zhang Y, Li Z, Li A, Liu W, Liu L, Chen Z. A Novel Gene Selection Algorithm based on Sparse Representation and Minimum-redundancy Maximum-relevancy of Maximum Compatibility Center. CURR PROTEOMICS 2019. [DOI: 10.2174/1570164616666190123144020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Tumor classification is important for accurate diagnosis and personalized treatment and has recently received great attention. Analysis of gene expression profiles has shown relevant biological significance and has thus become a research hotspot and a new challenge for bio-data mining. Among existing methods, some algorithms identify few genes but have high time complexity, while others have low time complexity but unsatisfactory classification accuracy. This article proposes a new extraction method for gene expression profiles.
Methods:
In this paper, we propose a classification method for tumor subtypes based on the Minimum-Redundancy Maximum-Relevancy (MRMR) of maximum compatibility center. First, we performed a fuzzy clustering of gene expression profiles based on the compatibility relation. Next, we used the sparse representation coefficient to assess the importance of each gene for the category, extracted the top-ranked genes, and removed the uncorrelated genes. Finally, the MRMR search strategy was used to select the characteristic genes, reject the redundant genes, and obtain the final subset of characteristic genes.
Results:
Our method and four others were tested on four different datasets to verify its effectiveness. Results show that the classification accuracy and standard deviation of our method are better than those of other methods.
Conclusion:
Our proposed method is robust, adaptable, and superior in classification. This method can help us discover the susceptibility genes associated with complex diseases and understand the interaction between these genes. Our technique provides a new way of thinking and is important for understanding the pathogenesis of complex diseases and for disease prevention, diagnosis and treatment.
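The final MRMR search step can be sketched as a greedy loop that trades relevance against redundancy. This is a minimal sketch and not the authors' procedure: the fuzzy clustering and sparse representation steps are omitted, absolute Pearson correlation is swapped in for the mutual information that MRMR normally uses, and the gene names and values are hypothetical.

```python
import math
from statistics import mean


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr(features, labels, k):
    """Greedy selection maximizing relevance minus mean redundancy.

    features: dict mapping a gene name to its list of expression values.
    """
    relevance = {name: abs(pearson(vals, labels)) for name, vals in features.items()}
    selected = []
    remaining = sorted(features)
    while remaining and len(selected) < k:
        def criterion(name):
            if not selected:
                return relevance[name]
            redundancy = mean(abs(pearson(features[name], features[s]))
                              for s in selected)
            return relevance[name] - redundancy
        best = max(remaining, key=criterion)
        selected.append(best)
        remaining.remove(best)
    return selected


labels = [0, 0, 0, 1, 1, 1]
features = {
    "g1": [1, 2, 1, 5, 6, 5],  # highly relevant
    "g2": [1, 2, 2, 5, 6, 5],  # near copy of g1, hence redundant
    "g3": [2, 1, 2, 3, 2, 3],  # less relevant, but adds different information
}
print(mrmr(features, labels, 2))
```

A ranking by relevance alone would pick g1 and g2; the redundancy penalty makes the second pick g3 instead.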
Affiliation(s)
- Min Chen
- School of Computer Science and Technology, Hunan Institute of Technology, 421002 Hengyang, China
| | - Yi Zhang
- School of Information Science and Engineering, Guilin University of Technology, 541004 Guilin, China
| | - Zejun Li
- School of Computer Science and Technology, Hunan Institute of Technology, 421002 Hengyang, China
| | - Ang Li
- School of Computer Science and Technology, Hunan Institute of Technology, 421002 Hengyang, China
| | - Wenhua Liu
- School of Computer Science and Technology, Hunan Institute of Technology, 421002 Hengyang, China
| | - Liubin Liu
- Cloud Collaboration Technology Group, Cisco System Inc., 95035 Milpitas, CA, United States
| | - Zheng Chen
- School of Computer Science and Technology, Hunan Institute of Technology, 421002 Hengyang, China
| |
|
19
|
Momenzadeh M, Sehhati M, Rabbani H. A novel feature selection method for microarray data classification based on hidden Markov model. J Biomed Inform 2019; 95:103213. [PMID: 31128258 DOI: 10.1016/j.jbi.2019.103213] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2018] [Revised: 04/28/2019] [Accepted: 05/17/2019] [Indexed: 10/26/2022]
Abstract
In this paper, a novel approach is introduced for integrating multiple feature selection criteria by using a hidden Markov model (HMM). For this purpose, five feature selection ranking methods, namely Bhattacharyya distance, entropy, receiver operating characteristic curve, t-test, and Wilcoxon, are used in the proposed topology of the HMM. Here, we present a strategy for constructing, learning and inferring the HMM for gene selection, which led to higher performance in cancer classification. In this experiment, three publicly available microarray datasets (diffuse large B-cell lymphoma, leukemia and prostate cancer) were used for evaluation. Results demonstrated the higher performance of the proposed HMM-based gene selection compared with Markov chain rank aggregation and with individual feature selection criteria when applied to general classifiers. In conclusion, the proposed approach is a powerful procedure for combining different feature selection methods, which can be used for more robust classification in real-world applications.
Affiliation(s)
- Mohammadreza Momenzadeh
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
| | - Mohammadreza Sehhati
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran.
| | - Hossein Rabbani
- Department of Bioelectric and Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
| |
|
20
|
Machine learning technology in the application of genome analysis: A systematic review. Gene 2019; 705:149-156. [PMID: 31026571 DOI: 10.1016/j.gene.2019.04.062] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 04/17/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
Machine learning (ML) is a powerful technique to tackle many problems in data mining and predictive analytics. We believe that ML has considerable potential in the field of bioinformatics, since high-throughput technology is producing ever-increasing amounts of biological data. In this review, we summarize in detail the major ML algorithms and the conditions that require attention when applying these algorithms to genomic problems, and we provide a list of examples from different perspectives along with current data analysis challenges.
|
21
|
Abstract
Nowadays, in the digital era, the data generated by various applications are increasing drastically both row-wise and column-wise; this creates a bottleneck for analytics and also increases the burden on machine learning algorithms used for pattern recognition. This curse of dimensionality can be handled through reduction techniques. Dimensionality Reduction (DR) can be performed in two ways, namely Feature Selection (FS) and Feature Extraction (FE). This paper presents a survey of feature selection methods; from this extensive survey we conclude that most FS methods use static data. However, since the emergence of IoT and web-based applications, data are generated dynamically and grow at a fast rate, so they are likely to be noisy, which also hinders the performance of the algorithms. As the size of the data set increases, the scalability of FS methods is jeopardized, so the existing DR algorithms do not address the issues of dynamic data. Using FS methods not only reduces the burden of the data but also avoids overfitting of the model.
|
23
|
Clustering, Pathway Enrichment, and Protein-Protein Interaction Analysis of Gene Expression in Neurodevelopmental Disorders. Adv Pharmacol Sci 2018; 2018:3632159. [PMID: 30598663 PMCID: PMC6288580 DOI: 10.1155/2018/3632159] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2018] [Accepted: 10/30/2018] [Indexed: 12/21/2022] Open
Abstract
Neuronal developmental disorder is a class of diseases in which there is impairment of the central nervous system and brain function. The brain in its developmental phase undergoes tremendous changes depending upon the stage and environmental factors. Neurodevelopmental disorders include abnormalities associated with cognitive, speech, reading, writing, linguistic, communication, and growth disorders with lifetime effects. Computational methods provide great potential for betterment of research and insight into the molecular mechanism of diseases. In this study, we have used four samples of microarray neuronal developmental data: control, RV (resveratrol), NGF (nerve growth factor), and RV + NGF. By using computational methods, we have identified genes that are expressed in the early stage of neuronal development and also involved in neuronal diseases. We have used MeV application to cluster the raw data using distance metric Pearson correlation coefficient. Finally, 60 genes were selected on the basis of coexpression analysis. Further pathway analysis was done using the Metascape tool, and the biological process was studied using gene ontology database. A total of 13 genes AKT1, BAD, BAX, BCL2, BDNF, CASP3, CASP8, CASP9, MYC, PIK3CD, MAPK1, MAPK10, and CYCS were identified that are common in all clusters. These genes are involved in neuronal developmental disorders and cancers like colorectal cancer, apoptosis, tuberculosis, amyotrophic lateral sclerosis (ALS), neuron death, and prostate cancer pathway. A protein-protein interaction study was done to identify proteins that belong to the same pathway. These genes can be used to design potential inhibitors against neurological disorders at the early stage of neuronal development. The microarray samples discussed in this publication are part of the data deposited in NCBI's Gene Expression Omnibus (Yadav et al., 2018) and are accessible through GEO Series (accession number GSE121261).
|
24
|
Wu P, Wang D. Classification of a DNA Microarray for Diagnosing Cancer Using a Complex Network Based Method. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:801-808. [PMID: 30183642 DOI: 10.1109/tcbb.2018.2868341] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Applications that classify DNA microarray expression data are helpful for diagnosing cancer. Many attempts have been made to analyze these data; however, new methods are needed to obtain better results. In this study, a Complex Network (CN) classifier was exploited to implement the classification task. An algorithm was used to initialize the structure, which allowed input variables to be selected over layered connections and different activation functions to be used for different nodes. Then, a hybrid method integrating the Genetic Programming and Particle Swarm Optimization algorithms was used to identify an optimal structure with the parameters encoded in the classifier. The single CN classifier and an ensemble of CN classifiers were tested on four benchmark data sets. To ensure diversity of the ensemble classifiers, we constructed a base classifier using different feature sets, i.e., Pearson's correlation, Spearman's correlation, Euclidean distance, Cosine coefficient and the Fisher-ratio. The experimental results suggest that a single classifier can be used to obtain state-of-the-art results and that the ensemble yielded better results.
|
25
|
Bhowmick SS, Saha I, Bhattacharjee D, Genovese LM, Geraci F. Genome-wide analysis of NGS data to compile cancer-specific panels of miRNA biomarkers. PLoS One 2018; 13:e0200353. [PMID: 30048452 PMCID: PMC6061989 DOI: 10.1371/journal.pone.0200353] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2017] [Accepted: 06/25/2018] [Indexed: 12/22/2022] Open
Abstract
MicroRNAs are small non-coding RNAs that influence gene expression by binding to the 3’ UTR of target mRNAs in order to repress protein synthesis. Soon after their discovery, microRNA dysregulation was associated with several pathologies. In particular, microRNAs have often been reported as differentially expressed in healthy and tumor samples. This fact suggested that microRNAs are likely to be good candidate biomarkers for cancer diagnosis and personalized medicine. With the advent of Next-Generation Sequencing (NGS), measuring the expression level of the whole miRNAome at once is now routine. Moreover, the collaborative effort of sharing data opens up the possibility of population analyses. This context motivated us to perform an in-silico study to distill cancer-specific panels of microRNAs that can serve as biomarkers. We observed that the problem of finding biomarkers can be modeled as a two-class classification task where, given the miRNAomes of a population of healthy and cancerous samples, we want to find the subset of microRNAs that leads to the highest classification accuracy. We accomplish this task by leveraging a sensible combination of data mining tools. In particular, we used differential evolution for candidate selection, component analysis to preserve the relationships among miRNAs, and SVM for sample classification. We identified 10 cancer-specific panels whose classification accuracy is always higher than 92%. These panels have very little overlap, suggesting that miRNAs are not only predictive of the onset of cancer, but can be used for classification purposes as well. We experimentally validated the contribution of each of the employed tools to the selection of discriminating miRNAs. Moreover, we tested the significance of each panel for the corresponding cancer type. In particular, enrichment analysis showed that the selected miRNAs are involved in oncogenesis pathways, while survival analysis proved that miRNAs can be used to evaluate cancer severity. 
Summarizing: results demonstrated that our method is able to produce cancer-specific panels that are promising candidates for a subsequent in vitro validation.
Affiliation(s)
- Shib Sankar Bhowmick
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
- Department of Electronics & Communication Engineering, Heritage Institute of Technology, Kolkata, India
| | - Indrajit Saha
- Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training & Research, Kolkata, India
| | | | - Loredana M. Genovese
- Institute for Informatics and telematics, National Research Council, Pisa, Italy
| | - Filippo Geraci
- Institute for Informatics and telematics, National Research Council, Pisa, Italy
| |
|
26
|
Corrales DC, Lasso E, Ledezma A, Corrales JC. Feature selection for classification tasks: Expert knowledge or traditional methods? JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2018. [DOI: 10.3233/jifs-169470] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- David Camilo Corrales
- Universidad del Cauca, Grupo de Ingeniería Telemática, Campus Tulcán, Popayán, Colombia
- Universidad Carlos III de Madrid, Departamento de Ciencias de la Computación e Ingeniería, Avenida de la Universidad, Leganés, Spain
| | - Emmanuel Lasso
- Universidad del Cauca, Grupo de Ingeniería Telemática, Campus Tulcán, Popayán, Colombia
| | - Agapito Ledezma
- Universidad Carlos III de Madrid, Departamento de Ciencias de la Computación e Ingeniería, Avenida de la Universidad, Leganés, Spain
| | - Juan Carlos Corrales
- Universidad del Cauca, Grupo de Ingeniería Telemática, Campus Tulcán, Popayán, Colombia
| |
|
27
|
Optimal and Novel Hybrid Feature Selection Framework for Effective Data Classification. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/978-981-10-4762-6_48] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2023]
|
28
|
Hameed SS, Hassan R, Muhammad FF. Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm. PLoS One 2017; 12:e0187371. [PMID: 29095904 PMCID: PMC5667738 DOI: 10.1371/journal.pone.0187371] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2017] [Accepted: 10/18/2017] [Indexed: 11/30/2022] Open
Abstract
In this work, gene expression in autism spectrum disorder (ASD) is analyzed with the goal of selecting the most attributed genes and performing classification. The objective was achieved by utilizing a combination of various statistical filters and a wrapper-based geometric binary particle swarm optimization-support vector machine (GBPSO-SVM) algorithm. The utilization of different filters was accentuated by incorporating a mean and median ratio criterion to remove very similar genes. The results showed that the most discriminative genes that were identified in the first and last selection steps included the presence of a repetitive gene (CAPS2), which was assigned as the gene most highly related to ASD risk. The merged gene subset that was selected by the GBPSO-SVM algorithm was able to enhance the classification accuracy.
Affiliation(s)
- Shilan S. Hameed
- Department of Computer Science, Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
- Department of Software and Informatics Engineering, College of Engineering, Salahaddin University, Erbil, Kurdistan Region, Iraq
| | - Rohayanti Hassan
- Department of Software Engineering, Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
| | - Fahmi F. Muhammad
- Department of Physics, Faculty of Science & Health, Koya University, Koya, Kurdistan Region, Iraq
| |
|
29
|
Martina F, Beccuti M, Balbo G, Cordero F. Peculiar Genes Selection: A new features selection method to improve classification performances in imbalanced data sets. PLoS One 2017; 12:e0177475. [PMID: 28806759 PMCID: PMC5555681 DOI: 10.1371/journal.pone.0177475] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2017] [Accepted: 04/27/2017] [Indexed: 11/18/2022] Open
Abstract
High-throughput technologies provide genomic and transcriptomic data that are suitable for biomarker detection for classification purposes. However, the high dimension of the output of such technologies and the characteristics of the data sets analysed represent an issue for the classification task. Here we present a new feature selection method based on three steps to detect class-specific biomarkers in high-dimensional data sets. The first step detects the differentially expressed genes according to the experimental conditions tested in the experimental design, the second step filters out the features with low discriminative power, and the third step detects the class-specific features and defines the final biomarker as the union of the class-specific features. The proposed procedure is tested on two microarray datasets, one characterized by a strong imbalance between the sizes of the classes and the other one where the class sizes are perfectly balanced. We show that, using the proposed feature selection procedure, the classification performance of a Support Vector Machine on the imbalanced data set reaches 82%, whereas other methods do not exceed 73%. Furthermore, in the case of the perfectly balanced dataset, the classification performance is comparable with that of other methods. Finally, the Gene Ontology enrichments performed on the signatures selected with the proposed pipeline confirm the biological relevance of our methodology. The package implementing Peculiar Genes Selection, 'PGS', is available for R users at: http://github.com/mbeccuti/PGS.
Affiliation(s)
- Federica Martina
- Computer Science Department, University of Turin, Turin, Italy
- GSK Vaccines, Siena, Italy
| | - Marco Beccuti
- Computer Science Department, University of Turin, Turin, Italy
| | | | | |
|
30
|
Garcia-Chimeno Y, Garcia-Zapirain B, Gomez-Beldarrain M, Fernandez-Ruanova B, Garcia-Monco JC. Automatic migraine classification via feature selection committee and machine learning techniques over imaging and questionnaire data. BMC Med Inform Decis Mak 2017; 17:38. [PMID: 28407777 PMCID: PMC5390380 DOI: 10.1186/s12911-017-0434-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2016] [Accepted: 03/29/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Feature selection methods are commonly used to identify subsets of relevant features to facilitate the construction of models for classification, yet little is known about how feature selection methods perform on diffusion tensor images (DTIs). In this study, feature selection and machine learning classification methods were tested for the purpose of automating the diagnosis of migraines using both DTIs and questionnaire answers related to emotion and cognition, factors that influence pain perception. METHODS We selected 52 adult subjects for the study, divided into three groups: a control group (15), subjects with sporadic migraine (19), and subjects with chronic migraine and medication overuse (18). These subjects underwent magnetic resonance imaging with diffusion tensor to assess the white matter pathway integrity of the regions of interest involved in pain and emotion. The tests also gathered data about pathology. The DTI images and test results were then passed to feature selection algorithms (Gradient Tree Boosting, L1-based, Random Forest and Univariate) to reduce the features of the first dataset, and to classification algorithms (SVM (Support Vector Machine), Boosting (Adaboost) and Naive Bayes) to classify the migraine group. Moreover, we implemented a committee method based on the feature selection algorithms to improve classification accuracy. RESULTS When classifying the migraine group, the greatest improvements in accuracy were made using the proposed committee-based feature selection method. Using this approach, the accuracy of classification into three types improved from 67 to 93% with the Naive Bayes classifier, from 90 to 95% with the support vector machine classifier, and from 93 to 94% with boosting. The features determined to be most useful for classification were related to pain, analgesics and the left uncinate brain (associated with pain and emotions).
CONCLUSIONS The proposed feature selection committee method improved the performance of migraine diagnosis classifiers compared to individual feature selection methods, producing a robust system that achieved over 90% accuracy in all classifiers. The results suggest that the proposed methods can be used to support specialists in the classification of migraines in patients undergoing magnetic resonance imaging.
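One way to picture a feature selection committee is as a vote among several selectors. The sketch below is only an illustration under stated assumptions: the abstract does not say how the committee combines its members, so the majority-style rule (`min_votes`), the function name `committee_select`, and all feature names are hypothetical.

```python
from collections import Counter

def committee_select(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors.

    Assumption: a simple vote count; the cited study may combine
    its selectors differently.
    """
    # Count each feature once per selector, even if a selector lists it twice.
    votes = Counter(f for chosen in selections for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)

# Hypothetical outputs of four selectors (names are illustrative only).
selections = [
    ["fa_left_uncinate", "pain_score", "analgesic_use"],
    ["pain_score", "analgesic_use", "age"],
    ["fa_left_uncinate", "pain_score"],
    ["pain_score", "anxiety_score"],
]
committee = committee_select(selections, min_votes=2)
print(committee)
```

Raising `min_votes` makes the committee stricter: fewer features survive, but each is supported by more of the individual selectors.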
Affiliation(s)
- Yolanda Garcia-Chimeno
  - DeustoTech - Fundacion Deusto, Avda. Universidades, 24, Bilbao, 48007 Spain
  - Facultad Ingenieria, Universidad de Deusto, Avda. Universidades, 24, Bilbao, 48007 Spain
- Begonya Garcia-Zapirain
  - DeustoTech - Fundacion Deusto, Avda. Universidades, 24, Bilbao, 48007 Spain
  - Facultad Ingenieria, Universidad de Deusto, Avda. Universidades, 24, Bilbao, 48007 Spain
- Marian Gomez-Beldarrain
  - Service of Neurology Hospital de Galdakao-Usansolo, Barrio Labeaga, S/N, Galdakao, 48960 Spain
- Juan Carlos Garcia-Monco
  - Research and Innovation Department, Magnetic Resonance Imaging Unit, OSATEK, Alameda Urquijo, 36, Bilbao, 48011 Spain
|
31
|
Hybridizing Cartesian Genetic Programming and Harmony Search for adaptive feature construction in supervised learning problems. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2016.09.049] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]
|
32
|
Roy A. Examining dynamic functional relationships in a pathological brain using evolutionary computation. Soft Comput 2017. [DOI: 10.1007/s00500-017-2496-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
33
|
A Meta-Review of Feature Selection Techniques in the Context of Microarray Data. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2017. [DOI: 10.1007/978-3-319-56148-6_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/18/2022]
|
34
|
Gupta MK, Behara SK, Vadde R. In silico analysis of differential gene expressions in biliary stricture and hepatic carcinoma. Gene 2016; 597:49-58. [PMID: 27777109 DOI: 10.1016/j.gene.2016.10.032] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 07/28/2016] [Revised: 10/15/2016] [Accepted: 10/19/2016] [Indexed: 12/16/2022]
Abstract
An in silico attempt was made to identify the key hub genes that are differentially expressed in biliary stricture and hepatic carcinoma. The gene expression data set GSE34166, which contains 10 biliary stricture samples (4 benign control and 6 malignant carcinoma), was downloaded from the GEO database to screen for key hub genes associated with the disease. R package scripts identified 85 differentially expressed genes. These genes were then uploaded to the WebGestalt database, which identified nine key genes. Protein-protein interaction networks were constructed using the STRING database and Gephi software, and gene ontology was studied through WebGestalt. Finally, we identified four key genes (CXCR4, ADH1C, ABCB1 and ADH1A) associated with liver carcinoma, which were further cross-validated against Liverome, the Protein Atlas database and the literature. In addition, transcription factors and their binding sites were also studied. The identified hub genes and their transcription factors are probable targets for future drug design.
Affiliation(s)
- Manoj Kumar Gupta
  - Department of Biotechnology & Bioinformatics, Yogi Vemana University, Kadapa 516003, Andhra Pradesh, India.
- Santosh Kumar Behara
  - Biomedical Informatics Centre, Regional Medical Research Centre (ICMR), Bhubaneswar 751023, Odisha, India.
- Ramakrishna Vadde
  - Department of Biotechnology & Bioinformatics, Yogi Vemana University, Kadapa 516003, Andhra Pradesh, India.
|
35
|
Chen D, Sarkar S, Candia J, Florczyk SJ, Bodhak S, Driscoll MK, Simon CG, Dunkers JP, Losert W. Machine learning based methodology to identify cell shape phenotypes associated with microenvironmental cues. Biomaterials 2016; 104:104-18. [PMID: 27449947 PMCID: PMC11305428 DOI: 10.1016/j.biomaterials.2016.06.040] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 04/27/2016] [Revised: 06/17/2016] [Accepted: 06/19/2016] [Indexed: 01/02/2023]
Abstract
Cell morphology has been identified as a potential indicator of stem cell response to biomaterials. However, determination of cell shape phenotype in biomaterials is complicated by heterogeneous cell populations, microenvironment heterogeneity, and multi-parametric definitions of cell morphology. To associate cell morphology with cell-material interactions, we developed a shape phenotyping framework based on support vector machines. A feature selection procedure was implemented to select the most significant combination of cell shape metrics to build classifiers with both accuracy and stability to identify and predict microenvironment-driven morphological differences in heterogeneous cell populations. The analysis was conducted at a multi-cell level, where a "supercell" method used average shape measurements of small groups of single cells to account for heterogeneous populations and microenvironment. A subsampling validation algorithm revealed the range of supercell sizes and sample sizes needed for classifier stability and generalization capability. As an example, the responses of human bone marrow stromal cells (hBMSCs) to fibrous vs flat microenvironments were compared on day 1. Our analysis showed that 57 cells (grouped into supercells of size 4) are the minimum needed for phenotyping. The analysis identified a combination of minor axis length, solidity, and mean negative curvature as the strongest early shape-based indicator of hBMSC response to a fibrous microenvironment.
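The "supercell" idea of averaging shape measurements over small groups of single cells can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_supercells`, the grouping of cells in list order, the dropping of an incomplete trailing group, and all measurement values are assumptions made for the example.

```python
from statistics import mean

def make_supercells(cells, size):
    """Average every shape metric over consecutive groups of `size` cells.

    Assumptions: cells are grouped in the order given, and a trailing
    group with fewer than `size` cells is dropped.
    """
    supercells = []
    for start in range(0, len(cells) - size + 1, size):
        group = cells[start:start + size]
        supercells.append(
            {metric: mean(c[metric] for c in group) for metric in group[0]}
        )
    return supercells

# Hypothetical single-cell measurements (values are illustrative only).
cells = [
    {"minor_axis": 10.0, "solidity": 0.5},
    {"minor_axis": 12.0, "solidity": 0.5},
    {"minor_axis": 14.0, "solidity": 1.0},
    {"minor_axis": 16.0, "solidity": 1.0},
    {"minor_axis": 20.0, "solidity": 0.25},
    {"minor_axis": 20.0, "solidity": 0.25},
    {"minor_axis": 20.0, "solidity": 0.25},
    {"minor_axis": 20.0, "solidity": 0.25},
    {"minor_axis": 99.0, "solidity": 0.9},  # leftover cell, not used
]
supercells = make_supercells(cells, size=4)
print(supercells)
```

Each supercell then serves as one sample for the classifier, which trades sample count for less noisy per-sample measurements.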
Affiliation(s)
- Desu Chen
  - Biophysics Program, University of Maryland, College Park, MD, United States
- Sumona Sarkar
  - Biosystems & Biomaterials Division, National Institute of Standards & Technology, Gaithersburg, MD, United States
- Julián Candia
  - Department of Physics, University of Maryland, College Park, MD, United States; School of Medicine, University of Maryland, Baltimore, MD, United States; Center for Human Immunology, National Institutes of Health, Bethesda, MD, United States
- Stephen J Florczyk
  - Biosystems & Biomaterials Division, National Institute of Standards & Technology, Gaithersburg, MD, United States
- Subhadip Bodhak
  - Biosystems & Biomaterials Division, National Institute of Standards & Technology, Gaithersburg, MD, United States
- Meghan K Driscoll
  - Department of Physics, University of Maryland, College Park, MD, United States
- Carl G Simon
  - Biosystems & Biomaterials Division, National Institute of Standards & Technology, Gaithersburg, MD, United States
- Joy P Dunkers
  - Biosystems & Biomaterials Division, National Institute of Standards & Technology, Gaithersburg, MD, United States
- Wolfgang Losert
  - Department of Physics, University of Maryland, College Park, MD, United States.
|
36
|
Lai HM, Albrecht AA, Steinhöfel KK. iRDA: a new filter towards predictive, stable, and enriched candidate genes. BMC Genomics 2015; 16:1041. [PMID: 26647162 PMCID: PMC4673793 DOI: 10.1186/s12864-015-2129-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/09/2015] [Accepted: 10/22/2015] [Indexed: 11/28/2022] Open
Abstract
Background Gene expression profiling using high-throughput screening (HTS) technologies allows clinical researchers to find prognostic gene signatures that could better discriminate between different phenotypes and serve as potential biological markers in disease diagnoses. In recent years, many feature selection methods have been devised for finding such discriminative genes, and more recently information theoretic filters have also been introduced for capturing feature-to-class relevance and feature-to-feature correlations in microarray-based classification. Methods In this paper, we present and fully formulate a new multivariate filter, iRDA, for the discovery of HTS gene-expression candidate genes. The filter constitutes a four-step framework and includes feature relevance, feature redundancy, and feature interdependence in the context of feature-pairs. The method is based upon approximate Markov blankets, information theory, and several heuristic search strategies with forward, backward and insertion phases, and it aims at higher-order gene interactions. Results To show the strengths of iRDA, three performance measures, two evaluation schemes, two stability index sets, and gene set enrichment analysis (GSEA) are all employed in our experimental studies. Its effectiveness has been validated by using seven well-known cancer gene-expression benchmarks and four other disease experiments, including a comparison to three popular information theoretic filters. In terms of classification performance, candidate genes selected by iRDA perform better than the sets discovered by the other three filters. Two stability measures indicate that iRDA is the most robust with the least variance. GSEA shows that iRDA produces more statistically enriched gene sets on five out of the six benchmark datasets.
Conclusions Through the classification performance, the stability performance, and the enrichment analysis, iRDA is a promising filter to find predictive, stable, and enriched gene-expression candidate genes.
Affiliation(s)
- Hung-Ming Lai
  - Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
- Andreas A Albrecht
  - School of Science and Technology, Middlesex University, Burroughs, London, NW4 4BT, UK.
- Kathleen K Steinhöfel
  - Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
|
37
|
Dimensionality Reduction in Complex Medical Data: Improved Self-Adaptive Niche Genetic Algorithm. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:794586. [PMID: 26649071 PMCID: PMC4663319 DOI: 10.1155/2015/794586] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 07/24/2015] [Revised: 09/24/2015] [Accepted: 10/04/2015] [Indexed: 02/05/2023]
Abstract
With the development of medical technology, more and more parameters are produced to describe the human physiological condition, forming high-dimensional clinical datasets. In clinical analysis, data are commonly utilized to establish mathematical models and carry out classification. High-dimensional clinical data will increase the complexity of classification, which is often utilized in the models, and thus reduce efficiency. The Niche Genetic Algorithm (NGA) is an excellent algorithm for dimensionality reduction. However, in the conventional NGA, the niche distance parameter is set in advance, which prevents it from adjusting to the environment. In this paper, an Improved Niche Genetic Algorithm (INGA) is introduced. It employs a self-adaptive niche-culling operation in the construction of the niche environment to improve the population diversity and prevent local optimal solutions. The INGA was verified in a stratification model for sepsis patients. The results show that, by applying INGA, the feature dimensionality of datasets was reduced from 77 to 10 and that the model achieved an accuracy of 92% in predicting 28-day death in sepsis patients, which is significantly higher than other methods.
|
38
|
|
39
|
Hira ZM, Gillies DF. A Review of Feature Selection and Feature Extraction Methods Applied on Microarray Data. Adv Bioinformatics 2015; 2015:198363. [PMID: 26170834 PMCID: PMC4480804 DOI: 10.1155/2015/198363] [Citation(s) in RCA: 291] [Impact Index Per Article: 32.3] [Received: 03/25/2015] [Accepted: 05/18/2015] [Indexed: 02/07/2023] Open
Abstract
We summarise various ways of performing dimensionality reduction on high-dimensional microarray data. Many different feature selection and feature extraction methods exist and they are being widely used. All these methods aim to remove redundant and irrelevant features so that classification of new instances will be more accurate. A popular source of data is microarrays, a biological platform for gathering gene expressions. Analysing microarrays can be difficult due to the size of the data they provide. In addition the complicated relations among the different genes make analysis more difficult and removing excess features can improve the quality of the results. We present some of the most popular methods for selecting significant features and provide a comparison between them. Their advantages and disadvantages are outlined in order to provide a clearer idea of when to use each one of them for saving computational time and resources.
Affiliation(s)
- Zena M. Hira
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
- Duncan F. Gillies
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
|
40
|
Völkel G, Lausser L, Schmid F, Kraus JM, Kestler HA. Sputnik: ad hoc distributed computation. Bioinformatics 2015; 31:1298-301. [PMID: 25505087 DOI: 10.1093/bioinformatics/btu818] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/26/2014] [Accepted: 12/05/2014] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION In bioinformatic applications, computationally demanding algorithms are often parallelized to speed up computation. Nevertheless, setting up computational environments for distributed computation is often tedious. The aims of this project were lightweight ad hoc setup and fault-tolerant computation requiring only a Java runtime and no administrator rights, while utilizing all CPU cores most effectively. RESULTS The Sputnik framework provides ad hoc distributed computation on the Java Virtual Machine which uses all supplied CPU cores fully. It provides a graphical user interface for deployment setup and a web user interface displaying the status of current computation jobs. Neither a permanent setup nor administrator privileges are required. We demonstrate the utility of our approach on feature selection of microarray data. AVAILABILITY AND IMPLEMENTATION The Sputnik framework is available on GitHub at http://github.com/sysbio-bioinf/sputnik under the Eclipse Public License. CONTACT hkestler@fli-leibniz.de or hans.kestler@uni-ulm.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gunnar Völkel
  - Core Unit Medical Systems Biology, Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany and Leibniz Institute for Age Research-Fritz Lipmann Institute and FSU Jena, D-07745 Jena
- Ludwig Lausser
  - Core Unit Medical Systems Biology, Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany and Leibniz Institute for Age Research-Fritz Lipmann Institute and FSU Jena, D-07745 Jena
- Florian Schmid
  - Core Unit Medical Systems Biology, Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany and Leibniz Institute for Age Research-Fritz Lipmann Institute and FSU Jena, D-07745 Jena
- Johann M Kraus
  - Core Unit Medical Systems Biology, Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany and Leibniz Institute for Age Research-Fritz Lipmann Institute and FSU Jena, D-07745 Jena
- Hans A Kestler
  - Core Unit Medical Systems Biology, Theoretical Computer Science, Ulm University, D-89069 Ulm, Germany and Leibniz Institute for Age Research-Fritz Lipmann Institute and FSU Jena, D-07745 Jena
|
41
|
Dessì N, Pes B, Cannas LM. An Evolutionary Approach for Balancing Effectiveness and Representation Level in Gene Selection. JOURNAL OF INFORMATION TECHNOLOGY RESEARCH 2015. [DOI: 10.4018/jitr.2015040102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
As data mining develops and expands to new application areas, feature selection also reveals various aspects to be considered. This paper underlines two aspects that seem to categorize the large body of available feature selection algorithms: the effectiveness and the representation level. The effectiveness deals with selecting the minimum set of variables that maximize the accuracy of a classifier, and the representation level concerns discovering how relevant the variables are for the domain of interest. To balance these aspects, the paper proposes an evolutionary framework for feature selection that expresses a hybrid method organized in layers, each of which exploits a specific search strategy. Extensive experiments on gene selection from DNA-microarray datasets are presented and discussed. Results indicate that the framework compares well with different hybrid methods proposed in the literature, as it is capable of finding well-suited subsets of informative features while improving classification accuracy.
Affiliation(s)
- Nicoletta Dessì
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Barbara Pes
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Laura Maria Cannas
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
|
42
|
Wang Y, Fan X, Cai Y. A comparative study of improvements Pre-filter methods bring on feature selection using microarray data. Health Inf Sci Syst 2014; 2:7. [PMID: 25825671 PMCID: PMC4340279 DOI: 10.1186/2047-2501-2-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/18/2014] [Accepted: 10/03/2014] [Indexed: 12/13/2022] Open
Abstract
Background Feature selection techniques have become a clear need in biomarker discovery with the development of microarrays. However, the high-dimensional nature of microarray data makes feature selection time-consuming. To overcome such difficulties, filtering data according to background knowledge before applying feature selection techniques has become a hot topic in microarray analysis. Different methods may affect final results greatly, so it is important to evaluate these pre-filter methods in a systematic way. Methods In this paper, we compared the performance of statistical-based pre-filter methods, biological-based pre-filter methods, and their combination on microRNA-mRNA parallel expression profiles, using L1 logistic regression as the feature selection technique. Four types of data were built for both microRNA and mRNA expression profiles. Results Results showed that pre-filter methods could greatly reduce the number of features for both mRNA and microRNA expression datasets. The features selected after pre-filter procedures were shown to be significant at biological levels such as biological process and microRNA functions. Analyses of classification performance based on precision showed that the pre-filter methods were necessary when the number of raw features was much larger than the number of samples. Computing time was greatly shortened in all cases after pre-filter procedures. Conclusions With similar or better classification performance and fewer but biologically significant features, pre-filter-based feature selection should be taken into consideration if researchers need fast results when facing complex computing problems in bioinformatics.
Affiliation(s)
- Yingying Wang
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Xiaomao Fan
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Yunpeng Cai
  - Research Center for Biomedical Information, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
|
43
|
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Benítez J, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.05.042] [Citation(s) in RCA: 386] [Impact Index Per Article: 38.6] [Indexed: 11/28/2022]
|
44
|
Mahmoud O, Harrison A, Perperoglou A, Gul A, Khan Z, Metodiev MV, Lausen B. A feature selection method for classification within functional genomics experiments based on the proportional overlapping score. BMC Bioinformatics 2014; 15:274. [PMID: 25113817 PMCID: PMC4141116 DOI: 10.1186/1471-2105-15-274] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 02/21/2014] [Accepted: 08/01/2014] [Indexed: 11/16/2022] Open
Abstract
Background Microarray technology, as well as other functional genomics experiments, allows simultaneous measurements of thousands of genes within each sample. Both the prediction accuracy and interpretability of a classifier could be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called proportional overlapping score (POS), of a feature’s relevance to a classification task. Results We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The experimental results of classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves a better performance. Conclusions A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
Affiliation(s)
- Osama Mahmoud
  - Department of Mathematical Sciences, University of Essex, Wivenhoe Park, CO4 3SQ Colchester, UK.
|
45
|
A comparative analysis of swarm intelligence techniques for feature selection in cancer classification. ScientificWorldJournal 2014; 2014:693831. [PMID: 25157377 PMCID: PMC4137534 DOI: 10.1155/2014/693831] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/20/2014] [Accepted: 06/18/2014] [Indexed: 11/17/2022] Open
Abstract
Feature selection in cancer classification is a central area of research in the field of bioinformatics and is used to select the informative genes from the thousands of genes on a microarray. The genes are ranked based on T-statistics, signal-to-noise ratio (SNR), and F-test values. The swarm intelligence (SI) technique finds the informative genes among the top-m ranked genes. These selected genes are used for classification. In this paper, the shuffled frog leaping with Lévy flight (SFLLF) algorithm is proposed for feature selection. In SFLLF, the Lévy flight is included to avoid premature convergence of the shuffled frog leaping (SFL) algorithm. SI techniques such as particle swarm optimization (PSO), cuckoo search (CS), SFL, and SFLLF are used for feature selection, which identifies informative genes for classification. The k-nearest neighbour (k-NN) technique is used to classify the samples. The proposed work is applied to 10 different benchmark datasets and examined with SI techniques. The experimental results show that the k-NN classifier with SFLLF feature selection outperforms PSO, CS, and SFL.
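The ranking step that precedes the swarm search can be illustrated with the signal-to-noise ratio. The abstract does not give a formula, so this sketch assumes the commonly used two-class definition, SNR = (mean_1 - mean_0) / (sd_1 + sd_0); the function names, gene names and expression values are hypothetical.

```python
from statistics import mean, stdev

def snr(values_a, values_b):
    """Two-class signal-to-noise ratio (assumed definition, see lead-in).

    Assumes each class has at least two samples and the two standard
    deviations do not both equal zero.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def top_m_genes(expr, labels, m):
    """Rank genes by absolute SNR between classes 1 and 0; return the top m."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(snr(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:m]

# Hypothetical expression values for six samples (three per class).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "geneA": [9, 10, 11, 1, 2, 3],  # well separated classes
    "geneB": [5, 6, 7, 5, 6, 7],    # identical class distributions
    "geneC": [4, 6, 8, 2, 4, 6],    # weak separation
}
top = top_m_genes(expr, labels, m=2)
print(top)
```

A search heuristic such as PSO or SFL would then explore subsets of this reduced pool instead of all genes on the array.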
|
46
|
Islam AKMT, Jeong BS, Bari ATMG, Lim CG, Jeon SH. MapReduce based parallel gene selection method. APPL INTELL 2014. [DOI: 10.1007/s10489-014-0561-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 02/06/2023]
|
47
|
Kim KJ, Cho SB. Meta-classifiers for high-dimensional, small sample classification for gene expression analysis. Pattern Anal Appl 2014. [DOI: 10.1007/s10044-014-0369-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/25/2022]
|
48
|
Abel L, Kutschki S, Turewicz M, Eisenacher M, Stoutjesdijk J, Meyer HE, Woitalla D, May C. Autoimmune profiling with protein microarrays in clinical applications. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2014; 1844:977-87. [PMID: 24607371 DOI: 10.1016/j.bbapap.2014.02.023] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 07/31/2013] [Revised: 02/18/2014] [Accepted: 02/27/2014] [Indexed: 02/05/2023]
Abstract
In recent years, knowledge about immune-related disorders has substantially increased, especially in the field of central nervous system (CNS) disorders. Recent innovations in protein-related microarray technology have enabled the analysis of interactions between numerous samples and up to 20,000 targets. Antibodies directed against ion channels, receptors and other synaptic proteins have been identified, and their causative roles in different disorders have been identified. Knowledge about immunological disorders is likely to expand further as more antibody targets are discovered. Therefore, protein microarrays may become an established tool for routine diagnostic procedures in the future. The identification of relevant target proteins requires the development of new strategies to handle and process vast quantities of data so that these data can be evaluated and correlated with relevant clinical issues, such as disease progression, clinical manifestations and prognostic factors. This review will mainly focus on new protein array technologies, which allow the processing of a large number of samples, and their various applications with a deeper insight into their potential use as diagnostic tools in neurodegenerative diseases and other diseases. This article is part of a Special Issue entitled: Biomarkers: A Proteomic Challenge.
Affiliation(s)
- Laura Abel
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany
- Simone Kutschki
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany
- Michael Turewicz
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany
- Martin Eisenacher
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany
- Jale Stoutjesdijk
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany
- Helmut E Meyer
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany; Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
- Dirk Woitalla
  - S. Josef Hospital, Ruhr-University Bochum, 44780 Bochum, Germany; St. Josef-Krankenhaus Kupferdreh, Heidbergweg 22-24, 45257 Essen, Germany
- Caroline May
  - Department of Medical Proteomics/Bioanalytics, Medizinisches Proteom-Center, Ruhr-Universität Bochum, 44801 Bochum, Germany.
|
49
|
Li S, Kang L, Zhao XM. A survey on evolutionary algorithm based hybrid intelligence in bioinformatics. BIOMED RESEARCH INTERNATIONAL 2014; 2014:362738. [PMID: 24729969 PMCID: PMC3963368 DOI: 10.1155/2014/362738] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 12/03/2013] [Revised: 01/29/2014] [Accepted: 01/29/2014] [Indexed: 11/18/2022]
Abstract
With the rapid advance of genomics, proteomics, metabolomics, and other omics technologies over the past decades, a tremendous amount of molecular biology data has been produced. Analyzing and interpreting these data with conventional intelligent techniques, for example, support vector machines, is becoming a major challenge for bioinformaticians. Recently, hybrid intelligent methods, which integrate several standard intelligent approaches, have become increasingly popular owing to their robustness and efficiency. Specifically, hybrid intelligent approaches based on evolutionary algorithms (EAs) are widely used in various fields because of the efficiency and robustness of EAs. In this review, we introduce the applications of hybrid intelligent methods in bioinformatics, in particular those based on evolutionary algorithms. We focus on their applications to three common problems in bioinformatics: feature selection, parameter estimation, and reconstruction of biological networks.
Affiliation(s)
- Shan Li
- Department of Mathematics, Shanghai University, Shanghai 200444, China
- Liying Kang
- Department of Mathematics, Shanghai University, Shanghai 200444, China
- Xing-Ming Zhao
- Department of Computer Science, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
|
50
|
Aflakparast M, Salimi H, Gerami A, Dubé MP, Visweswaran S, Masoudi-Nejad A. Cuckoo search epistasis: a new method for exploring significant genetic interactions. Heredity (Edinb) 2014; 112:666-74. [PMID: 24549111 DOI: 10.1038/hdy.2014.4] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 05/07/2013] [Revised: 12/09/2013] [Accepted: 12/18/2013] [Indexed: 11/09/2022] Open
Abstract
The advent of high-throughput sequencing technology has made it possible to measure millions of single-nucleotide polymorphisms (SNPs) from thousands of individuals. Although these high-dimensional data have paved the way for a better understanding of the genetic architecture of common diseases, they have also given rise to challenges in developing computational methods for learning epistatic relationships among genetic markers. We propose a new method, named cuckoo search epistasis (CSE), for identifying significant epistatic interactions in population-based association studies with a case-control design. This method combines a computationally efficient Bayesian scoring function with an evolutionary-based heuristic search algorithm, and can be efficiently applied to high-dimensional genome-wide SNP data. The experimental results from synthetic data sets show that CSE outperforms existing methods, including multifactorial dimensionality reduction and Bayesian epistasis association mapping. In addition, on a real genome-wide data set related to Alzheimer's disease, CSE identified SNPs that are consistent with previously reported results, showing the utility of CSE for application to genome-wide data.
Affiliation(s)
- M Aflakparast
- [1] Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran [2] Department of Mathematics, Faculty of Sciences, VU University, Amsterdam, The Netherlands
- H Salimi
- Department of Computer Science, University of Tehran, Tehran, Iran
- A Gerami
- Department of Statistics and Mathematics, Islamic Azad University, Qazvin Branch, Qazvin, Iran
- M-P Dubé
- Department of Medicine, Faculty of Medicine, University of Montreal, Montreal, Quebec, Canada
- S Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- A Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
|