1
Bhardwaj P, Tyagi A, Tyagi S, Antão J, Deng Q. Machine learning model for classification of predominantly allergic and non-allergic asthma among preschool children with asthma hospitalization. J Asthma 2023; 60:487-495. [PMID: 35344453 DOI: 10.1080/02770903.2022.2059763] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 10/18/2022]
Abstract
OBJECTIVE Asthma is the most frequent chronic airway illness in preschool children and is difficult to diagnose because of the disease's heterogeneity. This study aimed to investigate different machine learning models and suggest the most effective one for classifying two forms of asthma in preschool children (predominantly allergic asthma and non-allergic asthma) using a minimum number of features.
METHODS After pre-processing, 127 patients (70 with non-allergic asthma and 57 with predominantly allergic asthma) were chosen for final analysis from the Frankfurt dataset, which had asthma-related information on 205 patients. The Random Forest algorithm and the Chi-square test were used to select the key features from a total of 63 features. Six machine learning models (random forest, extreme gradient boosting, support vector machines, adaptive boosting, extra tree classifier, and logistic regression) were then trained and tested using 10-fold stratified cross-validation.
RESULTS Among all features, age, weight, C-reactive protein, eosinophilic granulocytes, oxygen saturation, pre-medication inhaled corticosteroid + long-acting beta2-agonist (PM-ICS + LABA), PM-other (other pre-medication), H-Pulmicort/celestamine (Pulmicort/celestamine during hospitalization), and H-azithromycin (azithromycin during hospitalization) were found to be highly important. The support vector machine with a linear kernel differentiated between predominantly allergic asthma and non-allergic asthma with the highest accuracy (77.8%), a precision of 0.81, a true positive rate of 0.73, a true negative rate of 0.81, an F1 score of 0.81, and a ROC-AUC score of 0.79. Logistic regression was the second-best classifier, with an overall accuracy of 76.2%.
CONCLUSION Predominantly allergic and non-allergic asthma can be classified using machine learning approaches based on nine features. Supplemental data for this article is available online at www.tandfonline.com/ijas.
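The 10-fold stratified cross-validation mentioned in the abstract above can be illustrated with a minimal sketch. It uses only the class sizes reported in the abstract (70 and 57); the function name `stratified_folds` and the round-robin dealing strategy are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Class sizes taken from the abstract: 70 non-allergic, 57 predominantly allergic.
labels = ["non_allergic"] * 70 + ["allergic"] * 57
folds = stratified_folds(labels, k=10)

for test_idx in folds:
    train_idx = [i for f in folds if f is not test_idx for i in f]
    # A classifier would be fitted on train_idx and scored on test_idx here.
    assert len(train_idx) + len(test_idx) == len(labels)
```

Each fold then holds 7 non-allergic and 5 or 6 allergic patients, so every test fold roughly mirrors the overall class balance.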
Affiliation(s)
- Piyush Bhardwaj
  - Centre for Advanced Computational Solutions (C-fACS), Department of Molecular Biosciences, Lincoln University, Lincoln, Christchurch, New Zealand
- Ashish Tyagi
  - Department of Forensic Medicine & Toxicology, SHKM Govt. Medical College, Nuh, Haryana, India
- Shashank Tyagi
  - Department of Forensic Medicine & Toxicology, Lady Hardinge Medical College & Associated Hospitals, New Delhi, India
- Joana Antão
  - Lab3R-Respiratory Research and Rehabilitation Laboratory, School of Health Sciences (ESSUA), Department of Medical Sciences, Institute of Biomedicine (iBiMED), University of Aveiro, Aveiro, Portugal
  - Department of Research and Education, CIRO, Horn, The Netherlands
- Qichen Deng
  - Department of Research and Education, CIRO, Horn, The Netherlands
  - Department of Respiratory Medicine, NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Centre, Maastricht, The Netherlands
  - Faculty of Health, Medicine and Life Sciences, Maastricht University Medical Centre, Limburg, The Netherlands
2
A combinatory algorithm for identifying genes in childhood acute lymphoblastic leukemia. GENE REPORTS 2022. [DOI: 10.1016/j.genrep.2021.101433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
3
Bhardwaj P, Tiwari P, Olejar K, Parr W, Kulasiri D. A machine learning application in wine quality prediction. MACHINE LEARNING WITH APPLICATIONS 2022. [DOI: 10.1016/j.mlwa.2022.100261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
4
Tyagi A, Tiwari P, Bhardwaj P, Chawla H. Prognosis of sexual dimorphism with unfused hyoid bone: Artificial intelligence informed decision making with discriminant analysis. Sci Justice 2021; 61:789-796. [PMID: 34802653 DOI: 10.1016/j.scijus.2021.10.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/11/2021] [Revised: 06/22/2021] [Accepted: 10/04/2021] [Indexed: 11/18/2022]
Abstract
Forensic experts have proposed diverse sex identification methods based on the metric and non-metric skeletal features of various bones. The main focus of the present study is to calculate sexual dimorphism in the human unfused or disarticulated hyoid bone and compare it with studies conducted by other researchers. For this study, 293 unfused hyoid bones were collected and examined from 173 male and 120 female cadavers of the northwest Indian population aged 15 to 80 years. Initially, discriminant analysis was performed on the dataset to predict sex and to identify the crucial variables for sexual dimorphism. Later, the significant variables identified by the discriminant analysis were used in machine learning approaches to improve the accuracy of sex determination. The standard scaler method was used to pre-process the data before machine learning analysis. To prevent overfitting and underfitting, 70% of the dataset was used to train the model and the remaining data were used to test it. According to the discriminant analysis, body length (BL) and body height (BH) were highly significant for sex determination and predicted sex with 75.1% accuracy. However, machine learning approaches such as the XG Boost classifier increased the accuracy to 83%, with sensitivity and specificity scores of 0.81 and 0.84, respectively. Moreover, the ROC-AUC score achieved by the XG Boost classifier is 0.89, indicating that machine learning can improve sex determination accuracy to an appropriate standard.
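The pre-processing described in the abstract above (standard scaling and a 70/30 train/test split) can be sketched as follows. This assumes that "standard scaler" means z-score standardization with the population standard deviation, fitted on the training portion only; the helper names and the randomly generated measurements are hypothetical and are not data from the study.

```python
import random
import statistics

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into training and test portions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def fit_scaler(train_rows):
    """Learn per-column mean and population standard deviation from training data."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return means, stds

def transform(rows, means, stds):
    """Apply z = (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical body length and body height values; only the count (293) is from the abstract.
rng = random.Random(1)
rows = [[rng.gauss(25.0, 3.0), rng.gauss(11.0, 1.5)] for _ in range(293)]

train, test = train_test_split(rows)
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)  # test data reuses the training statistics
```

Fitting the scaler on the training rows alone keeps information from the held-out 30% from leaking into the model.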
Affiliation(s)
- Ashish Tyagi
  - Department of Forensic Medicine & Toxicology, SHKM Govt. Medical College, Nalhar, Nuh, Haryana 122107, India
- Parul Tiwari
  - Centre for Advanced Computational Solutions (C-fACS), Department of Molecular Biosciences, Lincoln University, PO Box 85084, Lincoln 7647, Christchurch, New Zealand
- Piyush Bhardwaj
  - Centre for Advanced Computational Solutions (C-fACS), Department of Molecular Biosciences, Lincoln University, PO Box 85084, Lincoln 7647, Christchurch, New Zealand
- Hitesh Chawla
  - Department of Forensic Medicine & Toxicology, SHKM Govt. Medical College, Nalhar, Nuh, Haryana 122107, India
5
Hameed SS, Hassan R, Hassan WH, Muhammadsharif FF, Latiff LA. HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets. PLoS One 2021; 16:e0246039. [PMID: 33507983 PMCID: PMC7842997 DOI: 10.1371/journal.pone.0246039] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/18/2020] [Accepted: 01/12/2021] [Indexed: 11/24/2022] Open
Abstract
The selection and classification of genes is essential for identifying genes related to a specific disease. Developing a user-friendly application that combines statistical rigor with machine learning functionality to help biomedical researchers and end users is of great importance. In this work, a novel stand-alone application based on a graphical user interface (GUI) is developed to perform the full functionality of gene selection and classification in high dimensional datasets. The application, called HDG-select, is validated on eleven high dimensional datasets in CSV and GEO SOFT formats. The proposed tool uses the efficient combined filter-GBPSO-SVM algorithm and has been made freely available to users. The proposed HDG-select outperformed other tools reported in the literature and offered competitive performance, accessibility, and functionality.
Affiliation(s)
- Shilan S. Hameed
  - Computer Systems and Networks (CSN), Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
  - Directorate of Information Technology, Koya University, Koya, Kurdistan Region-F.R., Iraq
- Rohayanti Hassan
  - School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Johor Bahru, Johor, Malaysia
- Wan Haslina Hassan
  - Computer Systems and Networks (CSN), Malaysia-Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
- Fahmi F. Muhammadsharif
  - Department of Physics, Faculty of Science and Health, Koya University, Koya, Kurdistan Region-F.R., Iraq
- Liza Abdul Latiff
  - U-BAN Research Group, Razak Faculty of Technology and Informatics, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
6
Das S, Rai SN. Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E1205. [PMID: 33286973 PMCID: PMC7712650 DOI: 10.3390/e22111205] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/08/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 12/16/2022]
Abstract
Selection of biologically relevant genes from high-dimensional expression data is a key research problem in gene expression genomics. Most of the available gene selection methods are based on either relevancy or redundancy measures, which are usually judged through post-selection classification accuracy. In these methods the genes were ranked on a single high-dimensional expression dataset, which led to the selection of spuriously associated and redundant genes. Hence, we developed a statistical approach that combines a support vector machine with Maximum Relevance and Minimum Redundancy under a sound statistical setup for the selection of biologically relevant genes. Here, the genes were selected through statistical significance values computed using a nonparametric test statistic under a bootstrap-based subject sampling model. Further, a systematic and rigorous evaluation of the proposed approach against nine existing competitive methods was carried out on six different real crop gene expression datasets. This performance analysis was carried out under three comparison settings, i.e., subject classification and biologically relevant criteria based on quantitative trait loci and gene ontology. Our analytical results showed that the proposed approach selects genes which are more biologically relevant than those selected by the existing methods. Moreover, the proposed approach was also found to perform better than the competitive existing methods. The proposed statistical approach provides a framework for combining filter and wrapper methods of gene selection.
Affiliation(s)
- Samarendra Das
  - Division of Statistical Genetics, Indian Council of Agricultural Research (ICAR)-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
  - Netaji Subhas-Indian Council of Agricultural Research (ICAR) International Fellow, Indian Council of Agricultural Research, Krishi Bhawan, New Delhi 110001, India
  - Biostatistics and Bioinformatics Facility, JG Brown Cancer Center, University of Louisville, Louisville, KY 40292, USA
  - School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY 40292, USA
- Shesh N. Rai
  - Biostatistics and Bioinformatics Facility, JG Brown Cancer Center, University of Louisville, Louisville, KY 40292, USA
  - School of Interdisciplinary and Graduate Studies, University of Louisville, Louisville, KY 40292, USA
  - Alcohol Research Center, University of Louisville, Louisville, KY 40292, USA
  - Department of Hepatobiology and Toxicology, University of Louisville, Louisville, KY 40292, USA
  - Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40292, USA
  - Wendell Cherry Chair in Clinical Trial Research, University of Louisville, Louisville, KY 40292, USA
7
Huang S, Blatti C, Sinha S, Parameswaran A. Uncovering Effective Explanations for Interactive Genomic Data Analysis. PATTERNS 2020; 1:100093. [PMID: 33205133 PMCID: PMC7660438 DOI: 10.1016/j.patter.2020.100093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2020] [Revised: 07/13/2020] [Accepted: 08/05/2020] [Indexed: 10/25/2022]
8
Cherlin S, Wason JMS. Developing and testing high‐efficacy patient subgroups within a clinical trial using risk scores. Stat Med 2020; 39:3285-3298. [PMID: 32662542 PMCID: PMC7611900 DOI: 10.1002/sim.8665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/12/2019] [Revised: 03/18/2020] [Accepted: 05/28/2020] [Indexed: 12/13/2022]
Abstract
High-dimensional information about patients collected in clinical trials (such as genomic, imaging, and wearable-technology data) has the potential to be informative about the efficacy of a new treatment in situations where only a subset of patients benefits from the treatment. The adaptive signature design (ASD) method has been proposed for developing and testing the efficacy of a treatment in a high-efficacy patient group (the sensitive group) using genetic data. The method requires selection of three tuning parameters, which may be highly computationally expensive. We propose a variation of the ASD method, the cross-validated risk scores (CVRS) design method, that does not require selection of any tuning parameters. The method is based on computing a risk score for each patient and dividing the patients into clusters using a nonparametric clustering procedure. We assess the properties of CVRS against the originally proposed cross-validated ASD using simulation data and a real psychiatry trial. CVRS, as assessed for various sample sizes and response rates, substantially reduces the computational time required. In many simulation scenarios, there is a substantial improvement in the ability to correctly identify the sensitive group and in the power of the design to detect a treatment effect in the sensitive group. We illustrate the application of the CVRS method on the psychiatry trial.
Affiliation(s)
- Svetlana Cherlin
  - Newcastle Clinical Trials Unit, Newcastle University, Newcastle upon Tyne, UK
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
- James M. S. Wason
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
  - MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK
9
Considine EC. The Search for Clinically Useful Biomarkers of Complex Disease: A Data Analysis Perspective. Metabolites 2019; 9:E126. [PMID: 31269649 PMCID: PMC6680669 DOI: 10.3390/metabo9070126] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/19/2019] [Revised: 06/20/2019] [Accepted: 06/28/2019] [Indexed: 12/25/2022] Open
Abstract
Unmet clinical diagnostic needs exist for many complex diseases, which (it is hoped) will be solved by the discovery of metabolomics biomarkers. However, at present, no diagnostic tests based on metabolomics have yet been introduced to the clinic. This review is presented as a research perspective on how data analysis methods in metabolomics biomarker discovery may contribute to the failure of biomarker studies and suggests how such failures might be mitigated. The study design and data pretreatment steps are reviewed briefly in this context, and the actual data analysis step is examined more closely.
Affiliation(s)
- Elizabeth C Considine
  - The Irish Centre for Fetal and Neonatal Translational Research (INFANT), Department of Obstetrics and Gynaecology, University College Cork, T12 YE02 Cork, Ireland
10
Bhowmick SS, Bhattacharjee D, Rato L. In silico markers: an evolutionary and statistical approach to select informative genes of human breast cancer subtypes. Genes Genomics 2019; 41:1371-1382. [PMID: 31004329 DOI: 10.1007/s13258-019-00816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2018] [Accepted: 04/02/2019] [Indexed: 10/27/2022]
Abstract
BACKGROUND Recent advances in bioinformatics offer the ability to identify informative genes from high dimensional gene expression data. Selection of informative genes from these large datasets has emerged as an issue of major concern among researchers.
OBJECTIVE Gene functionality and regulatory mechanisms can be understood through the analysis of these gene expression data. Here, we present a computational method to identify informative genes for breast cancer subtypes such as Basal, human epidermal growth factor receptor 2 (Her2), luminal A (LumA), and luminal B (LumB).
METHODS The proposed In Silico Markers method is a wrapper feature selection method based on the Least Absolute Shrinkage and Selection Operator (LASSO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), with a Support Vector Machine (SVM) as the classifier. Moreover, a composite measure consisting of the relevance, redundancy, and rank score of frequently appearing genes is used to select informative genes.
RESULTS The informative genes are validated by statistical and biologically relevant criteria. For a comparative evaluation of the proposed approach, a biological similarity score based on a semantic similarity measure of GO terms is investigated. Further, the proposed technique is evaluated against 7 existing gene selection techniques using two-class annotated breast cancer subtype datasets.
CONCLUSION The use of this method can lead to the discovery of informative genes. Furthermore, under a multiple criteria decision-making setup, the informative genes selected by In Silico Markers are found to be superior to the genes selected by the compared methods.
Affiliation(s)
- Shib Sankar Bhowmick
  - Department of Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, 700107, India
- Debotosh Bhattacharjee
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Luis Rato
  - Department of Informatics, University of Evora, 7004-516, Evora, Portugal
11
Emura T, Matsui S, Chen HY. compound.Cox: Univariate feature selection and compound covariate for predicting survival. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019; 168:21-37. [PMID: 30527130 DOI: 10.1016/j.cmpb.2018.10.020] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 03/28/2018] [Revised: 09/26/2018] [Accepted: 10/26/2018] [Indexed: 05/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Univariate feature selection is one of the simplest and most commonly used techniques to develop a multigene predictor for survival. Presently, there is no software tailored to perform univariate feature selection and predictor construction.
METHODS We develop the compound.Cox R package, which implements univariate significance tests (via Wald tests or score tests) for feature selection. We provide a cross-validation algorithm to measure the predictive capability of selected genes and a permutation algorithm to assess the false discovery rate. We also provide three algorithms for constructing a multigene predictor (compound covariate, compound shrinkage, and copula-based methods), which are tailored to the subset of genes obtained from univariate feature selection. We demonstrate our package using survival data on lung cancer patients. We examine the predictive capability of the developed algorithms using the lung cancer data and simulated data.
RESULTS The developed R package, compound.Cox, is available on the CRAN repository. The statistical tools in compound.Cox allow researchers to determine an optimal significance level for the tests, thus providing researchers with an optimal subset of genes for prediction. The package also allows researchers to compute the false discovery rate and to apply various prediction algorithms.
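The compound covariate idea named in the abstract above can be sketched in a few lines: genes that pass a univariate significance threshold are combined into one score, each weighted by its own univariate regression coefficient. The sketch below is in Python rather than R and does not use or reproduce the compound.Cox API; the function name, gene names, coefficients, and p-values are all hypothetical, and in practice the coefficients would come from univariate Cox regressions.

```python
def compound_covariate(expression, coefficients, p_values, alpha=0.05):
    """Sum coefficient * expression over genes whose univariate p-value is below alpha."""
    selected = [gene for gene, p in p_values.items() if p < alpha]
    score = sum(coefficients[gene] * expression[gene] for gene in selected)
    return selected, score

# Hypothetical univariate results for three genes (not taken from the paper).
coefficients = {"g1": 0.8, "g2": -0.5, "g3": 0.1}
p_values = {"g1": 0.001, "g2": 0.02, "g3": 0.40}

# Hypothetical standardized expression values for one patient.
patient = {"g1": 1.5, "g2": 2.0, "g3": -1.0}

selected, score = compound_covariate(patient, coefficients, p_values)
```

Here `g3` is excluded by the threshold, so the score is built from `g1` and `g2` only; a higher score would correspond to a higher predicted risk under this weighting.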
Affiliation(s)
- Takeshi Emura
  - Graduate Institute of Statistics, National Central University, Zhongda Road, Zhongli District, Taoyuan 32001, Taiwan
- Shigeyuki Matsui
  - Department of Biostatistics, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya, 466-8550, Japan
- Hsuan-Yu Chen
  - Institute of Statistical Science, Academia Sinica, 128 Academia Road Sec.2, Nankang Taipei 115, Taiwan
12
Wu HC, Wei XG, Chan SC. Novel Consensus Gene Selection Criteria for Distributed GPU Partial Least Squares-Based Gene Microarray Analysis in Diffused Large B Cell Lymphoma (DLBCL) and Related Findings. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:2039-2052. [PMID: 28991749 DOI: 10.1109/tcbb.2017.2760827] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/07/2023]
Abstract
This paper proposes novel consensus gene selection criteria for partial least squares-based gene microarray analysis. By quantifying the consistency and distinctiveness of the differential gene expressions across different double cross validations (CV) or randomizations in terms of occurrence and randomization p-values, the proposed criteria are able to identify a more comprehensive set of genes associated with the underlying disease. A distributed GPU implementation is proposed to accelerate gene selection, and a speedup of about 8-11 times has been achieved on the microarray datasets considered. Simulation results using various cancer gene microarray datasets show that the proposed approach achieves classification accuracy highly comparable with that of many conventional approaches. Furthermore, enrichment analysis of the selected genes for the Diffused Large B Cell Lymphoma (DLBCL) and Prostate Cancer datasets shows that only the proposed approach is able to identify gene lists enriched in different pathways with significant p-values. In contrast, sufficient statistical significance cannot be found for conventional SVM-RFE and the t-test. Its reliability in identifying gene findings and establishing their statistical significance makes the proposed approach an attractive alternative for cancer-related research based on gene expression profiling or other similar data.
13
Statistical approach for selection of biologically informative genes. Gene 2018; 655:71-83. [PMID: 29458166 DOI: 10.1016/j.gene.2018.02.044] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/24/2017] [Revised: 11/26/2017] [Accepted: 02/14/2018] [Indexed: 11/23/2022]
Abstract
Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either relevancy or redundancy measures. Further, the performance of these techniques has been judged through post-selection classification accuracy computed by a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biologically sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique against 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes which are more biologically relevant. The proposed technique is also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that under the multiple criteria decision-making setup, the proposed technique is the best of the available alternatives for informative gene selection. Based on the proposed approach, an R package, i.e. BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR. This study will provide a practical guide for choosing statistical techniques to select informative genes from high dimensional expression data for breeding and systems biology studies.
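The maximum relevance and minimum redundancy principle behind the approach above can be illustrated with a plain greedy selection. This is a generic sketch, not the bootstrap-based Boot-MRMR procedure or the BootMRMR package: it assumes absolute Pearson correlation as a stand-in for both the relevance and the redundancy measure, and the gene names and expression values are toy data invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def mrmr(features, target, n_select):
    """Greedily pick genes that maximize relevance minus mean redundancy."""
    relevance = {g: abs(pearson(v, target)) for g, v in features.items()}
    selected = []
    while len(selected) < min(n_select, len(features)):
        def score(gene):
            if not selected:
                return relevance[gene]
            redundancy = sum(
                abs(pearson(features[gene], features[s])) for s in selected
            ) / len(selected)
            return relevance[gene] - redundancy
        best = max((g for g in features if g not in selected), key=score)
        selected.append(best)
    return selected

# Toy data: class labels and three hypothetical genes.
target = [0, 0, 0, 1, 1, 1]
features = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # strongly tracks the class label
    "geneB": [1.0, 1.4, 0.8, 3.3, 2.7, 3.0],  # close to geneA, so largely redundant
    "geneC": [2.0, 1.0, 2.0, 2.5, 3.5, 2.5],  # weaker signal, less redundant
}
print(mrmr(features, target, n_select=2))  # prints ['geneA', 'geneC']
```

Although `geneB` is more relevant than `geneC` on its own, its high correlation with the already selected `geneA` lowers its score, which is the behaviour the redundancy term is meant to produce.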
14
Hameed SS, Hassan R, Muhammad FF. Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm. PLoS One 2017; 12:e0187371. [PMID: 29095904 PMCID: PMC5667738 DOI: 10.1371/journal.pone.0187371] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 07/23/2017] [Accepted: 10/18/2017] [Indexed: 11/30/2022] Open
Abstract
In this work, gene expression in autism spectrum disorder (ASD) is analyzed with the goal of selecting the most attributed genes and performing classification. The objective was achieved by utilizing a combination of various statistical filters and a wrapper-based geometric binary particle swarm optimization-support vector machine (GBPSO-SVM) algorithm. The utilization of different filters was accentuated by incorporating a mean and median ratio criterion to remove very similar genes. The results showed that the most discriminative genes that were identified in the first and last selection steps included the presence of a repetitive gene (CAPS2), which was assigned as the gene most highly related to ASD risk. The merged gene subset that was selected by the GBPSO-SVM algorithm was able to enhance the classification accuracy.
Affiliation(s)
- Shilan S. Hameed
  - Department of Computer Science, Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
  - Department of Software and Informatics Engineering, College of Engineering, Salahaddin University, Erbil, Kurdistan Region, Iraq
- Rohayanti Hassan
  - Department of Software Engineering, Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia
- Fahmi F. Muhammad
  - Department of Physics, Faculty of Science & Health, Koya University, Koya, Kurdistan Region, Iraq
15
Alexe G, Dalgin G, Ramaswamy R, Delisi C, Bhanot G. Data Perturbation Independent Diagnosis and Validation of Breast Cancer Subtypes Using Clustering and Patterns. Cancer Inform 2017. [DOI: 10.1177/117693510600200006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 12/23/2022] Open
Abstract
Molecular stratification of disease based on expression levels of sets of genes can help guide therapeutic decisions if such classifications can be shown to be stable against variations in sample source and data perturbation. Classifications inferred from one set of samples in one lab should be able to consistently stratify a different set of samples in another lab. We present a method for assessing such stability and apply it to the breast cancer (BCA) datasets of Sorlie et al. 2003 and Ma et al. 2003. We find that, within the now commonly accepted BCA categories identified by Sorlie et al., Luminal A and Basal are robust, but Luminal B and ERBB2+ are not. In particular, 36% of the samples identified as Luminal B and 55% identified as ERBB2+ cannot be assigned an accurate category because the classification is sensitive to data perturbation. We identify a “core cluster” of samples for each category, and from these we determine “patterns” of gene expression that distinguish the core clusters from each other. We find that the best markers for Luminal A and Basal are (ESR1, LIV1, GATA-3) and (CCNE1, LAD1, KRT5), respectively. Pathways enriched in the patterns regulate apoptosis, tissue remodeling and the immune response. We use a different dataset (Ma et al. 2003) to test the accuracy with which samples can be allocated to the four disease subtypes. We find, as expected, that the classification of samples identified as Luminal A and Basal is robust but classification into the other two subtypes is not.
Affiliation(s)
- G. Alexe
  - Computational Biology Center, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A
  - The Simons Center for Systems Biology, Institute for Advanced Study, Princeton NJ 08540, U.S.A
- G.S. Dalgin
  - Molecular Biology, Cell Biology and Biochemistry Program, Boston University, 2 Cummington Street, Boston, MA 02215, U.S.A
- R. Ramaswamy
  - The Simons Center for Systems Biology, Institute for Advanced Study, Princeton NJ 08540, U.S.A
  - School of Information Technology, Jawaharlal Nehru University, New Delhi 110 067, India
- C. Delisi
  - Biomedical Engineering, Boston University, 44 Cummington Street, Boston, MA 02215, U.S.A
- G. Bhanot
  - Computational Biology Center, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A
  - The Simons Center for Systems Biology, Institute for Advanced Study, Princeton NJ 08540, U.S.A
  - Biomedical Engineering, Boston University, 44 Cummington Street, Boston, MA 02215, U.S.A
  - Department of Biomedical Engineering and BioMaPS Institute, Rutgers University, Piscataway, NJ 08854
16
Damon C, Luck M, Toullec L, Etienne I, Buchler M, Hurault de Ligny B, Choukroun G, Thierry A, Vigneau C, Moulin B, Heng AE, Subra JF, Legendre C, Monnot A, Yartseva A, Bateson M, Laurent-Puig P, Anglicheau D, Beaune P, Loriot MA, Thervet E, Pallet N. Predictive Modeling of Tacrolimus Dose Requirement Based on High-Throughput Genetic Screening. Am J Transplant 2017; 17:1008-1019. [PMID: 27597269 DOI: 10.1111/ajt.14040] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 06/29/2016] [Revised: 08/24/2016] [Accepted: 08/26/2016] [Indexed: 01/25/2023]
Abstract
Any biochemical reaction underlying drug metabolism depends on individual gene-drug interactions and on groups of genes interacting together. Based on a high-throughput genetic approach, we sought to identify a set of covariant single-nucleotide polymorphisms predictive of interindividual tacrolimus (Tac) dose requirement variability. Tac blood concentrations (Tac C0) of 229 kidney transplant recipients were repeatedly monitored after transplantation over 3 mo. Given the high dimension of the genomic data in comparison to the low number of observations and the high multicollinearity among the variables (gene variants), we developed an original predictive approach that integrates an ensemble variable-selection strategy to reinforce the stability of the variable-selection process and multivariate modeling. Our predictive models explained up to 70% of total variability in Tac C0 per dose with a maximum of 44 gene variants (p-value <0.001 with a permutation test). These models included molecular networks of drug metabolism with oxidoreductase activities and the multidrug-resistant ABCC8 transporter, which was found in the most stringent model. Finally, we identified an intronic variant of the gene encoding SLC28A3, a drug transporter, as a key gene involved in Tac metabolism, and we confirmed it in an independent validation cohort.
Affiliation(s)
- C Damon
- Hypercube Institute, Paris, France
| | - M Luck
- Hypercube Institute, Paris, France.,Paris Descartes University, Paris, France
| | - L Toullec
- Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
| | - I Etienne
- Department of Nephrology, Rouen University Hospital, Rouen, France
| | - M Buchler
- Department of Nephrology, Tours University Hospital, Tours, France
| | | | - G Choukroun
- Department of Nephrology, Amiens University Hospital, Amiens, France
| | - A Thierry
- Department of Nephrology, Poitiers University Hospital, Poitiers, France
| | - C Vigneau
- Department of Nephrology, Rennes University Hospital, Rennes, France
| | - B Moulin
- Department of Nephrology, Strasbourg University Hospital, Strasbourg, France
| | - A-E Heng
- Department of Nephrology, Clermont-Ferrand University Hospital, Clermont-Ferrand, France
| | - J-F Subra
- Department of Nephrology, Angers University Hospital, Angers, France
| | - C Legendre
- Department of Nephrology, Necker Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
| | - A Monnot
- Hypercube Institute, Paris, France
| | | | | | - P Laurent-Puig
- Paris Descartes University, Paris, France.,Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France.,Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
| | - D Anglicheau
- Department of Nephrology, Necker Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
| | - P Beaune
- Paris Descartes University, Paris, France.,Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France.,Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
| | - M A Loriot
- Paris Descartes University, Paris, France.,Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France.,Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
| | - E Thervet
- Paris Descartes University, Paris, France.,Department of Nephrology, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
| | - N Pallet
- Paris Descartes University, Paris, France.,Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France.,Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France.,Department of Nephrology, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
|
17
|
A Meta-Review of Feature Selection Techniques in the Context of Microarray Data. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2017. [DOI: 10.1007/978-3-319-56148-6_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
|
18
|
Bari MG, Salekin S, Zhang JM. A Robust and Efficient Feature Selection Algorithm for Microarray Data. Mol Inform 2016; 36. [PMID: 28000384 DOI: 10.1002/minf.201600099] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2016] [Accepted: 11/21/2016] [Indexed: 12/20/2022]
Abstract
In the past decades, a few synergistic feature selection algorithms have been published, including Cooperative Index (CI) and K-Top Scoring Pair (k-TSP). These algorithms consider the synergistic behavior of features when they are included in a feature panel. Although promising results have been shown for these algorithms, there is a lack of a comprehensive and fair comparison with other feature selection algorithms across a large number of microarray datasets in terms of classification accuracy and computational complexity. There is a need to evaluate their performance and reduce the complexity of such algorithms. We compared the performance of synergistic feature selection algorithms with 11 other commonly used algorithms based on 22 microarray gene expression binary class datasets. The evaluation confirms that synergistic algorithms such as CI and k-TSP will gradually increase the classification performance as more features are used in the classifiers. Also, in order to cut down computational cost, we proposed a new feature selection ranking score called Positive Synergy Index (PSI). Testing results show that features selected using PSI as well as synergistic feature selection algorithms provide better performance compared with all other methods, while PSI has a computational complexity significantly lower than that of other synergistic algorithms.
Affiliation(s)
- Mehrab Ghanat Bari
- Dept. of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, 55905
| | - Sirajul Salekin
- Dept. of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249
| | - Jianqiu Michelle Zhang
- Dept. of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, 78249
|
19
|
Sardana M, Agrawal R, Kaur B. A hybrid of clustering and quantum genetic algorithm for relevant genes selection for cancer microarray data. INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED AND INTELLIGENT ENGINEERING SYSTEMS 2016. [DOI: 10.3233/kes-160341] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
| | - R.K. Agrawal
- School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi, India
| | - Baljeet Kaur
- Hansraj College, University of Delhi, Delhi, India
|
20
|
Tabakhi S, Najafi A, Ranjbar R, Moradi P. Gene selection for microarray data classification using a novel ant colony optimization. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.05.022] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
21
|
Boulesteix AL, Hable R, Lauer S, Eugster MJA. A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies. AM STAT 2015. [DOI: 10.1080/00031305.2015.1005128] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
|
22
|
Drotár P, Gazda J, Smékal Z. An experimental comparison of feature selection methods on two-class biomedical datasets. Comput Biol Med 2015; 66:1-10. [PMID: 26327447 DOI: 10.1016/j.compbiomed.2015.08.010] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2015] [Revised: 08/05/2015] [Accepted: 08/12/2015] [Indexed: 11/30/2022]
Abstract
Feature selection is a significant part of many machine learning applications dealing with small-sample and high-dimensional data. Choosing the most important features is an essential step for knowledge discovery in many areas of biomedical informatics. The increased popularity of feature selection methods and their frequent utilisation raise challenging new questions about the interpretability and stability of feature selection techniques. In this study, we compared the behaviour of ten state-of-the-art filter methods for feature selection in terms of their stability, similarity, and influence on prediction performance. All of the experiments were conducted on eight two-class datasets from biomedical areas. While entropy-based feature selection appears to be the most stable, the feature selection techniques yielding the highest prediction performance are the minimum redundancy maximum relevance method and feature selection based on Bhattacharyya distance. In general, univariate feature selection techniques perform similarly to or even better than more complex multivariate feature selection techniques with high-dimensional datasets. However, with more complex and smaller datasets multivariate methods slightly outperform univariate techniques.
Affiliation(s)
- P Drotár
- Department of Telecommunications, Brno University of Technology, Technická 12, 61200 Brno, Czech Republic.
| | - J Gazda
- Department of Computers and Informatics, Technical University of Kosice, Letna 9, 0401 Kosice, Slovakia
| | - Z Smékal
- Department of Telecommunications, Brno University of Technology, Technická 12, 61200 Brno, Czech Republic
|
23
|
Hemphill E, Lindsay J, Lee C, Măndoiu II, Nelson CE. Feature selection and classifier performance on diverse biological datasets. BMC Bioinformatics 2014; 15 Suppl 13:S4. [PMID: 25434802 PMCID: PMC4248652 DOI: 10.1186/1471-2105-15-s13-s4] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Background There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore the model's performance across multiple biological data types. Results This paper presents the results of a large scale empirical study whereby a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness. Conclusions As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data. Interestingly, as the number of selected biomarkers increases best performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. 
It is observed that one class of feature selection and classifier are consistently top performers across data types and number of markers, suggesting that well performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.
|
24
|
Wang X. Identification of Marker Genes for Cancer Based on Microarrays Using a Computational Biology Approach. Curr Bioinform 2014; 9:140-146. [PMID: 24683388 DOI: 10.2174/1574893608999140109115649] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Rapid advances in gene expression microarray technology have enabled the discovery of molecular markers used for cancer diagnosis, prognosis, and prediction. One computational challenge with using microarray data analysis to create cancer classifiers is how to effectively deal with microarray data which are composed of high-dimensional attributes (p) and low-dimensional instances (n). Gene selection and classifier construction are the two key issues in this topic. In this article, we reviewed major methods for computational identification of cancer marker genes. We concluded that simple methods should be preferred to complicated ones for their interpretability and applicability.
Affiliation(s)
- Xiaosheng Wang
- Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, MD 20852, U.S.A
|
25
|
Leiva R, Roy A. Classification of Higher-order Data with Separable Covariance and Structured Multiplicative or Additive Mean Models. COMMUN STAT-THEOR M 2014. [DOI: 10.1080/03610926.2013.841931] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
26
|
A comparative analysis of biomarker selection techniques. BIOMED RESEARCH INTERNATIONAL 2013; 2013:387673. [PMID: 24324960 PMCID: PMC3842054 DOI: 10.1155/2013/387673] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/23/2013] [Revised: 09/22/2013] [Accepted: 09/23/2013] [Indexed: 11/17/2022]
Abstract
Feature selection has become an essential step in biomarker discovery from high-dimensional genomics data. It is recognized that different feature selection techniques may result in different sets of biomarkers, that is, different groups of genes highly correlated to a given pathological condition, but few direct comparisons exist which quantify these differences in a systematic way. In this paper, we propose a general methodology for comparing the outcomes of different selection techniques in the context of biomarker discovery. The comparison is carried out along two dimensions: (i) measuring the similarity/dissimilarity of selected gene sets; (ii) evaluating the implications of these differences in terms of both predictive performance and stability of selected gene sets. As a case study, we considered three benchmarks deriving from DNA microarray experiments and conducted a comparative analysis among eight selection methods, representatives of different classes of feature selection techniques. Our results show that the proposed approach can provide useful insight about the pattern of agreement of biomarker discovery techniques.
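As an illustration of dimension (i) above, the sketch below averages a pairwise set similarity over the gene lists returned by several selection methods. The abstract does not say which similarity measure the authors use; the Jaccard index, the function names, and the toy gene identifiers here are assumptions chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of gene identifiers."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(selections):
    """Average Jaccard similarity over all pairs of selected gene sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two gene sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-4 gene lists from three selection methods (made-up IDs).
selections = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6", "g7"},
]
score = mean_pairwise_similarity(selections)
```

A value near 1 would indicate that the methods largely agree on the selected genes; a value near 0 would indicate nearly disjoint selections.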
|
27
|
Genomic biomarkers for personalized medicine: development and validation in clinical studies. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:865980. [PMID: 23690882 PMCID: PMC3652056 DOI: 10.1155/2013/865980] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/26/2013] [Accepted: 03/22/2013] [Indexed: 12/26/2022]
Abstract
The establishment of high-throughput technologies has brought substantial advances to our understanding of the biology of many diseases at the molecular level and increasing expectations on the development of innovative molecularly targeted treatments and molecular biomarkers or diagnostic tests in the context of clinical studies. In this review article, we position the two critical statistical analyses of high-dimensional genomic data, gene screening and prediction, in the framework of development and validation of genomic biomarkers or signatures, through taking into consideration the possible different strategies for developing genomic signatures. A wide variety of biomarker-based clinical trial designs to assess clinical utility of a biomarker or a new treatment with a companion biomarker are also discussed.
|
28
|
Abdel Samee NM, Solouma NH, Kadah YM. Detection of biomarkers for hepatocellular carcinoma using a hybrid univariate gene selection methods. Theor Biol Med Model 2012; 9:34. [PMID: 22867264 PMCID: PMC3570375 DOI: 10.1186/1742-4682-9-34] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2012] [Accepted: 07/03/2012] [Indexed: 05/26/2023] Open
Abstract
Background Discovering new biomarkers has a great role in improving early diagnosis of Hepatocellular carcinoma (HCC). The experimental determination of biomarkers needs a lot of time and money. This motivates this work to use in-silico prediction of biomarkers to reduce the number of experiments required for detecting new ones. This is achieved by extracting the most representative genes in microarrays of HCC. Results In this work, we provide a method for extracting the differentially expressed genes, up regulated ones, that can be considered candidate biomarkers in high throughput microarrays of HCC. We examine the power of several gene selection methods (such as Pearson’s correlation coefficient, Cosine coefficient, Euclidean distance, Mutual information and Entropy with different estimators) in selecting informative genes. A biological interpretation of the highly ranked genes is done using KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, ENTREZ and DAVID (Database for Annotation, Visualization, and Integrated Discovery) databases. The top ten genes selected using Pearson’s correlation coefficient and Cosine coefficient contained six genes that have been implicated in cancer (often multiple cancers) genesis in previous studies. Fewer genes were obtained by the other methods (4 genes using Mutual information, 3 genes using Euclidean distance and only one gene using Entropy). A better result was obtained by the utilization of a hybrid approach based on intersecting the highly ranked genes in the output of all investigated methods. This hybrid combination yielded seven genes (2 genes for HCC and 5 genes in different types of cancer) in the top ten genes of the list of intersected genes. Conclusions To strengthen the effectiveness of the univariate selection methods, we propose a hybrid approach by intersecting several of these methods in a cascaded manner. This approach surpasses all of the univariate selection methods when used individually according to biological interpretation and the examination of gene expression signal profiles.
Affiliation(s)
- Nagwan M Abdel Samee
- Computer Engineering Department, Misr University for Science and Technology, Giza, Egypt.
|
29
|
Siebourg J, Merdes G, Misselwitz B, Hardt WD, Beerenwinkel N. Stability of gene rankings from RNAi screens. Bioinformatics 2012; 28:1612-8. [PMID: 22513992 DOI: 10.1093/bioinformatics/bts192] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
Abstract
MOTIVATION Genome-wide RNA interference (RNAi) experiments are becoming a widely used approach for identifying intracellular molecular pathways of specific functions. However, detecting all relevant genes involved in a biological process is challenging, because typically only a few samples per gene knock-down are available and readouts tend to be very noisy. We investigate the reliability of top scoring hit lists obtained from RNAi screens, compare the performance of different ranking methods, and propose a new ranking method to improve the reproducibility of gene selection. RESULTS The performance of different ranking methods is assessed by the size of the stable sets they produce, i.e. the subsets of genes which are estimated to be re-selected with high probability in independent validation experiments. Using stability selection, we also define a new ranking method, called stability ranking, to improve the stability of any given base ranking method. Ranking methods based on mean, median, t-test and rank-sum test, and their stability-augmented counterparts are compared in simulation studies and on three microscopy image RNAi datasets. We find that the rank-sum test offers the most favorable trade-off between ranking stability and accuracy and that stability ranking improves the reproducibility of all and the accuracy of several ranking methods. AVAILABILITY Stability ranking is freely available as the R/Bioconductor package staRank at http://www.cbg.ethz.ch/software/staRank.
Affiliation(s)
- Juliane Siebourg
- Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
|
30
|
Tapia E, Bulacio P, Angelone L. Sparse and stable gene selection with consensus SVM-RFE. Pattern Recognit Lett 2012. [DOI: 10.1016/j.patrec.2011.09.031] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
31
|
Haury AC, Gestraud P, Vert JP. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 2011; 6:e28210. [PMID: 22205940 PMCID: PMC3244389 DOI: 10.1371/journal.pone.0028210] [Citation(s) in RCA: 159] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2011] [Accepted: 11/03/2011] [Indexed: 01/08/2023] Open
Abstract
Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. In this study we compare feature selection methods on public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Surprisingly, complex wrapper and embedded methods generally do not outperform simple univariate feature selection methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results.
Affiliation(s)
- Anne-Claire Haury
- Mines ParisTech, Centre for Computational Biology, Fontainebleau, France.
|
32
|
Robust two-gene classifiers for cancer prediction. Genomics 2011; 99:90-5. [PMID: 22138042 DOI: 10.1016/j.ygeno.2011.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2011] [Revised: 11/04/2011] [Accepted: 11/09/2011] [Indexed: 11/23/2022]
Abstract
Two-gene classifiers have attracted broad interest for their simplicity and practicality. Most existing two-gene classification algorithms involve an exhaustive search, which makes them slow. In this study, we proposed two new two-gene classification algorithms which used a simple univariate gene selection strategy and constructed simple classification rules based on optimal cut-points for the two selected genes. We detected the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of some established two-gene classification models like the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of the compared models.
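The entropy-based cut-point mentioned in the abstract can be illustrated for a single gene as follows. This is a minimal sketch of one common reading of the idea (choose the threshold that minimises the size-weighted class entropy of the two resulting groups); the abstract does not give the exact criterion, and the function names and toy data are assumptions.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) minimising class entropy after the split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal expression values
        cut = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best[1]:
            best = (cut, h)
    return best


# Made-up expression values for one gene in six samples.
values = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
cut, h = best_cut_point(values, labels)
```

A two-gene rule could then combine the cut-points found for each of the two selected genes, though how the published classifiers combine them is not stated in the abstract.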
|
33
|
Wang X, Simon R. Microarray-based cancer prediction using single genes. BMC Bioinformatics 2011; 12:391. [PMID: 21982331 PMCID: PMC3228540 DOI: 10.1186/1471-2105-12-391] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2011] [Accepted: 10/07/2011] [Indexed: 11/23/2022] Open
Abstract
Background Although numerous methods of using microarray data analysis for cancer classification have been proposed, most utilize many genes to achieve accurate classification. This can hamper interpretability of the models and ease of translation to other assay platforms. We explored the use of single genes to construct classification models. We first identified the genes with the most powerful univariate class discrimination ability and then constructed simple classification rules for class prediction using the single genes. Results We applied our model development algorithm to eleven cancer gene expression datasets and compared classification accuracy to that for standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. The single gene classifiers provided classification accuracy comparable to or better than those obtained by existing methods in most cases. We analyzed the factors that determined when simple single gene classification is effective and when more complex modeling is warranted. Conclusions For most of the datasets examined, the single-gene classification methods appear to work as well as more standard methods, suggesting that simple models could perform well in microarray-based cancer prediction.
Affiliation(s)
- Xiaosheng Wang
- Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, MD 20852, USA
|
34
|
Shi P, Ray S, Zhu Q, Kon MA. Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction. BMC Bioinformatics 2011; 12:375. [PMID: 21939564 PMCID: PMC3223741 DOI: 10.1186/1471-2105-12-375] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2010] [Accepted: 09/23/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers. RESULTS We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. 
In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets. CONCLUSIONS The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
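The "relative expression ordering of gene pairs" behind top scoring pairs can be sketched as below: each pair (i, j) is scored by how differently often gene i is expressed below gene j in the two classes. This is a simplified illustration with hypothetical function names and toy data; it omits tie-breaking and the downstream classifier (voting or SVM) described in the abstract.

```python
from itertools import combinations


def tsp_scores(X, y):
    """Score each gene pair (i, j) by |P(x_i < x_j | y=0) - P(x_i < x_j | y=1)|.

    X is a list of samples (each a list of expression values); y holds 0/1 labels.
    """
    n_genes = len(X[0])
    groups = {
        0: [x for x, lab in zip(X, y) if lab == 0],
        1: [x for x, lab in zip(X, y) if lab == 1],
    }
    if not groups[0] or not groups[1]:
        raise ValueError("both classes must be present")
    scores = {}
    for i, j in combinations(range(n_genes), 2):
        # Fraction of samples in each class where gene i is below gene j.
        p = {c: sum(s[i] < s[j] for s in g) / len(g) for c, g in groups.items()}
        scores[(i, j)] = abs(p[0] - p[1])
    return scores


def top_pairs(X, y, k=1):
    """Return the k highest-scoring gene pairs."""
    scores = tsp_scores(X, y)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: gene 0 < gene 1 in class 0, reversed in class 1; gene 2 is noise.
X = [[1, 5, 3], [2, 6, 1], [1, 4, 9], [7, 2, 3], [8, 3, 1], [9, 1, 9]]
y = [0, 0, 0, 1, 1, 1]
best = top_pairs(X, y, k=1)
```

In a hybrid scheme of the kind the abstract describes, the genes in the selected pairs would then be passed as features to another classifier.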
Affiliation(s)
- Ping Shi
- Harvard Medical School and Harvard Pilgrim Healthcare Institute, Boston, MA 02215, USA.
|
35
|
Muselli M, Bertoni A, Frasca M, Beghini A, Ruffino F, Valentini G. A mathematical model for the validation of gene selection methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1385-1392. [PMID: 21778526 DOI: 10.1109/tcbb.2010.83] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
Gene selection methods aim at determining biologically relevant subsets of genes in DNA microarray experiments. However, their assessment and validation represent a major difficulty since the subset of biologically relevant genes is usually unknown. To solve this problem a novel procedure for generating biologically plausible synthetic gene expression data is proposed. It is based on a proper mathematical model representing gene expression signatures and expression profiles through Boolean threshold functions. The results show that the proposed procedure can be successfully adopted to analyze the quality of statistical and machine learning-based gene selection algorithms.
|
36
|
|
37
|
Abstract
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for assessment of probabilistic classifiers: well-calibratedness and refinement and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not "anticonservative" using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set.
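To make the notion of calibration concrete, the sketch below computes two generic diagnostics for predicted class probabilities: the Brier score and a binned comparison of predicted probability against observed frequency. These are standard measures, not the evaluation measures developed in the paper, which the abstract does not specify; all names and data are illustrative assumptions.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def calibration_table(probs, outcomes, n_bins=5):
    """Per-bin (mean predicted probability, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))
    table = []
    for b in bins:
        if b:  # skip empty bins
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(o for _, o in b) / len(b)
            table.append((mean_p, freq, len(b)))
    return table


# Made-up predictions for four samples and their true 0/1 classes.
probs = [0.1, 0.1, 0.9, 0.9]
outcomes = [0, 0, 1, 1]
score = brier_score(probs, outcomes)
table = calibration_table(probs, outcomes)
```

For a well-calibrated classifier, the mean predicted probability and the observed frequency in each bin should be close; with small samples, such estimates would normally be computed on held-out or cross-validated predictions.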
Affiliation(s)
- Kyung In Kim
- Biometric Research Branch, National Cancer Institute, 9000 Rockville Pike, MSC 7434, Bethesda, MD 20892-7434, USA
|
38
|
Mi Z, Shen K, Song N, Cheng C, Song C, Kaminski N, Tseng GC. Module-based prediction approach for robust inter-study predictions in microarray data. Bioinformatics 2010; 26:2586-93. [PMID: 20719761 DOI: 10.1093/bioinformatics/btq472] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Traditional genomic prediction models based on individual genes suffer from low reproducibility across microarray studies due to the lack of robustness to expression measurement noise and gene missingness when they are matched across platforms. It is common that some of the genes in the prediction model established in a training study cannot be matched to another test study because a different platform is applied. The failure of inter-study predictions has severely hindered the clinical applications of microarray. To overcome the drawbacks of traditional gene-based prediction (GBP) models, we propose a module-based prediction (MBP) strategy via unsupervised gene clustering. RESULTS K-means clustering is used to group genes sharing similar expression profiles into gene modules, and small modules are merged into their nearest neighbors. A conventional univariate or multivariate feature selection procedure is applied and a representative gene from each selected module is identified to construct the final prediction model. As a result, the prediction model is portable to any test study as long as some of the genes in each module exist in the test study. We demonstrate that K-means cluster sizes generally follow a multinomial distribution and the failure probability of inter-study prediction due to missing genes is diminished by merging small clusters into their nearest neighbors. By simulation and applications to real datasets in inter-study predictions, we show that the proposed MBP provides slightly improved accuracy while being considerably more robust than traditional GBP. AVAILABILITY http://www.biostat.pitt.edu/bioinfo/ CONTACT ctseng@pitt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
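The two steps that make a module-based model portable, merging small modules into their nearest neighbors and choosing a representative gene that is actually measured in the test study, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the K-means step is taken as already done, and the names `merge_small_modules`, `pick_representative`, the `min_size` threshold, and the use of Euclidean distance between centroids are all hypothetical choices for this example.

```python
import math


def merge_small_modules(modules, centroids, min_size):
    """Merge every module with fewer than min_size genes into the module
    whose centroid is nearest (Euclidean distance).
    Assumes at least one module reaches min_size. Centroids are not
    recomputed after a merge, which is a simplification of this sketch."""
    large = [m for m, genes in modules.items() if len(genes) >= min_size]
    merged = {m: list(modules[m]) for m in large}
    for m, genes in modules.items():
        if m in merged:
            continue
        nearest = min(large, key=lambda k: math.dist(centroids[m], centroids[k]))
        merged[nearest].extend(genes)
    return merged


def pick_representative(ranked_genes, test_platform_genes):
    """Return the highest-ranked gene of a module that is also measured in
    the test study, or None if every gene of the module is missing."""
    for gene in ranked_genes:
        if gene in test_platform_genes:
            return gene
    return None


if __name__ == "__main__":
    # Toy modules and centroids standing in for a K-means result.
    modules = {"A": ["g1", "g2", "g3"], "B": ["g4", "g5", "g6"], "C": ["g7"]}
    centroids = {"A": (0.0, 0.0), "B": (10.0, 10.0), "C": (9.0, 9.0)}
    merged = merge_small_modules(modules, centroids, min_size=2)
    print(merged)
    print(pick_representative(["g4", "g7", "g5"], {"g5", "g7"}))
```

Larger modules leave more candidate genes to fall back on, which is why merging small clusters lowers the chance that a whole module is absent from the test platform.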
Affiliation(s)
- Zhibao Mi
- Cooperative Studies Program, VA Maryland Health Care System, Perry Point, MD 21902, USA
|
39
|
Abraham G, Kowalczyk A, Loi S, Haviv I, Zobel J. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics 2010; 11:277. [PMID: 20500821 PMCID: PMC2895626 DOI: 10.1186/1471-2105-11-277] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Received: 02/08/2010] [Accepted: 05/25/2010] [Indexed: 02/08/2023] Open
Abstract
Background Different microarray studies have compiled gene lists for predicting outcomes of a range of treatments and diseases. These have produced gene lists that have little overlap, indicating that the results from any one study are unstable. It has been suggested that the underlying pathways are essentially identical, and that the expression of gene sets, rather than that of individual genes, may be more informative with respect to prognosis and understanding of the underlying biological process. Results We sought to examine the stability of prognostic signatures based on gene sets rather than individual genes. We classified breast cancer cases from five microarray studies according to the risk of metastasis, using features derived from predefined gene sets. The expression levels of genes in the sets are aggregated, using what we call a set statistic. The resulting prognostic gene sets were as predictive as the lists of individual genes, but displayed more consistent rankings via bootstrap replications within datasets, produced more stable classifiers across different datasets, and are potentially more interpretable in the biological context since they examine gene expression in the context of their neighbouring genes in the pathway. In addition, we performed this analysis in each breast cancer molecular subtype, based on ER/HER2 status. The prognostic gene sets found in each subtype were consistent with the biology based on previous analysis of individual genes. Conclusions To date, most analyses of gene expression data have focused at the level of the individual genes. We show that a complementary approach of examining the data using predefined gene sets can reduce the noise and could provide increased insight into the underlying biological pathways.
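The idea of a set statistic, collapsing the expression levels of all genes in a predefined set into one feature per sample, can be illustrated with a short Python sketch. The abstract does not say which statistics were used, so the mean below is only an assumed default; the function name `set_statistic` and the dictionary-based data layout are likewise hypothetical.

```python
from statistics import mean


def set_statistic(expression, gene_sets, stat=mean):
    """Aggregate gene-level expression into one feature per gene set.

    expression: {gene: [value for each sample]}
    gene_sets:  {set_name: [gene, ...]}
    stat:       aggregation function (the mean is an assumption here)
    Genes that were not measured are skipped; sets with no measured
    genes are dropped."""
    n_samples = len(next(iter(expression.values())))
    features = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in expression]
        if not present:
            continue
        features[name] = [
            stat([expression[g][s] for g in present]) for s in range(n_samples)
        ]
    return features


if __name__ == "__main__":
    # Toy data: three genes measured in two samples.
    expression = {"a": [1.0, 3.0], "b": [3.0, 5.0], "c": [10.0, 10.0]}
    gene_sets = {"pathway1": ["a", "b", "not_measured"], "pathway2": ["absent"]}
    print(set_statistic(expression, gene_sets))
```

The resulting set-level features can then be passed to any classifier in place of the individual gene values.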
Affiliation(s)
- Gad Abraham
- Department of Computer Science and Software Engineering, The University of Melbourne, Parkville 3010, VIC, Australia
|
40
|
Abstract
PURPOSE OF REVIEW The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. This article reviews some recent applications of the many evolving 'omic technologies to organ transplantation. RECENT FINDINGS With the advancement of many high-throughput 'omic techniques such as genomics, metabolomics, antibiomics, peptidomics, and proteomics, efforts have been made to understand potential mechanisms of specific graft injuries and develop novel biomarkers for acute rejection, chronic rejection, and operational tolerance. SUMMARY The translation of potential biomarkers from the laboratory bench to the clinical bedside is not an easy task and will require the concerted effort of the immunologists, molecular biologists, transplantation specialists, geneticists, and experts in bioinformatics. Rigorous prospective validation studies will be needed using large sets of independent patient samples. The appropriate and timely exploitation of evolving 'omic technologies will lay the cornerstone for a new age of translational research for organ transplant monitoring.
|
41
|
|
42
|
Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, Hess KR, Valero V, Booser D, Delorenzi M, Hortobagyi GN, Shi L, Symmans WF, Pusztai L. Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res 2010; 12:R5. [PMID: 20064235 PMCID: PMC2880423 DOI: 10.1186/bcr2468] [Citation(s) in RCA: 146] [Impact Index Per Article: 10.4] [Received: 10/14/2009] [Revised: 12/18/2009] [Accepted: 01/11/2010] [Indexed: 12/31/2022] Open
Abstract
Introduction As part of the MicroArray Quality Control (MAQC)-II project, this analysis examines how the choice of univariate feature-selection methods and classification algorithms may influence the performance of genomic predictors under varying degrees of prediction difficulty represented by three clinically relevant endpoints. Methods We used gene-expression data from 230 breast cancers (grouped into training and independent validation sets), and we examined 40 predictors (five univariate feature-selection methods combined with eight different classifiers) for each of the three endpoints. Their classification performance was estimated on the training set by using two different resampling methods and compared with the accuracy observed in the independent validation set. Results A ranking of the three classification problems was obtained, and the performance of 120 models was estimated and assessed on an independent validation set. The bootstrapping estimates were closer to the validation performance than were the cross-validation estimates. The required sample size for each endpoint was estimated, and both gene-level and pathway-level analyses were performed on the obtained models. Conclusions We showed that genomic predictor accuracy is determined largely by an interplay between sample size and classification difficulty. Variations on univariate feature-selection methods and choice of classification algorithm have only a modest impact on predictor performance, and several statistically equally good predictors can be developed for any given classification problem.
Affiliation(s)
- Vlad Popovici
- Bioinformatics Core Facility, Swiss Institute of Bioinformatics, Génopode Building, Quartier Sorge, Lausanne CH-1015, Switzerland
|
43
|
Leung Y, Hung Y. A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010; 7:108-117. [PMID: 20150673 DOI: 10.1109/tcbb.2008.46] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 05/28/2023]
Abstract
Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computation burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on the choice of specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of MFMW-selected genes have been confirmed to be biomarkers or contribute to the development of particular cancers by other studies.
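To make the filter/wrapper distinction concrete, the sketch below chains one filter and one wrapper, which corresponds to the single-filter-single-wrapper (SFSW) baseline rather than the proposed MFMW method. Every specific here is an assumption for illustration: the filter score (absolute difference in class means), the nearest-centroid classifier, leave-one-out accuracy as the wrapper criterion, and the function names. Because the wrapper scores subsets on the same samples it selects from, the returned accuracy is optimistic and would need independent validation.

```python
from statistics import mean


def filter_rank(X, y, top_k):
    """Filter step: rank features by absolute difference in class means
    (a deliberately simple score chosen for this sketch)."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, lab in zip(X, y) if lab == 1]
        neg = [row[j] for row, lab in zip(X, y) if lab == 0]
        scores.append((abs(mean(pos) - mean(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:top_k]]


def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on feats."""
    correct = 0
    for i in range(len(X)):
        train = [(r, l) for k, (r, l) in enumerate(zip(X, y)) if k != i]
        dist = {}
        for c in (0, 1):
            rows = [r for r, l in train if l == c]
            centroid = [mean(r[j] for r in rows) for j in feats]
            dist[c] = sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)


def wrapper_forward(X, y, candidates):
    """Wrapper step: greedy forward selection over the filtered candidates,
    adding a feature only while leave-one-out accuracy strictly improves."""
    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            acc = loo_accuracy(X, y, selected + [j])
            if acc > best:
                best, choice, improved = acc, j, True
        if improved:
            selected.append(choice)
    return selected, best


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, 1 is noise, 2 is constant.
    X = [[0.0, 5.0, 1.0], [0.2, 1.0, 1.0], [0.1, 3.0, 1.0],
         [1.0, 4.0, 1.0], [1.2, 2.0, 1.0], [1.1, 3.0, 1.0]]
    y = [0, 0, 0, 1, 1, 1]
    candidates = filter_rank(X, y, top_k=2)
    print(candidates, wrapper_forward(X, y, candidates))
```

The filter is cheap because it scores each gene once, whereas the wrapper retrains the classifier for every candidate subset, which is the computational burden the abstract refers to.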
Affiliation(s)
- Yukyee Leung
- Department of Electrical and Electronic Engineering, Chow Yei Ching Building, University of Hong Kong, Pokfulam Road, Hong Kong.
|
44
|
Sontrop HMJ, Moerland PD, van den Ham R, Reinders MJT, Verhaegh WFJ. A comprehensive sensitivity analysis of microarray breast cancer classification under feature variability. BMC Bioinformatics 2009; 10:389. [PMID: 19941644 PMCID: PMC2789744 DOI: 10.1186/1471-2105-10-389] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/30/2009] [Accepted: 11/26/2009] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Large discrepancies in signature composition and outcome concordance have been observed between different microarray breast cancer expression profiling studies. This is often ascribed to differences in array platform as well as biological variability. We conjecture that other reasons for the observed discrepancies are the measurement error associated with each feature and the choice of preprocessing method. Microarray data are known to be subject to technical variation and the confidence intervals around individual point estimates of expression levels can be wide. Furthermore, the estimated expression values also vary depending on the selected preprocessing scheme. In microarray breast cancer classification studies, however, these two forms of feature variability are almost always ignored and hence their exact role is unclear. RESULTS We have performed a comprehensive sensitivity analysis of microarray breast cancer classification under the two types of feature variability mentioned above. We used data from six state of the art preprocessing methods, using a compendium consisting of eight different datasets, involving 1131 hybridizations, containing data from both one and two-color array technology. For a wide range of classifiers, we performed a joint study on performance, concordance and stability. In the stability analysis we explicitly tested classifiers for their noise tolerance by using perturbed expression profiles that are based on uncertainty information directly related to the preprocessing methods. Our results indicate that signature composition is strongly influenced by feature variability, even if the array platform and the stratification of patient samples are identical. In addition, we show that there is often a high level of discordance between individual class assignments for signatures constructed on data coming from different preprocessing schemes, even if the actual signature composition is identical. 
CONCLUSION Feature variability can have a strong impact on breast cancer signature composition, as well as the classification of individual patient samples. We therefore strongly recommend that feature variability is considered in analyzing data from microarray breast cancer expression profiling experiments.
|
45
|
Abstract
DNA microarrays are powerful tools for studying biological mechanisms and for developing prognostic and predictive classifiers for identifying the patients who require treatment and are the best candidates for specific treatments. Because microarrays produce so much data from each specimen, they offer great opportunities for discovery and great dangers of producing misleading claims. Microarray-based studies require clear objectives for selecting cases and appropriate analysis methods. Effective analysis of microarray data, where the number of measured variables is orders of magnitude greater than the number of cases, requires specialized statistical methods which have recently been developed. Recent literature reviews indicate that serious problems of analysis exist in a substantial proportion of publications. This manuscript attempts to provide a non-technical summary of the key principles of statistical design and analysis for studies that utilize microarray expression profiling.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, Division of Cancer Treatment & Diagnosis, National Cancer Institute, 9000 Rockville Pike, Bethesda, MD 20892-7434, USA.
|
46
|
de Groot MJL, van Berlo RJP, van Winden WA, Verheijen PJT, Reinders MJT, de Ridder D. Metabolite and reaction inference based on enzyme specificities. Bioinformatics 2009; 25:2975-82. [PMID: 19696044 PMCID: PMC2773254 DOI: 10.1093/bioinformatics/btp507] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 12/26/2022]
Abstract
Motivation: Many enzymes are not absolutely specific, or even promiscuous: they can catalyze transformations of more compounds than the traditional ones as listed in, e.g. KEGG. This information is currently only available in databases, such as the BRENDA enzyme activity database. In this article, we propose to model enzyme aspecificity by predicting whether an input compound is likely to be transformed by a certain enzyme. Such a predictor has many applications, for example, to complete reconstructed metabolic networks, to aid in metabolic engineering or to help identify unknown peaks in mass spectra. Results: We have developed a system for metabolite and reaction inference based on enzyme specificities (MaRIboES). It employs structural and stereochemistry similarity measures and molecular fingerprints to generalize enzymatic reactions based on data available in BRENDA. Leave-one-out cross-validation shows that 80% of known reactions are predicted well. Application to the yeast glycolytic and pentose phosphate pathways predicts a large number of known and new reactions, often leading to the formation of novel compounds, as well as a number of interesting bypasses and cross-links. Availability: Matlab and C++ code is freely available at https://gforge.nbic.nl/projects/mariboes/ Contact: d.deridder@tudelft.nl Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- M J L de Groot
- The Delft Bioinformatics Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
|
47
|
k-Top Scoring Pair Algorithm for feature selection in SVM with applications to microarray data classification. Soft Comput 2009. [DOI: 10.1007/s00500-009-0437-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/20/2022]
|
48
|
Daemen A, Gevaert O, Ojeda F, Debucquoy A, Suykens JA, Sempoux C, Machiels JP, Haustermans K, De Moor B. A kernel-based integration of genome-wide data for clinical decision support. Genome Med 2009; 1:39. [PMID: 19356222 PMCID: PMC2684660 DOI: 10.1186/gm39] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Received: 11/04/2008] [Revised: 03/20/2009] [Accepted: 04/03/2009] [Indexed: 12/19/2022] Open
Abstract
Background Although microarray technology allows the investigation of the transcriptomic make-up of a tumor in one experiment, the transcriptome does not completely reflect the underlying biology due to alternative splicing, post-translational modifications, as well as the influence of pathological conditions (for example, cancer) on transcription and translation. This increases the importance of fusing more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome. The current increase in the amount of available omics data emphasizes the need for a methodological integration framework. Methods We propose a kernel-based approach for clinical decision support in which many genome-wide data sources are combined. Integration occurs within the patient domain at the level of kernel matrices before building the classifier. As supervised classification algorithm, a weighted least squares support vector machine is used. We apply this framework to two cancer cases, namely, a rectal cancer data set containing microarray and proteomics data and a prostate cancer data set containing microarray and genomics data. For both cases, multiple outcomes are predicted. Results For the rectal cancer outcomes, the highest leave-one-out (LOO) areas under the receiver operating characteristic curves (AUC) were obtained when combining microarray and proteomics data gathered during therapy and ranged from 0.927 to 0.987. For prostate cancer, all four outcomes had a better LOO AUC when combining microarray and genomics data, ranging from 0.786 for recurrence to 0.987 for metastasis. Conclusions For both cancer sites the prediction of all outcomes improved when more than one genome-wide data set was considered. This suggests that integrating multiple genome-wide data sources increases the predictive performance of clinical decision support models. This emphasizes the need for comprehensive multi-modal data. 
We acknowledge that, in a first phase, this will substantially increase costs; however, this is a necessary investment to ultimately obtain cost-efficient models usable in patient tailored therapy.
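Integration "at the level of kernel matrices" means that each data source is first turned into a patient-by-patient similarity matrix, and the matrices are then combined before any classifier is trained. The Python sketch below illustrates that step only. The linear kernel, the normalization to unit self-similarity, the equal default weights, and the function names are assumptions for illustration; the weighted least squares support vector machine used in the paper is not implemented here.

```python
import math


def linear_kernel(X):
    """Gram matrix K[i][j] = <x_i, x_j> over patients (rows of X)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def normalize_kernel(K):
    """Rescale to K_ij / sqrt(K_ii * K_jj) so sources with different
    numbers of features contribute on a comparable scale.
    Assumes every patient has a non-zero self-similarity K_ii."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    n = len(K)
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]


def combine_kernels(kernels, weights=None):
    """Weighted sum of patient-by-patient kernel matrices.
    Equal weights are used when none are given (an assumption)."""
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two toy data sources measured on the same two patients.
    microarray = [[1.0, 0.0], [0.0, 1.0]]
    proteomics = [[2.0], [2.0]]
    kernels = [normalize_kernel(linear_kernel(src))
               for src in (microarray, proteomics)]
    print(combine_kernels(kernels))
```

The combined matrix can be passed to any kernel classifier that accepts a precomputed kernel, so the data sources never need to share a feature space, only the same set of patients.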
Affiliation(s)
- Anneleen Daemen
- Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, Kasteelpark Arenberg, 3001 Leuven, Belgium.
|
49
|
Annest A, Bumgarner RE, Raftery AE, Yeung KY. Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinformatics 2009; 10:72. [PMID: 19245714 PMCID: PMC2657791 DOI: 10.1186/1471-2105-10-72] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 11/21/2008] [Accepted: 02/26/2009] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. RESULTS We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139). 
CONCLUSION The strength of the iterative BMA algorithm for survival analysis lies in its ability to account for model uncertainty. The results from this study demonstrate that our procedure selects a small number of genes while eclipsing other methods in predictive performance, making it a highly accurate and cost-effective prognostic tool in the clinical setting.
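The model-averaging step described above, in which each contending model's prediction is weighted by its posterior probability, can be sketched in a few lines of Python. This is a simplified illustration and not the iterative BMA algorithm itself: model search, coefficient estimation, and posterior computation are assumed to have been done already, and the function names, the linear risk score, and the cutoff used to split patients into risk groups are hypothetical.

```python
def bma_risk_score(patient, models):
    """Posterior-weighted average of each model's linear predictor.

    patient: {gene: expression value}
    models:  list of (posterior_probability, {gene: coefficient})
    Posterior probabilities are renormalized over the supplied models."""
    total = sum(p for p, _ in models)
    score = 0.0
    for p, coefs in models:
        linear = sum(b * patient[g] for g, b in coefs.items())
        score += (p / total) * linear
    return score


def risk_group(score, cutoff):
    """Assign a patient to a risk category using a cutoff that would be
    chosen on the training data (hypothetical in this sketch)."""
    return "high" if score > cutoff else "low"


if __name__ == "__main__":
    # Toy patient and two contending models with assumed coefficients.
    patient = {"g1": 2.0, "g2": 1.0}
    models = [(0.75, {"g1": 1.0}), (0.25, {"g1": 1.0, "g2": 2.0})]
    score = bma_risk_score(patient, models)
    print(score, risk_group(score, cutoff=2.0))
```

Averaging over models rather than committing to a single one is what lets the approach account for model uncertainty, as the conclusion notes.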
Affiliation(s)
- Amalia Annest
- Institute of Technology/Computing and Software Systems, Box 358426, University of Washington, Tacoma, WA 98402, USA
- Roger E Bumgarner
- Department of Microbiology, Box 358070, University of Washington, Seattle, WA 98195, USA
- Adrian E Raftery
- Department of Statistics, Box 354320, University of Washington, Seattle, WA 98195, USA
- Ka Yee Yeung
- Department of Microbiology, Box 358070, University of Washington, Seattle, WA 98195, USA
|
50
|
Abstract
The desire for biomarkers for diagnosis and prognosis of diseases has never been greater. With the availability of genome data and an increased availability of proteome data, the discovery of biomarkers has become increasingly feasible. However, the task is daunting and requires collaborations among researchers working in the fields of transplantation, immunology, genetics, molecular biology, biostatistics and bioinformatics. With the advancement of high-throughput omic techniques such as genomics and proteomics (collectively known as proteogenomics), efforts have been made to develop diagnostic tools from new and to-be-discovered biomarkers. Yet biomarker validation, particularly in organ transplantation, remains challenging because of the lack of a true gold standard for diagnostic categories and analytical bottlenecks that face high-throughput data deconvolution. Even though the microarray technique is relatively mature, proteomics is still growing with regard to data normalization and analysis methods. Study design, sample selection and rigorous data analysis are the critical issues for biomarker discovery using high-throughput proteogenomic technologies that combine the use and strengths of both genomics and proteomics. In this review, we look into the current status and latest developments in the field of biomarker discovery using genomics and proteomics related to organ transplantation, with an emphasis on the evolution of proteomic technologies.
Affiliation(s)
- Tara K Sigdel
- Department of Pediatrics-Nephrology, Stanford University Medical School, Stanford University, Stanford, CA 94305, USA
|