Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lai C, Reinders MJT, van't Veer LJ, Wessels LFA. A comparison of univariate and multivariate gene selection techniques for classification of cancer datasets. BMC Bioinformatics 2006;7:235. [PMID: 16670007 PMCID: PMC1569875 DOI: 10.1186/1471-2105-7-235] [Citation(s) in RCA: 87] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2005] [Accepted: 05/02/2006] [Indexed: 11/25/2022] Open



Cited by Other Article(s)

Bhardwaj P, Tyagi A, Tyagi S, Antão J, Deng Q. Machine learning model for classification of predominantly allergic and non-allergic asthma among preschool children with asthma hospitalization. J Asthma 2023;60:487-495. [PMID: 35344453 DOI: 10.1080/02770903.2022.2059763] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]

Abstract

OBJECTIVE

Asthma is the most frequent chronic airway illness in preschool children and is difficult to diagnose due to the disease's heterogeneity. This study aimed to investigate different machine learning models and to suggest the most effective one for classifying two forms of asthma in preschool children (predominantly allergic asthma and non-allergic asthma) using a minimum number of features.

METHODS

After pre-processing, 127 patients (70 with non-allergic asthma and 57 with predominantly allergic asthma) were chosen for final analysis from the Frankfurt dataset, which had asthma-related information on 205 patients. The Random Forest algorithm and Chi-square were used to select the key features from a total of 63 features. Six machine learning models: random forest, extreme gradient boosting, support vector machines, adaptive boosting, extra tree classifier, and logistic regression were then trained and tested using 10-fold stratified cross-validation.

RESULTS

Among all features, age, weight, C-reactive protein, eosinophilic granulocytes, oxygen saturation, pre-medication inhaled corticosteroid + long-acting beta2-agonist (PM-ICS + LABA), PM-other (other pre-medication), H-Pulmicort/celestamine (Pulmicort/celestamine during hospitalization), and H-azithromycin (azithromycin during hospitalization) were found to be highly important. The support vector machine approach with a linear kernel was able to differentiate between predominantly allergic asthma and non-allergic asthma with higher accuracy (77.8%) and precision (0.81), a true positive rate of 0.73, a true negative rate of 0.81, an F1 score of 0.81, and a ROC-AUC score of 0.79. Logistic regression was found to be the second-best classifier with an overall accuracy of 76.2%.

CONCLUSION

Predominantly allergic and non-allergic asthma can be classified using machine learning approaches based on nine features.

Supplemental data for this article is available online at www.tandfonline.com/ijas.
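The metrics reported in the abstract above (accuracy, precision, true positive rate, true negative rate, F1) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the cited study, which may also have used averaged variants of these scores.

```python
import math


def _div(numerator, denominator):
    """Divide, returning NaN instead of raising when the denominator is zero."""
    return numerator / denominator if denominator != 0 else math.nan


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    precision = _div(tp, tp + fp)
    tpr = _div(tp, tp + fn)  # true positive rate (sensitivity, recall)
    tnr = _div(tn, tn + fp)  # true negative rate (specificity)
    return {
        "accuracy": _div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "tpr": tpr,
        "tnr": tnr,
        # F1 is the harmonic mean of precision and recall (TPR).
        "f1": _div(2 * precision * tpr, precision + tpr),
    }


# Hypothetical counts, chosen only to illustrate the call.
metrics = binary_metrics(tp=8, fp=2, tn=8, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a cross-validated setting such as the 10-fold stratified scheme described in the abstract, these counts would typically be accumulated over the held-out folds before the metrics are computed.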


A combinatory algorithm for identifying genes in childhood acute lymphoblastic leukemia. GENE REPORTS 2022. [DOI: 10.1016/j.genrep.2021.101433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]

Bhardwaj P, Tiwari P, Olejar K, Parr W, Kulasiri D. A machine learning application in wine quality prediction. MACHINE LEARNING WITH APPLICATIONS 2022. [DOI: 10.1016/j.mlwa.2022.100261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open

Tyagi A, Tiwari P, Bhardwaj P, Chawla H. Prognosis of sexual dimorphism with unfused hyoid bone: Artificial intelligence informed decision making with discriminant analysis. Sci Justice 2021;61:789-796. [PMID: 34802653 DOI: 10.1016/j.scijus.2021.10.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Revised: 06/22/2021] [Accepted: 10/04/2021] [Indexed: 11/18/2022]

Hameed SS, Hassan R, Hassan WH, Muhammadsharif FF, Latiff LA. HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets. PLoS One 2021;16:e0246039. [PMID: 33507983 PMCID: PMC7842997 DOI: 10.1371/journal.pone.0246039] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2020] [Accepted: 01/12/2021] [Indexed: 11/24/2022] Open

Das S, Rai SN. Statistical Approach for Biologically Relevant Gene Selection from High-Throughput Gene Expression Data. ENTROPY (BASEL, SWITZERLAND) 2020;22:E1205. [PMID: 33286973 PMCID: PMC7712650 DOI: 10.3390/e22111205] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/08/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 12/16/2022]

Huang S, Blatti C, Sinha S, Parameswaran A. Uncovering Effective Explanations for Interactive Genomic Data Analysis. PATTERNS 2020;1:100093. [PMID: 33205133 PMCID: PMC7660438 DOI: 10.1016/j.patter.2020.100093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/10/2020] [Revised: 07/13/2020] [Accepted: 08/05/2020] [Indexed: 10/25/2022]

Cherlin S, Wason JMS. Developing and testing high‐efficacy patient subgroups within a clinical trial using risk scores. Stat Med 2020;39:3285-3298. [PMID: 32662542 PMCID: PMC7611900 DOI: 10.1002/sim.8665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2019] [Revised: 03/18/2020] [Accepted: 05/28/2020] [Indexed: 12/13/2022]

Considine EC. The Search for Clinically Useful Biomarkers of Complex Disease: A Data Analysis Perspective. Metabolites 2019;9:E126. [PMID: 31269649 PMCID: PMC6680669 DOI: 10.3390/metabo9070126] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2019] [Revised: 06/20/2019] [Accepted: 06/28/2019] [Indexed: 12/25/2022] Open

Bhowmick SS, Bhattacharjee D, Rato L. In silico markers: an evolutionary and statistical approach to select informative genes of human breast cancer subtypes. Genes Genomics 2019;41:1371-1382. [PMID: 31004329 DOI: 10.1007/s13258-019-00816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2018] [Accepted: 04/02/2019] [Indexed: 10/27/2022]

Emura T, Matsui S, Chen HY. compound.Cox: Univariate feature selection and compound covariate for predicting survival. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2019;168:21-37. [PMID: 30527130 DOI: 10.1016/j.cmpb.2018.10.020] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/28/2018] [Revised: 09/26/2018] [Accepted: 10/26/2018] [Indexed: 05/15/2023]

Wu HC, Wei XG, Chan SC. Novel Consensus Gene Selection Criteria for Distributed GPU Partial Least Squares-Based Gene Microarray Analysis in Diffused Large B Cell Lymphoma (DLBCL) and Related Findings. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018;15:2039-2052. [PMID: 28991749 DOI: 10.1109/tcbb.2017.2760827] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

Statistical approach for selection of biologically informative genes. Gene 2018;655:71-83. [PMID: 29458166 DOI: 10.1016/j.gene.2018.02.044] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2017] [Revised: 11/26/2017] [Accepted: 02/14/2018] [Indexed: 11/23/2022]

Abstract

Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been adjudged through post selection classification accuracy computed through a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biological sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique with 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes which are more biologically relevant. The proposed technique is also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that under the multiple criteria decision-making setup, the proposed technique is best for informative gene selection over the available alternatives. Based on the proposed approach, an R package, BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR.
This study will provide a practical guide to select statistical techniques for selecting informative genes from high dimensional expression data for breeding and system biology studies.


Hameed SS, Hassan R, Muhammad FF. Selection and classification of gene expression in autism disorder: Use of a combination of statistical filters and a GBPSO-SVM algorithm. PLoS One 2017;12:e0187371. [PMID: 29095904 PMCID: PMC5667738 DOI: 10.1371/journal.pone.0187371] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2017] [Accepted: 10/18/2017] [Indexed: 11/30/2022] Open

Alexe G, Dalgin G, Ramaswamy R, Delisi C, Bhanot G. Data Perturbation Independent Diagnosis and Validation of Breast Cancer Subtypes Using Clustering and Patterns. Cancer Inform 2017. [DOI: 10.1177/117693510600200006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open

Damon C, Luck M, Toullec L, Etienne I, Buchler M, Hurault de Ligny B, Choukroun G, Thierry A, Vigneau C, Moulin B, Heng AE, Subra JF, Legendre C, Monnot A, Yartseva A, Bateson M, Laurent-Puig P, Anglicheau D, Beaune P, Loriot MA, Thervet E, Pallet N. Predictive Modeling of Tacrolimus Dose Requirement Based on High-Throughput Genetic Screening. Am J Transplant 2017;17:1008-1019. [PMID: 27597269 DOI: 10.1111/ajt.14040] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2016] [Revised: 08/24/2016] [Accepted: 08/26/2016] [Indexed: 01/25/2023]

Affiliation(s)

C Damon Hypercube Institute, Paris, France
M Luck Hypercube Institute, Paris, France; Paris Descartes University, Paris, France
L Toullec Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
I Etienne Department of Nephrology, Rouen University Hospital, Rouen, France
M Buchler Department of Nephrology, Tours University Hospital, Tours, France
B Hurault de Ligny Department of Nephrology, Caen University Hospital, Caen, France
G Choukroun Department of Nephrology, Amiens University Hospital, Amiens, France
A Thierry Department of Nephrology, Poitiers University Hospital, Poitiers, France
C Vigneau Department of Nephrology, Rennes University Hospital, Rennes, France
B Moulin Department of Nephrology, Strasbourg University Hospital, Strasbourg, France
A-E Heng Department of Nephrology, Clermont-Ferrand University Hospital, Clermont-Ferrand, France
J-F Subra Department of Nephrology, Angers University Hospital, Angers, France
C Legendre Department of Nephrology, Necker Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
A Monnot Hypercube Institute, Paris, France
A Yartseva Hypercube Institute, Paris, France
M Bateson Hypercube Institute, Paris, France
P Laurent-Puig Paris Descartes University, Paris, France; Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France; Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
D Anglicheau Department of Nephrology, Necker Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
P Beaune Paris Descartes University, Paris, France; Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France; Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
M A Loriot Paris Descartes University, Paris, France; Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France; Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France
E Thervet Paris Descartes University, Paris, France; Department of Nephrology, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France
N Pallet Paris Descartes University, Paris, France; Department of Clinical Chemistry, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France; Institut National pour la Santé et la Recherche Médicale (INSERM) U1147, Paris, France; Department of Nephrology, Georges Pompidou European Hospital, Assistance Publique Hôpitaux de Paris, Paris, France


A Meta-Review of Feature Selection Techniques in the Context of Microarray Data. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2017. [DOI: 10.1007/978-3-319-56148-6_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]

Bari MG, Salekin S, Zhang JM. A Robust and Efficient Feature Selection Algorithm for Microarray Data. Mol Inform 2016;36. [PMID: 28000384 DOI: 10.1002/minf.201600099] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2016] [Accepted: 11/21/2016] [Indexed: 12/20/2022]

Sardana M, Agrawal R, Kaur B. A hybrid of clustering and quantum genetic algorithm for relevant genes selection for cancer microarray data. INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED AND INTELLIGENT ENGINEERING SYSTEMS 2016. [DOI: 10.3233/kes-160341] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]

Tabakhi S, Najafi A, Ranjbar R, Moradi P. Gene selection for microarray data classification using a novel ant colony optimization. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.05.022] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]

Boulesteix AL, Hable R, Lauer S, Eugster MJA. A Statistical Framework for Hypothesis Testing in Real Data Comparison Studies. AM STAT 2015. [DOI: 10.1080/00031305.2015.1005128] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]

Drotár P, Gazda J, Smékal Z. An experimental comparison of feature selection methods on two-class biomedical datasets. Comput Biol Med 2015;66:1-10. [PMID: 26327447 DOI: 10.1016/j.compbiomed.2015.08.010] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2015] [Revised: 08/05/2015] [Accepted: 08/12/2015] [Indexed: 11/30/2022]

Hemphill E, Lindsay J, Lee C, Măndoiu II, Nelson CE. Feature selection and classifier performance on diverse biological datasets. BMC Bioinformatics 2014;15 Suppl 13:S4. [PMID: 25434802 PMCID: PMC4248652 DOI: 10.1186/1471-2105-15-s13-s4] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

Abstract

Background

There is an ever-expanding range of technologies that generate very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, combined with identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore the model's performance across multiple biological data types.

Results

This paper presents the results of a large scale empirical study whereby a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize predictive accuracy of all models at all parameters on five different data types available for the NCI-60 cell lines. A validation experiment was conducted using external data in order to demonstrate robustness.

Conclusions

As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers gene and protein expression data types are able to differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data.

Interestingly, as the number of selected biomarkers increases, the best-performing classifiers based on SNP data match or slightly outperform those based on gene and protein expression, while those based on aCGH and microRNA data continue to perform the worst. It is observed that one class of feature selection method and classifier is consistently among the top performers across data types and numbers of markers, suggesting that well-performing feature-selection/classifier pairings are likely to be robust in biological classification problems regardless of the data type used in the analysis.


Wang X. Identification of Marker Genes for Cancer Based on Microarrays Using a Computational Biology Approach. Curr Bioinform 2014;9:140-146. [PMID: 24683388 DOI: 10.2174/1574893608999140109115649] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]

Leiva R, Roy A. Classification of Higher-order Data with Separable Covariance and Structured Multiplicative or Additive Mean Models. COMMUN STAT-THEOR M 2014. [DOI: 10.1080/03610926.2013.841931] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]

A comparative analysis of biomarker selection techniques. BIOMED RESEARCH INTERNATIONAL 2013;2013:387673. [PMID: 24324960 PMCID: PMC3842054 DOI: 10.1155/2013/387673] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/23/2013] [Revised: 09/22/2013] [Accepted: 09/23/2013] [Indexed: 11/17/2022]

Genomic biomarkers for personalized medicine: development and validation in clinical studies. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013;2013:865980. [PMID: 23690882 PMCID: PMC3652056 DOI: 10.1155/2013/865980] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/26/2013] [Accepted: 03/22/2013] [Indexed: 12/26/2022]

Abdel Samee NM, Solouma NH, Kadah YM. Detection of biomarkers for hepatocellular carcinoma using a hybrid univariate gene selection methods. Theor Biol Med Model 2012;9:34. [PMID: 22867264 PMCID: PMC3570375 DOI: 10.1186/1742-4682-9-34] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2012] [Accepted: 07/03/2012] [Indexed: 05/26/2023] Open

Abstract

Background

Discovering new biomarkers has a great role in improving early diagnosis of Hepatocellular carcinoma (HCC). The experimental determination of biomarkers needs a lot of time and money. This motivates this work to use in-silico prediction of biomarkers to reduce the number of experiments required for detecting new ones. This is achieved by extracting the most representative genes in microarrays of HCC.

Results

In this work, we provide a method for extracting the differentially expressed genes, specifically the up-regulated ones, that can be considered candidate biomarkers in high-throughput microarrays of HCC. We examine the power of several gene selection methods (such as Pearson’s correlation coefficient, Cosine coefficient, Euclidean distance, Mutual information and Entropy with different estimators) in selecting informative genes. A biological interpretation of the highly ranked genes is done using KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, ENTREZ and DAVID (Database for Annotation, Visualization, and Integrated Discovery) databases. The top ten genes selected using Pearson’s correlation coefficient and Cosine coefficient contained six genes that have been implicated in cancer (often multiple cancers) genesis in previous studies. Fewer such genes were obtained by the other methods (4 genes using Mutual information, 3 genes using Euclidean distance and only one gene using Entropy). A better result was obtained by the utilization of a hybrid approach based on intersecting the highly ranked genes in the output of all investigated methods. This hybrid combination yielded seven genes (2 genes for HCC and 5 genes in different types of cancer) in the top ten genes of the list of intersected genes.

Conclusions

To strengthen the effectiveness of the univariate selection methods, we propose a hybrid approach by intersecting several of these methods in a cascaded manner. This approach surpasses each of the univariate selection methods used individually, according to biological interpretation and the examination of gene expression signal profiles.
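The abstract above does not spell out how the cascaded intersection is carried out, so the following is only one plausible reading, offered as a rough sketch: each method contributes its top-k genes, and a gene survives only if it appears in every method's list, with the methods applied one after another. The function names, the per-method score tables, and the value of k are all hypothetical.

```python
def top_k(scores, k):
    """Return the k gene names with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


def cascaded_intersection(score_tables, k):
    """Pass genes through each method's top-k list in turn.

    score_tables maps a method name to a {gene: score} dict in which
    larger scores mean "more informative". The output keeps the ordering
    produced by the first method.
    """
    methods = list(score_tables)
    survivors = top_k(score_tables[methods[0]], k)
    for method in methods[1:]:
        allowed = set(top_k(score_tables[method], k))
        survivors = [gene for gene in survivors if gene in allowed]
    return survivors


# Hypothetical scores for four genes under three univariate criteria.
score_tables = {
    "pearson": {"g1": 0.90, "g2": 0.80, "g3": 0.70, "g4": 0.10},
    "cosine": {"g1": 0.85, "g2": 0.20, "g3": 0.80, "g4": 0.60},
    "mutual_information": {"g1": 0.50, "g2": 0.40, "g3": 0.60, "g4": 0.10},
}
print(cascaded_intersection(score_tables, k=3))
```

For criteria where smaller values indicate a better gene (a distance, for example), the scores would need to be negated or otherwise inverted before being passed in.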


Siebourg J, Merdes G, Misselwitz B, Hardt WD, Beerenwinkel N. Stability of gene rankings from RNAi screens. Bioinformatics 2012;28:1612-8. [PMID: 22513992 DOI: 10.1093/bioinformatics/bts192] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]

Tapia E, Bulacio P, Angelone L. Sparse and stable gene selection with consensus SVM-RFE. Pattern Recognit Lett 2012. [DOI: 10.1016/j.patrec.2011.09.031] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]

Haury AC, Gestraud P, Vert JP. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS One 2011;6:e28210. [PMID: 22205940 PMCID: PMC3244389 DOI: 10.1371/journal.pone.0028210] [Citation(s) in RCA: 159] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2011] [Accepted: 11/03/2011] [Indexed: 01/08/2023] Open

Robust two-gene classifiers for cancer prediction. Genomics 2011;99:90-5. [PMID: 22138042 DOI: 10.1016/j.ygeno.2011.11.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2011] [Revised: 11/04/2011] [Accepted: 11/09/2011] [Indexed: 11/23/2022]

Wang X, Simon R. Microarray-based cancer prediction using single genes. BMC Bioinformatics 2011;12:391. [PMID: 21982331 PMCID: PMC3228540 DOI: 10.1186/1471-2105-12-391] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2011] [Accepted: 10/07/2011] [Indexed: 11/23/2022] Open

Shi P, Ray S, Zhu Q, Kon MA. Top scoring pairs for feature selection in machine learning and applications to cancer outcome prediction. BMC Bioinformatics 2011;12:375. [PMID: 21939564 PMCID: PMC3223741 DOI: 10.1186/1471-2105-12-375] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2010] [Accepted: 09/23/2011] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

The widely used k top scoring pair (k-TSP) algorithm is a simple yet powerful parameter-free classifier. It owes its success in many cancer microarray datasets to an effective feature selection algorithm that is based on relative expression ordering of gene pairs. However, its general robustness does not extend to some difficult datasets, such as those involving cancer outcome prediction, which may be due to the relatively simple voting scheme used by the classifier. We believe that the performance can be enhanced by separating its effective feature selection component and combining it with a powerful classifier such as the support vector machine (SVM). More generally the top scoring pairs generated by the k-TSP ranking algorithm can be used as a dimensionally reduced subspace for other machine learning classifiers.

RESULTS

We developed an approach integrating the k-TSP ranking algorithm (TSP) with other machine learning methods, allowing combination of the computationally efficient, multivariate feature ranking of k-TSP with multivariate classifiers such as SVM. We evaluated this hybrid scheme (k-TSP+SVM) in a range of simulated datasets with known data structures. As compared with other feature selection methods, such as a univariate method similar to Fisher's discriminant criterion (Fisher), or a recursive feature elimination embedded in SVM (RFE), TSP is increasingly more effective than the other two methods as the informative genes become progressively more correlated, which is demonstrated both in terms of the classification performance and the ability to recover true informative genes. We also applied this hybrid scheme to four cancer prognosis datasets, in which k-TSP+SVM outperforms k-TSP classifier in all datasets, and achieves either comparable or superior performance to that using SVM alone. In concurrence with what is observed in simulation, TSP appears to be a better feature selector than Fisher and RFE in some of the cancer datasets.

CONCLUSIONS

The k-TSP ranking algorithm can be used as a computationally efficient, multivariate filter method for feature selection in machine learning. SVM in combination with k-TSP ranking algorithm outperforms k-TSP and SVM alone in simulated datasets and in some cancer prognosis datasets. Simulation studies suggest that as a feature selector, it is better tuned to certain data characteristics, i.e. correlations among informative genes, which is potentially interesting as an alternative feature ranking method in pathway analysis.
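The idea of using top scoring pairs as a feature selector, as described in the abstract above, can be illustrated with a simplified sketch. A gene pair (i, j) is scored by how differently the ordering x_i < x_j occurs in the two classes, and the best pairs are then encoded as 0/1 features for any downstream classifier. This sketch omits details of the published k-TSP procedure, such as tie-breaking and restricting the selection to disjoint pairs. All function names and the toy data are hypothetical.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """Return |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|.

    samples is a list of equal-length numeric rows; labels holds 0 or 1
    per row. Both classes are assumed to be non-empty.
    """
    probabilities = []
    for cls in (0, 1):
        rows = [row for row, label in zip(samples, labels) if label == cls]
        probabilities.append(sum(row[i] < row[j] for row in rows) / len(rows))
    return abs(probabilities[0] - probabilities[1])


def rank_pairs(samples, labels, k):
    """Return the k feature-index pairs with the largest TSP scores."""
    n_features = len(samples[0])
    scored = [
        (tsp_score(samples, labels, i, j), (i, j))
        for i, j in combinations(range(n_features), 2)
    ]
    # sorted() is stable, so ties keep their original enumeration order.
    scored = sorted(scored, key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:k]]


def pair_features(samples, pairs):
    """Encode each sample as 0/1 indicators of x_i < x_j for the given pairs."""
    return [[int(row[i] < row[j]) for i, j in pairs] for row in samples]


# Toy data: feature 0 is below feature 1 in class 0 and above it in class 1.
samples = [[1, 2, 5], [2, 3, 5], [1, 4, 6], [3, 1, 5], [4, 2, 6], [5, 1, 7]]
labels = [0, 0, 0, 1, 1, 1]
pairs = rank_pairs(samples, labels, k=1)
print(pairs, pair_features(samples, pairs))
```

The resulting indicator matrix is what would be handed to a classifier such as an SVM in the hybrid scheme. The classifier itself is left out here to keep the sketch free of external dependencies.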


Muselli M, Bertoni A, Frasca M, Beghini A, Ruffino F, Valentini G. A mathematical model for the validation of gene selection methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011;8:1385-1392. [PMID: 21778526 DOI: 10.1109/tcbb.2010.83] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]

On the fusion of threshold classifiers for categorization and dimensionality reduction. Comput Stat 2011. [DOI: 10.1007/s00180-011-0243-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]

Kim KI, Simon R. Probabilistic classifiers with high-dimensional data. Biostatistics 2010;12:399-412. [PMID: 21087946 DOI: 10.1093/biostatistics/kxq069] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Mi Z, Shen K, Song N, Cheng C, Song C, Kaminski N, Tseng GC. Module-based prediction approach for robust inter-study predictions in microarray data. Bioinformatics 2010;26:2586-93. [PMID: 20719761 DOI: 10.1093/bioinformatics/btq472] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]

Abraham G, Kowalczyk A, Loi S, Haviv I, Zobel J. Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics 2010;11:277. [PMID: 20500821 PMCID: PMC2895626 DOI: 10.1186/1471-2105-11-277] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2010] [Accepted: 05/25/2010] [Indexed: 02/08/2023] Open

Deconvoluting the 'omics' for organ transplantation. Curr Opin Organ Transplant 2010;14:544-51. [PMID: 19644370 DOI: 10.1097/mot.0b013e32833068fb] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 12/19/2022]

Mundra P, Rajapakse J. SVM-RFE With MRMR Filter for Gene Selection. IEEE Trans Nanobioscience 2010;9:31-7. [DOI: 10.1109/tnb.2009.2035284] [Citation(s) in RCA: 218] [Impact Index Per Article: 15.6] [Indexed: 11/08/2022]

Popovici V, Chen W, Gallas BG, Hatzis C, Shi W, Samuelson FW, Nikolsky Y, Tsyganova M, Ishkin A, Nikolskaya T, Hess KR, Valero V, Booser D, Delorenzi M, Hortobagyi GN, Shi L, Symmans WF, Pusztai L. Effect of training-sample size and classification difficulty on the accuracy of genomic predictors. Breast Cancer Res 2010;12:R5. [PMID: 20064235 PMCID: PMC2880423 DOI: 10.1186/bcr2468] [Citation(s) in RCA: 146] [Impact Index Per Article: 10.4] [Received: 10/14/2009] [Revised: 12/18/2009] [Accepted: 01/11/2010] [Indexed: 12/31/2022] Open

Leung Y, Hung Y. A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification. IEEE/ACM Trans Comput Biol Bioinform 2010;7:108-117. [PMID: 20150673 DOI: 10.1109/tcbb.2008.46] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Indexed: 05/28/2023]

Sontrop HMJ, Moerland PD, van den Ham R, Reinders MJT, Verhaegh WFJ. A comprehensive sensitivity analysis of microarray breast cancer classification under feature variability. BMC Bioinformatics 2009;10:389. [PMID: 19941644 PMCID: PMC2789744 DOI: 10.1186/1471-2105-10-389] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 07/30/2009] [Accepted: 11/26/2009] [Indexed: 01/01/2023] Open

Abstract

BACKGROUND

Large discrepancies in signature composition and outcome concordance have been observed between different microarray breast cancer expression profiling studies. This is often ascribed to differences in array platform as well as biological variability. We conjecture that other reasons for the observed discrepancies are the measurement error associated with each feature and the choice of preprocessing method. Microarray data are known to be subject to technical variation and the confidence intervals around individual point estimates of expression levels can be wide. Furthermore, the estimated expression values also vary depending on the selected preprocessing scheme. In microarray breast cancer classification studies, however, these two forms of feature variability are almost always ignored and hence their exact role is unclear.

RESULTS

We have performed a comprehensive sensitivity analysis of microarray breast cancer classification under the two types of feature variability mentioned above. We used data from six state-of-the-art preprocessing methods, using a compendium consisting of eight different datasets, involving 1131 hybridizations, containing data from both one- and two-color array technology. For a wide range of classifiers, we performed a joint study on performance, concordance and stability. In the stability analysis we explicitly tested classifiers for their noise tolerance by using perturbed expression profiles that are based on uncertainty information directly related to the preprocessing methods. Our results indicate that signature composition is strongly influenced by feature variability, even if the array platform and the stratification of patient samples are identical. In addition, we show that there is often a high level of discordance between individual class assignments for signatures constructed on data coming from different preprocessing schemes, even if the actual signature composition is identical.

CONCLUSION

Feature variability can have a strong impact on breast cancer signature composition, as well as the classification of individual patient samples. We therefore strongly recommend that feature variability is considered in analyzing data from microarray breast cancer expression profiling experiments.

Simon R. Analysis of DNA microarray expression data. Best Pract Res Clin Haematol 2009;22:271-82. [PMID: 19698933 DOI: 10.1016/j.beha.2009.07.001] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022]

de Groot MJL, van Berlo RJP, van Winden WA, Verheijen PJT, Reinders MJT, de Ridder D. Metabolite and reaction inference based on enzyme specificities. Bioinformatics 2009;25:2975-82. [PMID: 19696044 PMCID: PMC2773254 DOI: 10.1093/bioinformatics/btp507] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 12/26/2022]

k-Top Scoring Pair Algorithm for feature selection in SVM with applications to microarray data classification. Soft comput 2009. [DOI: 10.1007/s00500-009-0437-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/20/2022]

Daemen A, Gevaert O, Ojeda F, Debucquoy A, Suykens JA, Sempoux C, Machiels JP, Haustermans K, De Moor B. A kernel-based integration of genome-wide data for clinical decision support. Genome Med 2009;1:39. [PMID: 19356222 PMCID: PMC2684660 DOI: 10.1186/gm39] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Received: 11/04/2008] [Revised: 03/20/2009] [Accepted: 04/03/2009] [Indexed: 12/19/2022] Open

Abstract

Background

Although microarray technology allows the investigation of the transcriptomic make-up of a tumor in one experiment, the transcriptome does not completely reflect the underlying biology due to alternative splicing, post-translational modifications, as well as the influence of pathological conditions (for example, cancer) on transcription and translation. This increases the importance of fusing more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome. The current increase in the amount of available omics data emphasizes the need for a methodological integration framework.

Methods

We propose a kernel-based approach for clinical decision support in which many genome-wide data sources are combined. Integration occurs within the patient domain at the level of kernel matrices before building the classifier. As the supervised classification algorithm, a weighted least squares support vector machine is used. We apply this framework to two cancer cases, namely, a rectal cancer data set containing microarray and proteomics data and a prostate cancer data set containing microarray and genomics data. For both cases, multiple outcomes are predicted.
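
As a rough illustration of what integration "at the level of kernel matrices" can look like, the sketch below builds one patient-by-patient kernel per data source and takes a weighted sum. It is not the authors' implementation: the linear kernel, the unit-diagonal normalization, the equal default weights, the random data, and all function and variable names are assumptions made for this example, and the weighted least squares support vector machine itself is not shown.

```python
import numpy as np

def linear_kernel(X):
    """Patient-by-patient linear kernel for one data source (rows = patients)."""
    return X @ X.T

def normalize_kernel(K):
    """Rescale a kernel so every patient has unit self-similarity."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(kernels, weights=None):
    """Weighted sum of kernel matrices defined over the same patients."""
    if weights is None:
        # Assumption: equal weight per data source.
        weights = np.full(len(kernels), 1.0 / len(kernels))
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical data: 6 patients measured on two platforms.
rng = np.random.default_rng(0)
n_patients = 6
microarray = rng.normal(size=(n_patients, 50))  # made-up expression features
proteomics = rng.normal(size=(n_patients, 8))   # made-up protein features

K_combined = combine_kernels([
    normalize_kernel(linear_kernel(microarray)),
    normalize_kernel(linear_kernel(proteomics)),
])
```

Because the sources are merged as patient-by-patient matrices, they may have different numbers of features. The combined matrix could then be handed to any classifier that accepts a precomputed kernel.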

Results

For the rectal cancer outcomes, the highest leave-one-out (LOO) areas under the receiver operating characteristic curves (AUC) were obtained when combining microarray and proteomics data gathered during therapy and ranged from 0.927 to 0.987. For prostate cancer, all four outcomes had a better LOO AUC when combining microarray and genomics data, ranging from 0.786 for recurrence to 0.987 for metastasis.

Conclusions

For both cancer sites the prediction of all outcomes improved when more than one genome-wide data set was considered. This suggests that integrating multiple genome-wide data sources increases the predictive performance of clinical decision support models. This emphasizes the need for comprehensive multi-modal data. We acknowledge that, in a first phase, this will substantially increase costs; however, this is a necessary investment to ultimately obtain cost-efficient models usable in patient tailored therapy.

Annest A, Bumgarner RE, Raftery AE, Yeung KY. Iterative Bayesian Model Averaging: a method for the application of survival analysis to high-dimensional microarray data. BMC Bioinformatics 2009;10:72. [PMID: 19245714 PMCID: PMC2657791 DOI: 10.1186/1471-2105-10-72] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Received: 11/21/2008] [Accepted: 02/26/2009] [Indexed: 11/17/2022] Open

Abstract

BACKGROUND

Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes.
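
The model-averaging step described above can be sketched in a few lines. This is only a minimal illustration of averaging per-model predictions with posterior model probabilities as weights; it omits the iterative gene selection and the survival models entirely, and the function name, the three models, and all numbers are hypothetical.

```python
import numpy as np

def bma_predict(posterior_model_probs, per_model_predictions):
    """Average per-model predictions, weighted by posterior model probability."""
    w = np.asarray(posterior_model_probs, dtype=float)
    w = w / w.sum()  # renormalize in case the weights do not sum to 1
    P = np.asarray(per_model_predictions, dtype=float)  # shape: (models, patients)
    return w @ P

# Hypothetical example: three contending models scoring two patients.
model_probs = [0.5, 0.3, 0.2]
risk_scores = [
    [0.9, 0.2],  # model 1
    [0.7, 0.4],  # model 2
    [0.1, 0.6],  # model 3
]
averaged = bma_predict(model_probs, risk_scores)
```

Models with higher posterior probability contribute more to the averaged score, which is how model uncertainty enters the final prediction.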

RESULTS

We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139).

CONCLUSION

The strength of the iterative BMA algorithm for survival analysis lies in its ability to account for model uncertainty. The results from this study demonstrate that our procedure selects a small number of genes while eclipsing other methods in predictive performance, making it a highly accurate and cost-effective prognostic tool in the clinical setting.

Sigdel TK, Sarwal MM. The proteogenomic path towards biomarker discovery. Pediatr Transplant 2008;12:737-47. [PMID: 18764911 PMCID: PMC2574627 DOI: 10.1111/j.1399-3046.2008.01018.x] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.8] [Indexed: 12/12/2022]