1
|
Osuala R, Kushibar K, Garrucho L, Linardos A, Szafranowska Z, Klein S, Glocker B, Diaz O, Lekadir K. Data synthesis and adversarial networks: A review and meta-analysis in cancer imaging. Med Image Anal 2023; 84:102704. [PMID: 36473414 DOI: 10.1016/j.media.2022.102704] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/23/2021] [Revised: 11/02/2022] [Accepted: 11/21/2022] [Indexed: 11/26/2022]
Abstract
Despite technological and medical advances, the detection, interpretation, and treatment of cancer based on imaging data continue to pose significant challenges. These include inter-observer variability, class imbalance, dataset shifts, inter- and intra-tumour heterogeneity, malignancy determination, and treatment effect uncertainty. Given the recent advancements in image synthesis, Generative Adversarial Networks (GANs), and adversarial training, we assess the potential of these technologies to address a number of key challenges of cancer imaging. We categorise these challenges into (a) data scarcity and imbalance, (b) data access and privacy, (c) data annotation and segmentation, (d) cancer detection and diagnosis, and (e) tumour profiling, treatment planning and monitoring. Based on our analysis of 164 publications that apply adversarial training techniques in the context of cancer imaging, we highlight multiple underexplored solutions with research potential. We further contribute the Synthesis Study Trustworthiness Test (SynTRUST), a meta-analysis framework for assessing the validation rigour of medical image synthesis studies. SynTRUST is based on 26 concrete measures of thoroughness, reproducibility, usefulness, scalability, and tenability. Based on SynTRUST, we analyse 16 of the most promising cancer imaging challenge solutions and observe a high validation rigour in general, but also several desirable improvements. With this work, we strive to bridge the gap between the needs of the clinical cancer imaging community and the current and prospective research on data synthesis and adversarial networks in the artificial intelligence community.
Affiliation(s)
- Richard Osuala
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Kaisar Kushibar
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Lidia Garrucho
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Akis Linardos
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Zuzanna Szafranowska
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Stefan Klein
  - Biomedical Imaging Group Rotterdam, Department of Radiology & Nuclear Medicine, Erasmus MC, Rotterdam, The Netherlands
- Ben Glocker
  - Biomedical Image Analysis Group, Department of Computing, Imperial College London, UK
- Oliver Diaz
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
- Karim Lekadir
  - Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona, Spain
|
2
|
Lau W, Aaltonen L, Gunn M, Yetisgen M. Automatic Assignment of Radiology Examination Protocols Using Pre-trained Language Models with Knowledge Distillation. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:668-676. [PMID: 35308920 PMCID: PMC8861685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Selecting a radiology examination protocol is a repetitive and time-consuming process. In this paper, we present a deep learning approach to automatically assign protocols to computed tomography examinations by pre-training a domain-specific BERT model (BERTrad). To handle the high data imbalance across exam protocols, we used a knowledge distillation approach that up-sampled the minority classes through data augmentation. We compared the classification performance of the described approach with n-gram models using Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Random Forest (RF) classifiers, as well as the BERTbase model. SVM, GBM and RF achieved macro-averaged F1 scores of 0.45, 0.45, and 0.6, respectively, while BERTbase and BERTrad achieved 0.61 and 0.63. Knowledge distillation boosted performance on the minority classes and achieved an F1 score of 0.66.
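The macro-averaged F1 score reported in this abstract gives every class the same weight, which makes it sensitive to how well the rare protocols are classified. The following is a minimal sketch of the metric in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: the classifier ignores the minority class "b".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.75, while macro-F1 is 3/7 (about 0.43) because the missed minority class contributes an F1 of zero to the average.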
Affiliation(s)
- Wilson Lau
  - Department of Biomedical and Health Informatics
- Meliha Yetisgen
  - Department of Biomedical and Health Informatics
  - Department of Linguistics, University of Washington, Seattle, WA
|
3
|
Bottino F, Tagliente E, Pasquini L, Napoli AD, Lucignani M, Figà-Talamanca L, Napolitano A. COVID Mortality Prediction with Machine Learning Methods: A Systematic Review and Critical Appraisal. J Pers Med 2021; 11:893. [PMID: 34575670 PMCID: PMC8467935 DOI: 10.3390/jpm11090893] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/24/2021] [Revised: 08/26/2021] [Accepted: 09/03/2021] [Indexed: 12/21/2022] Open
Abstract
More than a year has passed since the report of the first case of coronavirus disease 2019 (COVID), and increasing deaths continue to occur. Minimizing the time required for resource allocation and clinical decision making, such as triage, choice of ventilation modes and admission to the intensive care unit is important. Machine learning techniques are acquiring an increasingly sought-after role in predicting the outcome of COVID patients. Particularly, the use of baseline machine learning techniques is rapidly developing in COVID mortality prediction, since a mortality prediction model could rapidly and effectively help clinical decision-making for COVID patients at imminent risk of death. Recent studies reviewed predictive models for SARS-CoV-2 diagnosis, severity, length of hospital stay, intensive care unit admission or mechanical ventilation modes outcomes; however, systematic reviews focused on prediction of COVID mortality outcome with machine learning methods are lacking in the literature. The present review looked into the studies that implemented machine learning, including deep learning, methods in COVID mortality prediction thus trying to present the existing published literature and to provide possible explanations of the best results that the studies obtained. The study also discussed challenging aspects of current studies, providing suggestions for future developments.
Affiliation(s)
- Francesca Bottino
  - Medical Physics Department Bambino Gesù Children’s Hospital, Scientific Institute for Research, Hospitalization and Healthcare (IRCCS), 00165 Rome, Italy
- Emanuela Tagliente
  - Medical Physics Department Bambino Gesù Children’s Hospital, Scientific Institute for Research, Hospitalization and Healthcare (IRCCS), 00165 Rome, Italy
- Luca Pasquini
  - Neuroradiology Unit, NESMOS Department, Sant’Andrea Hospital, La Sapienza University, 00165 Rome, Italy
  - Neuroradiology Service, Radiology Department, Memorial Sloan Kettering Cancer Center, New York, NY 1275, USA
- Alberto Di Napoli
  - Neuroradiology Unit, NESMOS Department, Sant’Andrea Hospital, La Sapienza University, 00165 Rome, Italy
  - Radiology Department, Castelli Romani Hospital, 00040 Ariccia (RM), Italy
- Martina Lucignani
  - Medical Physics Department Bambino Gesù Children’s Hospital, Scientific Institute for Research, Hospitalization and Healthcare (IRCCS), 00165 Rome, Italy
- Lorenzo Figà-Talamanca
  - Neuroradiology Unit, Imaging Department, Bambino Gesù Children’s Hospital, Scientific Institute for Research, Hospitalization and Healthcare (IRCCS), 00165 Rome, Italy
- Antonio Napolitano
  - Medical Physics Department Bambino Gesù Children’s Hospital, Scientific Institute for Research, Hospitalization and Healthcare (IRCCS), 00165 Rome, Italy
|
4
|
Xu C, Zhu G. Semi-supervised Learning Algorithm Based on Linear Lie Group for Imbalanced Multi-class Classification. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10287-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
5
|
Abdulrauf Sharifai G, Zainol Z. Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm. Genes (Basel) 2020; 11:genes11070717. [PMID: 32605144 PMCID: PMC7397300 DOI: 10.3390/genes11070717] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/26/2019] [Revised: 12/19/2019] [Accepted: 01/07/2020] [Indexed: 11/16/2022] Open
Abstract
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding when samples are limited but the number of features is massive (high dimensionality). High-dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as those involving biomedical data. Numerous researchers have investigated either imbalanced-class or high-dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high-dimensional and imbalanced-class problems, because of their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent both the minority and the majority class. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to cast the feature selection process as an optimisation problem that selects the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high-dimensional and imbalanced datasets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
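The G-mean mentioned in this abstract is commonly defined as the geometric mean of the per-class recalls; for two classes this equals the square root of sensitivity times specificity. The sketch below assumes that common definition, which the abstract itself does not spell out, and uses a hypothetical helper name `g_mean` with toy labels.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed definition, see lead-in)."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    # One class with zero recall drives the whole score to zero.
    return math.prod(recalls) ** (1 / len(recalls))


# Toy imbalanced example: four majority samples (1), two minority samples (0).
y_true = [1, 1, 1, 1, 0, 0]
balanced_guess = [1, 1, 1, 0, 0, 1]   # recall 3/4 on class 1, 1/2 on class 0
majority_only = [1, 1, 1, 1, 1, 1]    # recall 1 on class 1, 0 on class 0
print(g_mean(y_true, balanced_guess), g_mean(y_true, majority_only))
```

A classifier that always predicts the majority class scores a G-mean of zero even though its accuracy is 4/6, which is why the metric is popular for imbalanced problems.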
Affiliation(s)
- Garba Abdulrauf Sharifai
  - Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria
  - School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
  - Correspondence: Tel.: +60-111-317-0481 or +60-194-004-327
- Zurinahni Zainol
  - School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
|
6
|
Almilaji O, Smith C, Surgenor S, Clegg A, Williams E, Thomas P, Snook J. Refinement and validation of the IDIOM score for predicting the risk of gastrointestinal cancer in iron deficiency anaemia. BMJ Open Gastroenterol 2020; 7:e000403. [PMID: 32444424 PMCID: PMC7247388 DOI: 10.1136/bmjgast-2020-000403] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/15/2020] [Revised: 03/30/2020] [Accepted: 04/08/2020] [Indexed: 01/27/2023] Open
Abstract
OBJECTIVE: To refine and validate a model for predicting the risk of gastrointestinal (GI) cancer in iron deficiency anaemia (IDA) and to develop an app to facilitate use in clinical practice.
DESIGN: Three elements: (1) analysis of a dataset of 2390 cases of IDA to validate the predictive value of age, sex, blood haemoglobin concentration (Hb), mean cell volume (MCV) and iron studies on the probability of underlying GI cancer; (2) a pilot study of the benefit of adding faecal immunochemical testing (FIT) into the model; and (3) development of an app based on the model.
RESULTS: Age, sex and Hb were all strong, independent predictors of the risk of GI cancer, with ORs (95% CI) of 1.05 per year (1.03 to 1.07, p<0.00001), 2.86 for men (2.03 to 4.06, p<0.00001) and 1.03 for each g/L reduction in Hb (1.01 to 1.04, p<0.0001) respectively. An association with MCV was also revealed, with an OR of 1.03 for each fl reduction (1.01 to 1.05, p<0.02). The model was confirmed to be robust by an internal validation exercise. In the pilot study of high-risk cases, FIT was also predictive of GI cancer (OR 6.6, 95% CI 1.6 to 51.8), but the sensitivity was low at 23.5% (95% CI 6.8% to 49.9%). An app based on the model was developed.
CONCLUSION: This predictive model may help rationalise the use of investigational resources in IDA, by fast-tracking high-risk cases and, with appropriate safeguards, avoiding invasive investigation altogether in those at ultra-low predicted risk.
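The odds ratios in the results above can be read as multiplicative factors on the odds of GI cancer. The sketch below shows only that arithmetic. It assumes the four ORs come from one multivariable logistic model with no interaction terms, which the abstract does not confirm, and it is not the IDIOM score itself: the abstract gives no intercept, so only odds relative to a reference patient can be computed, not an absolute risk. The function and constant names are hypothetical.

```python
# Odds ratios as reported in the abstract above.
OR_PER_YEAR_OF_AGE = 1.05
OR_MALE = 2.86
OR_PER_G_L_HB_REDUCTION = 1.03
OR_PER_FL_MCV_REDUCTION = 1.03


def relative_odds(extra_years=0.0, male=False,
                  hb_reduction_g_l=0.0, mcv_reduction_fl=0.0):
    """Odds relative to a reference patient (assumes no interaction terms)."""
    return (
        OR_PER_YEAR_OF_AGE ** extra_years
        * (OR_MALE if male else 1.0)
        * OR_PER_G_L_HB_REDUCTION ** hb_reduction_g_l
        * OR_PER_FL_MCV_REDUCTION ** mcv_reduction_fl
    )


# A man 10 years older than an otherwise identical female reference patient.
print(relative_odds(extra_years=10, male=True))
```

Under these assumptions each extra year of age, each g/L drop in Hb and each fl drop in MCV compounds multiplicatively, so modest per-unit odds ratios can add up to a large difference between two patients.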
Affiliation(s)
- Orouba Almilaji
  - Department of Gastroenterology, Poole Hospital NHS Foundation Trust, Poole, UK
  - Clinical Research Unit, Bournemouth University, Bournemouth, Dorset, UK
- Carla Smith
  - Department of Gastroenterology, Poole Hospital NHS Foundation Trust, Poole, UK
- Sue Surgenor
  - Department of Gastroenterology, Poole Hospital NHS Foundation Trust, Poole, UK
- Andrew Clegg
  - Health Technology Assessment Group, University of Central Lancashire, Preston, Lancashire, UK
- Elizabeth Williams
  - Department of Gastroenterology, Poole Hospital NHS Foundation Trust, Poole, UK
- Peter Thomas
  - Clinical Research Unit, Bournemouth University, Bournemouth, Dorset, UK
- Jonathon Snook
  - Department of Gastroenterology, Poole Hospital NHS Foundation Trust, Poole, UK
|
7
|
Particle Swarm Optimized Hybrid Kernel-Based Multiclass Support Vector Machine for Microarray Cancer Data Analysis. BIOMED RESEARCH INTERNATIONAL 2019; 2019:4085725. [PMID: 31998772 PMCID: PMC6973196 DOI: 10.1155/2019/4085725] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/26/2019] [Revised: 10/26/2019] [Accepted: 11/21/2019] [Indexed: 11/17/2022]
Abstract
Determining an optimal decision model is an important but difficult combinatorial task in imbalanced microarray-based cancer classification. Though the multiclass support vector machine (MCSVM) has already made an important contribution in this field, its performance depends solely on three aspects: the penalty factor C, the type of kernel, and the kernel's parameters. To improve the performance of this classifier in microarray-based cancer analysis, this paper proposes the PSO-PCA-LGP-MCSVM model, which is based on particle swarm optimization (PSO), principal component analysis (PCA), and the multiclass support vector machine (MCSVM). The MCSVM is based on a hybrid kernel, i.e., linear-Gaussian-polynomial (LGP), that combines the advantages of three standard kernels (linear, Gaussian, and polynomial) in a novel manner, where the linear kernel is linearly combined with the Gaussian kernel embedding the polynomial kernel. Further, this paper proves that the LGP kernel satisfies the conditions of a valid kernel. To demonstrate the effectiveness of the model, several experiments were conducted and the results were compared with those of three single kernel-based models, namely, PSO-PCA-L-MCSVM (utilizing a linear kernel), PSO-PCA-G-MCSVM (utilizing a Gaussian kernel), and PSO-PCA-P-MCSVM (utilizing a polynomial kernel). For the comparison, two binary and two multiclass imbalanced standard microarray datasets were used. Experimental results in terms of three extended assessment metrics (F-score, G-mean, and Accuracy) reveal the superior global feature extraction, prediction, and learning abilities of this model against the three single kernel-based models.
|
8
|
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086 PMCID: PMC6410075 DOI: 10.3390/genes10020087] [Citation(s) in RCA: 176] [Impact Index Per Article: 29.3] [Received: 12/02/2018] [Revised: 01/08/2019] [Accepted: 01/21/2019] [Indexed: 12/11/2022] Open
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Affiliation(s)
- Bilal Mirza
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA
- Wei Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA
- Jie Wang
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA
- Howard Choi
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA
- Neo Christopher Chung
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
- Peipei Ping
  - NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA
  - Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA
|
9
|
Alcaraz N, List M, Batra R, Vandin F, Ditzel HJ, Baumbach J. De novo pathway-based biomarker identification. Nucleic Acids Res 2017; 45:e151. [PMID: 28934488 PMCID: PMC5766193 DOI: 10.1093/nar/gkx642] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 03/14/2017] [Accepted: 07/13/2017] [Indexed: 02/07/2023] Open
Abstract
Gene expression profiles have been extensively discussed as an aid to guide the therapy by predicting disease outcome for the patients suffering from complex diseases, such as cancer. However, prediction models built upon single-gene (SG) features show poor stability and performance on independent datasets. Attempts to mitigate these drawbacks have led to the development of network-based approaches that integrate pathway information to produce meta-gene (MG) features. Also, MG approaches have only dealt with the two-class problem of good versus poor outcome prediction. Stratifying patients based on their molecular subtypes can provide a detailed view of the disease and lead to more personalized therapies. We propose and discuss a novel MG approach based on de novo pathways, which for the first time have been used as features in a multi-class setting to predict cancer subtypes. Comprehensive evaluation in a large cohort of breast cancer samples from The Cancer Genome Atlas (TCGA) revealed that MGs are considerably more stable than SG models, while also providing valuable insight into the cancer hallmarks that drive them. In addition, when tested on an independent benchmark non-TCGA dataset, MG features consistently outperformed SG models. We provide an easy-to-use web service at http://pathclass.compbio.sdu.dk where users can upload their own gene expression datasets from breast cancer studies and obtain the subtype predictions from all the classifiers.
Affiliation(s)
- Nicolas Alcaraz
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
  - Department of Cancer and Inflammation Research, Institute of Molecular Medicine, University of Southern Denmark, 5000 Odense, Denmark
  - The Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen, Denmark
- Markus List
  - Computational Biology and Applied Algorithms, Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Richa Batra
  - Institute of Computational Biology, Helmholtz Zentrum München, 85764 Munich, Germany
  - Department of Dermatology and Allergy, Technical University of Munich, 80802 Munich, Germany
- Fabio Vandin
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
  - Department of Information and Engineering, University of Padova, 35122 Padova, Italy
- Henrik J Ditzel
  - Department of Cancer and Inflammation Research, Institute of Molecular Medicine, University of Southern Denmark, 5000 Odense, Denmark
  - Department of Oncology, Odense University Hospital, 5000 Odense, Denmark
- Jan Baumbach
  - Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
  - Computational Systems Biology Group, Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
|
10
|
Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1126-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 10/18/2022]
|
11
|
Ocampo-Vega R, Sanchez-Ante G, de Luna MA, Vega R, Falcón-Morales LE, Sossa H. Improving pattern classification of DNA microarray data by using PCA and logistic regression. INTELL DATA ANAL 2016. [DOI: 10.3233/ida-160845] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
Affiliation(s)
- Ricardo Ocampo-Vega
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Gildardo Sanchez-Ante
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Marco A. de Luna
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Roberto Vega
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Luis E. Falcón-Morales
  - Data Visualization and Pattern Recognition Lab, Tecnológico de Monterrey, Campus Guadalajara, Zapopan, México
- Humberto Sossa
  - Instituto Politécnico Nacional-CIC, México, Distrito Federal, México
|
12
|
A novel approach for predicting DNA splice junctions using hybrid machine learning algorithms. Soft comput 2014. [DOI: 10.1007/s00500-014-1550-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
|
13
|
Fortino V, Kinaret P, Fyhrquist N, Alenius H, Greco D. A robust and accurate method for feature selection and prioritization from multi-class OMICs data. PLoS One 2014; 9:e107801. [PMID: 25247789 PMCID: PMC4172658 DOI: 10.1371/journal.pone.0107801] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 05/22/2014] [Accepted: 08/22/2014] [Indexed: 11/18/2022] Open
Abstract
Selecting relevant features is a common task in most OMICs data analysis, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) and the multivariate (wrapper) approach. The stability of the selected lists of features is an often neglected but very important requirement. If the same features are selected in multiple independent iterations, they are more likely to be reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and greater stability compared with other Random Forest-based feature selection methods.
Affiliation(s)
- Vittorio Fortino
  - Unit of Systems Toxicology, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
  - Nanosafety Centre, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
- Pia Kinaret
  - Unit of Systems Toxicology, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
  - Nanosafety Centre, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
- Nanna Fyhrquist
  - Unit of Systems Toxicology, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
  - Nanosafety Centre, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
- Harri Alenius
  - Unit of Systems Toxicology, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
  - Nanosafety Centre, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
- Dario Greco
  - Unit of Systems Toxicology, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
  - Nanosafety Centre, Finnish Institute of Occupational Health (FIOH), Helsinki, Finland
|