1
Wang S, Kim SY, Sohn KA. ClearF++: Improved Supervised Feature Scoring Using Feature Clustering in Class-Wise Embedding and Reconstruction. Bioengineering (Basel) 2023; 10:824. [PMID: 37508851 PMCID: PMC10376817 DOI: 10.3390/bioengineering10070824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 06/28/2023] [Accepted: 07/04/2023] [Indexed: 07/30/2023] Open
Abstract
Feature selection methods are essential for accurate disease classification and identifying informative biomarkers. While information-theoretic methods have been widely used, they often exhibit limitations such as high computational costs. Our previously proposed method, ClearF, addresses these issues by using reconstruction error from low-dimensional embeddings as a proxy for the entropy term in the mutual information. However, ClearF still has limitations, including a nontransparent bottleneck layer selection process, which can result in unstable feature selection. To address these limitations, we propose ClearF++, which simplifies the bottleneck layer selection and incorporates feature-wise clustering to enhance biomarker detection. We compare its performance with other commonly used methods such as MultiSURF and IFS, as well as ClearF, across multiple benchmark datasets. Our results demonstrate that ClearF++ consistently outperforms these methods in terms of prediction accuracy and stability, even with limited samples. We also observe that employing the Deep Embedded Clustering (DEC) algorithm for feature-wise clustering improves performance, indicating its suitability for handling complex data structures with limited samples. ClearF++ offers an improved biomarker prioritization approach with enhanced prediction performance and faster execution. Its stability and effectiveness with limited samples make it particularly valuable for biomedical data analysis.
Affiliation(s)
- Sehee Wang
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
- So Yeon Kim
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
  - Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea
- Kyung-Ah Sohn
  - Department of Artificial Intelligence, Ajou University, Suwon 16499, Republic of Korea
  - Department of Software and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea
2
Speller J, Staerk C, Mayr A. Robust statistical boosting with quantile-based adaptive loss functions. Int J Biostat 2022:ijb-2021-0127. [PMID: 35950232 DOI: 10.1515/ijb-2021-0127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 06/20/2022] [Indexed: 11/15/2022]
Abstract
We combine robust loss functions with statistical boosting algorithms in an adaptive way to perform variable selection and predictive modelling for potentially high-dimensional biomedical data. To achieve robustness against outliers in the outcome variable (vertical outliers), we consider different composite robust loss functions together with base-learners for linear regression. For composite loss functions, such as the Huber loss and the Bisquare loss, a threshold parameter has to be specified that controls the robustness. In the context of boosting algorithms, we propose an approach that adapts the threshold parameter of composite robust losses in each iteration to the current sizes of residuals, based on a fixed quantile level. We compared the performance of our approach to classical M-regression, boosting with standard loss functions or the lasso regarding prediction accuracy and variable selection in different simulated settings: the adaptive Huber and Bisquare losses led to a better performance when the outcome contained outliers or was affected by specific types of corruption. For non-corrupted data, our approach yielded a similar performance to boosting with the efficient L2 loss or the lasso. Also in the analysis of skewed KRT19 protein expression data based on gene expression measurements from human cancer cell lines (NCI-60 cell line panel), boosting with the new adaptive loss functions performed favourably compared to standard loss functions or competing robust approaches regarding prediction accuracy and resulted in very sparse models.
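The adaptive threshold idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the Huber threshold is set to the empirical quantile of the current absolute residuals at a fixed level `tau`, and the names `quantile`, `huber_negative_gradient`, and `adaptive_huber_step` are hypothetical. The data values are made up for illustration.

```python
import math


def quantile(values, tau):
    """Empirical quantile with linear interpolation, tau in [0, 1]."""
    s = sorted(values)
    if not s:
        raise ValueError("values must not be empty")
    pos = tau * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def huber_negative_gradient(residuals, delta):
    """Negative gradient of the Huber loss w.r.t. the fit.

    For |r| <= delta this is r itself; beyond that it is clipped to +/- delta,
    which limits the influence of large (outlying) residuals.
    """
    return [max(-delta, min(delta, r)) for r in residuals]


def adaptive_huber_step(y, fit, tau=0.9):
    """One boosting-style step: recompute delta from the current residuals.

    Assumption: delta is the tau-quantile of the absolute residuals.
    """
    residuals = [yi - fi for yi, fi in zip(y, fit)]
    delta = quantile([abs(r) for r in residuals], tau)
    return delta, huber_negative_gradient(residuals, delta)


# Made-up outcome with one vertical outlier (100) and a zero initial fit.
delta, grad = adaptive_huber_step([1, 2, 3, 4, 100], [0, 0, 0, 0, 0], tau=0.5)
print(delta, grad)
```

In a full boosting loop the base-learners would be fitted to `grad` and the fit updated before the next call, so the threshold shrinks as the residuals shrink.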
Affiliation(s)
- Jan Speller
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Christian Staerk
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
- Andreas Mayr
  - Medical Faculty, Institute of Medical Biometrics, Informatics and Epidemiology (IMBIE), University of Bonn, Bonn, Germany
3
M Ascensión A, Ibáñez-Solé O, Inza I, Izeta A, Araúzo-Bravo MJ. Triku: a feature selection method based on nearest neighbors for single-cell data. Gigascience 2022; 11:6547682. [PMID: 35277963 PMCID: PMC8917514 DOI: 10.1093/gigascience/giac017] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/06/2021] [Revised: 09/24/2021] [Indexed: 01/03/2023] Open
Abstract
Background Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Most of the current feature selection methods are based on general univariate descriptors of the data such as the dispersion or the percentage of zeros. Despite the use of correction methods, the generality of these feature selection methods biases the genes selected towards highly expressed genes, instead of the genes defining the cell populations of the dataset. Results Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-nearest neighbor graph. The expression of these genes is higher than the expected expression if the k-cells were chosen at random. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on adjusted Rand index, normalized mutual information, supervised classification, and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms and contain fewer ribosomal and mitochondrial genes. Conclusion Triku is developed in Python 3 and is available at https://github.com/alexmascension/triku.
Affiliation(s)
- Alex M Ascensión
  - Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
  - Biodonostia Health Research Institute, Tissue Engineering Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
- Olga Ibáñez-Solé
  - Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
  - Biodonostia Health Research Institute, Tissue Engineering Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
- Iñaki Inza
  - Intelligent Systems Group, Computer Science Faculty, University of the Basque Country, Donostia-San Sebastian, 20018, Spain
- Ander Izeta
  - Biodonostia Health Research Institute, Tissue Engineering Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
- Marcos J Araúzo-Bravo
  - Biodonostia Health Research Institute, Computational Biology and Systems Biomedicine Group, Paseo Dr. Begiristain, s/n, Donostia-San Sebastian, 20014, Spain
  - Max Planck Institute for Molecular Biomedicine, Roentgenstr. 20, 48149 Muenster, Germany
  - IKERBASQUE, Basque Foundation for Science, Euskadi plaza 5, Bilbao, 48009, Spain
  - Department of Cell Biology and Histology, Faculty of Medicine and Nursing, University of Basque Country (UPV/EHU), 48940 Leioa, Spain
4
Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. High-dimensional datasets are characterized by a huge feature space, of which only a few features are significant for analysis. Thus, extracting the significant features is crucial. There are various techniques available for feature selection; among them, the filter techniques are significant in this community, as they can be used with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve the performance of the model. However, the application of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. To avoid these issues, this research considers a combination of feature reduction (CFR), designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced in different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used for ranking all reduction combinations and identifying the superior filter combination among them.
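As an illustration of the ranking step mentioned in this abstract, the following is a minimal sketch of the standard TOPSIS procedure (vector normalization, weighting, distances to the ideal and anti-ideal points, closeness score). It is not taken from the paper: the function name `topsis`, the criteria (accuracy as a benefit, number of retained features as a cost), the equal weights, and all numbers are assumptions chosen for illustration.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a TOPSIS closeness score in [0, 1] for each alternative (row).

    matrix  : list of rows, one per alternative, one column per criterion
    weights : one weight per criterion
    benefit : True if larger is better for that criterion, False if smaller is
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(n_crit)]
        for row in matrix
    ]
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the anti-ideal solution
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Hypothetical filter pipelines: (classification accuracy, features retained).
pipelines = [[0.90, 10], [0.80, 50], [0.70, 100]]
scores = topsis(pipelines, weights=[0.5, 0.5], benefit=[True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

The alternative with the highest closeness score is ranked first; changing the weights shifts the trade-off between the criteria.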
5
Machine learning algorithms for diabetes detection: a comparative evaluation of performance of algorithms. EVOLUTIONARY INTELLIGENCE 2021. [DOI: 10.1007/s12065-021-00685-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
6
Shemirani R, Wenric S, Kenny E, Ambite JL. EPS: Automated Feature Selection in Case-Control Studies using Extreme Pseudo-Sampling. Bioinformatics 2021; 37:3372-3373. [PMID: 33774671 DOI: 10.1093/bioinformatics/btab214] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2020] [Revised: 03/24/2021] [Accepted: 03/26/2021] [Indexed: 11/14/2022] Open
Abstract
SUMMARY Finding informative predictive features in high dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use, and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. AVAILABILITY The software package for Python is available online at https://github.com/roohy/eps.
Affiliation(s)
- Ruhollah Shemirani
  - Information Sciences Institute, University of Southern California, Marina del Rey, US
- Stephane Wenric
  - Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, US
- Eimear Kenny
  - Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, US
- José Luis Ambite
  - Information Sciences Institute, University of Southern California, Marina del Rey, US
7
Volkova A, Ruggles KV. Predictive Metagenomic Analysis of Autoimmune Disease Identifies Robust Autoimmunity and Disease Specific Microbial Signatures. Front Microbiol 2021; 12:621310. [PMID: 33746917 PMCID: PMC7969817 DOI: 10.3389/fmicb.2021.621310] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/26/2020] [Accepted: 02/11/2021] [Indexed: 12/12/2022] Open
Abstract
Within the last decade, numerous studies have demonstrated changes in the gut microbiome associated with specific autoimmune diseases. Due to differences in study design, data quality control, analysis and statistical methods, many results of these studies are inconsistent and incomparable. To better understand the relationship between the intestinal microbiome and autoimmunity, we have completed a comprehensive re-analysis of 42 studies focusing on the gut microbiome in 12 autoimmune diseases to identify a microbial signature predictive of multiple sclerosis (MS), inflammatory bowel disease (IBD), rheumatoid arthritis (RA) and general autoimmune disease using both 16S rRNA sequencing data and shotgun metagenomics data. To do this, we used four machine learning algorithms, random forest, eXtreme Gradient Boosting (XGBoost), ridge regression, and support vector machine with radial kernel and recursive feature elimination to rank disease predictive taxa comparing disease vs. healthy participants and pairwise comparisons of each disease. Comparing the performance of these models, we found the two tree-based methods, XGBoost and random forest, most capable of handling sparse multidimensional data, to consistently produce the best results. Through this modeling, we identified a number of taxa consistently identified as dysregulated in a general autoimmune disease model including Odoribacter, Lachnospiraceae Clostridium, and Mogibacteriaceae implicating all as potential factors connecting the gut microbiome to autoimmune response. Further, we computed pairwise comparison models to identify disease specific taxa signatures highlighting a role for Peptostreptococcaceae and Ruminococcaceae Gemmiger in IBD and Akkermansia, Butyricicoccus, and Mogibacteriaceae in MS. We then connected a subset of these taxa with potential metabolic alterations based on metagenomic/metabolomic correlation analysis, identifying 215 metabolites associated with autoimmunity-predictive taxa.
Affiliation(s)
- Angelina Volkova
  - Institute for Systems Genetics, New York University Grossman School of Medicine, New York, NY, United States
- Kelly V. Ruggles
  - Institute for Systems Genetics, New York University Grossman School of Medicine, New York, NY, United States
  - Division of Translational Medicine, Department of Medicine, New York University Grossman School of Medicine, New York, NY, United States
8
Fortino V, Scala G, Greco D. Feature set optimization in biomarker discovery from genome-scale data. Bioinformatics 2020; 36:3393-3400. [PMID: 32119073 DOI: 10.1093/bioinformatics/btaa144] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/27/2019] [Revised: 02/20/2020] [Accepted: 02/26/2020] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION Omics technologies have the potential to facilitate the discovery of new biomarkers. However, only a few omics-derived biomarkers have been successfully translated into clinical applications to date. Feature selection is a crucial step in this process that identifies small sets of features with high predictive power. Models consisting of a limited number of features are not only more robust in analytical terms, but also ensure cost effectiveness and clinical translatability of new biomarker panels. Here we introduce GARBO, a novel multi-island adaptive genetic algorithm to simultaneously optimize accuracy and set size in omics-driven biomarker discovery problems. RESULTS Compared to existing methods, GARBO enables the identification of biomarker sets that best optimize the trade-off between classification accuracy and number of biomarkers. We tested GARBO and six alternative selection methods on two highly relevant topics in precision medicine: cancer patient stratification and drug sensitivity prediction. We found multivariate biomarker models from different omics data types such as mRNA, miRNA, copy number variation, mutation and DNA methylation. The top performing models were evaluated by using two different strategies: the Pareto-based selection, and the weighted sum between accuracy and set size (w = 0.5). Pareto-based preferences show the ability of the proposed algorithm to search minimal subsets of relevant features that can be used to model accurate random forest-based classification systems. Moreover, GARBO systematically identified, on larger omics data types, such as gene expression and DNA methylation, biomarker panels exhibiting higher classification accuracy or employing a number of features much lower than those discovered with other methods. These results were confirmed on independent datasets. AVAILABILITY AND IMPLEMENTATION github.com/Greco-Lab/GARBO. CONTACT dario.greco@tuni.fi.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
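The two model-evaluation strategies named in this abstract (Pareto-based selection and a weighted sum of accuracy and set size with w = 0.5) can be sketched as follows. This is not GARBO's code. The exact form of the weighted sum is not given in the abstract, so the formula below, which rewards accuracy and penalizes set size relative to an assumed maximum size, is an assumption; the function names and candidate values are hypothetical.

```python
def pareto_front(candidates):
    """Return the non-dominated (accuracy, size) pairs.

    A candidate is dominated if another one is at least as accurate and at
    least as small, and strictly better in one of the two.
    """
    front = []
    for i, (acc_i, size_i) in enumerate(candidates):
        dominated = any(
            acc_j >= acc_i and size_j <= size_i and (acc_j > acc_i or size_j < size_i)
            for j, (acc_j, size_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((acc_i, size_i))
    return front


def weighted_score(accuracy, size, max_size, w=0.5):
    """Assumed weighted sum: w * accuracy + (1 - w) * (1 - size / max_size)."""
    return w * accuracy + (1 - w) * (1 - size / max_size)


# Made-up biomarker panels: (classification accuracy, number of features).
panels = [(0.90, 20), (0.85, 5), (0.80, 10), (0.90, 25)]
front = pareto_front(panels)
max_size = max(size for _, size in panels)
best = max(front, key=lambda p: weighted_score(p[0], p[1], max_size))
print(front, best)
```

The Pareto front keeps every panel that is not beaten on both criteria at once, while the weighted sum collapses the trade-off into a single number and so picks one panel.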
Affiliation(s)
- V Fortino
  - Institute of Biomedicine, University of Eastern Finland, Kuopio 70210, Finland
- G Scala
  - Faculty of Medicine and Health Technology, Tampere University, Tampere 33100, Finland
  - Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland
- D Greco
  - Faculty of Medicine and Health Technology, Tampere University, Tampere 33100, Finland
  - Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland
9
Torres R, Judson-Torres RL. Research Techniques Made Simple: Feature Selection for Biomarker Discovery. J Invest Dermatol 2020; 139:2068-2074.e1. [PMID: 31543209 DOI: 10.1016/j.jid.2019.07.682] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 04/11/2019] [Revised: 06/21/2019] [Accepted: 07/05/2019] [Indexed: 11/19/2022]
Abstract
Molecular biomarkers can be powerful tools for aiding in the efficiency and precision of clinical decision-making. Feature selection methods, machine-learning, and biostatistics have been applied to discover subsets of molecular markers that identify target classes of clinical cases. For example, in the field of dermatology, these approaches have been used to develop predictive models that identify skin diseases, ranging from melanoma to psoriasis, based upon a variety of biomarkers. However, a continuous increase in the variety and size of datasets from which candidate biomarkers can be derived, and limitations in the computational tools used to analyze them, have hindered the interpretability of biomarker discovery studies. In this article, the various methods of feature selection are described along with the important steps needed to properly validate the performance of the selected methods. Limitations and suggestions toward uses of these methods are discussed.
Affiliation(s)
- Rodrigo Torres
  - Department of Dermatology, University of California, San Francisco, California, USA
- Robert L Judson-Torres
  - Department of Dermatology, University of California, San Francisco, California, USA
  - Department of Dermatology, University of Utah School of Medicine, Salt Lake City, Utah, USA
  - Huntsman Cancer Institute, University of Utah, Salt Lake City, Utah, USA
10
Robison HM, Escalante P, Valera E, Erskine CL, Auvil L, Sasieta HC, Bushell C, Welge M, Bailey RC. Precision immunoprofiling to reveal diagnostic signatures for latent tuberculosis infection and reactivation risk stratification. Integr Biol (Camb) 2020; 11:16-25. [PMID: 30722034 DOI: 10.1093/intbio/zyz001] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/12/2018] [Revised: 12/05/2018] [Accepted: 01/02/2019] [Indexed: 11/12/2022]
Abstract
Latent tuberculosis infection (LTBI) is estimated in nearly one quarter of the world's population, and of those immunocompetent and infected ~10% will proceed to active tuberculosis (TB). Current diagnostics cannot definitively identify LTBI and provide no insight into reactivation risk, thereby defining an unmet diagnostic challenge of incredible global significance. We introduce a new machine-learning-driven approach to LTBI diagnostics that leverages a high throughput, multiplexed cytokine detection technology and powerful bioinformatics to reveal multi-marker signatures for LTBI diagnosis and risk stratification. This approach is enabled through an individualized normalization procedure that allows disease-relevant biomarker signatures to be revealed despite heterogeneity in basal immune response. Specifically, cytokines secreted from antigen-challenged peripheral blood mononuclear cells were detected using silicon photonic sensor arrays and multidimensional data correlation of individually-normalized immune responses revealed signatures important for LTBI status. These results demonstrate a powerful combination of multiplexed biomarker detection technologies, precision immune normalization, and feature selection algorithms that revealed positively correlated multi-biomarker signatures for LTBI status and reactivation risk stratification from a relatively simple blood-based assay.
Affiliation(s)
- Heather M Robison
  - Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Avenue, Urbana, IL, USA
- Patricio Escalante
  - Mycobacterial and Bronchiectasis Clinic, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Mayo Clinic, and Mayo Clinic Center for Tuberculosis, 200 First Street SW, Rochester, MN, USA
  - Mayo-Illinois Alliance for Technology-Based Healthcare
- Enrique Valera
  - Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Avenue, Urbana, IL, USA
- Courtney L Erskine
  - Mycobacterial and Bronchiectasis Clinic, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Mayo Clinic, and Mayo Clinic Center for Tuberculosis, 200 First Street SW, Rochester, MN, USA
- Loretta Auvil
  - National Center for Supercomputing Applications, 1205 W. Clark St., Urbana, IL, USA
- Humberto C Sasieta
  - Mycobacterial and Bronchiectasis Clinic, Division of Pulmonary and Critical Care Medicine, Department of Medicine, Mayo Clinic, and Mayo Clinic Center for Tuberculosis, 200 First Street SW, Rochester, MN, USA
- Colleen Bushell
  - Mayo-Illinois Alliance for Technology-Based Healthcare
  - National Center for Supercomputing Applications, 1205 W. Clark St., Urbana, IL, USA
- Michael Welge
  - Mayo-Illinois Alliance for Technology-Based Healthcare
  - National Center for Supercomputing Applications, 1205 W. Clark St., Urbana, IL, USA
- Ryan C Bailey
  - Department of Chemistry, University of Illinois at Urbana-Champaign, 600 South Mathews Avenue, Urbana, IL, USA
  - Mayo-Illinois Alliance for Technology-Based Healthcare
  - Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, MI, USA
11
Schachtschneider KM, Welge ME, Auvil LS, Chaki S, Rund LA, Madsen O, Elmore MR, Johnson RW, Groenen MA, Schook LB. Altered Hippocampal Epigenetic Regulation Underlying Reduced Cognitive Development in Response to Early Life Environmental Insults. Genes (Basel) 2020; 11:genes11020162. [PMID: 32033187 PMCID: PMC7074491 DOI: 10.3390/genes11020162] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/13/2019] [Revised: 01/30/2020] [Accepted: 02/01/2020] [Indexed: 12/13/2022] Open
Abstract
The hippocampus is involved in learning and memory and undergoes significant growth and maturation during the neonatal period. Environmental insults during this developmental timeframe can have lasting effects on brain structure and function. This study assessed hippocampal DNA methylation and gene transcription from two independent studies reporting reduced cognitive development stemming from early life environmental insults (iron deficiency and porcine reproductive and respiratory syndrome virus (PRRSv) infection) using porcine biomedical models. In total, 420 differentially expressed genes (DEGs) were identified between the reduced cognition and control groups, including genes involved in neurodevelopment and function. Gene ontology (GO) terms enriched for DEGs were associated with immune responses, angiogenesis, and cellular development. In addition, 116 differentially methylated regions (DMRs) were identified, which overlapped 125 genes. While no GO terms were enriched for genes overlapping DMRs, many of these genes are known to be involved in neurodevelopment and function, angiogenesis, and immunity. The observed altered methylation and expression of genes involved in neurological function suggest reduced cognition in response to early life environmental insults is due to altered cholinergic signaling and calcium regulation. Finally, two DMRs overlapped with two DEGs, VWF and LRRC32, which are associated with blood brain barrier permeability and regulatory T-cell activation, respectively. These results support the role of altered hippocampal DNA methylation and gene expression in early life environmentally-induced reductions in cognitive development across independent studies.
Affiliation(s)
- Kyle M. Schachtschneider
  - Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA
  - Department of Biochemistry and Molecular Genetics, University of Illinois at Chicago, Chicago, IL 60607, USA
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
- Michael E. Welge
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
- Loretta S. Auvil
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
- Sulalita Chaki
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
- Laurie A. Rund
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
- Ole Madsen
  - Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands
- Monica R.P. Elmore
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
- Rodney W. Johnson
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
- Martien A.M. Groenen
  - Animal Breeding and Genomics, Wageningen University, 6708 Wageningen, The Netherlands
- Lawrence B. Schook
  - Department of Radiology, University of Illinois at Chicago, Chicago, IL 60607, USA
  - National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
  - Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, IL 616280, USA
12
Valdés MG, Galván-Femenía I, Ripoll VR, Duran X, Yokota J, Gavaldà R, Rafael-Palou X, de Cid R. Pipeline design to identify key features and classify the chemotherapy response on lung cancer patients using large-scale genetic data. BMC SYSTEMS BIOLOGY 2018; 12:97. [PMID: 30458782 PMCID: PMC6245589 DOI: 10.1186/s12918-018-0615-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
BACKGROUND During the last decade, interest in applying machine learning algorithms to genomic data has increased in many bioinformatics applications. Analyzing this type of data entails difficulties in managing high-dimensional data, handling class imbalance for knowledge extraction, identifying important features and classifying individuals. In this study, we propose a general framework to tackle these challenges with different machine learning algorithms and techniques. We apply the configuration of this framework to lung cancer patients, identifying genetic signatures for classifying response to drug treatment. We intersect these relevant SNPs with the GWAS Catalog of the National Human Genome Research Institute and explore the RegulomeDB and GTEx databases for functional analysis purposes. RESULTS The machine learning based solution proposed in this study is a scalable and flexible alternative to the classical uni-variate regression approach to analyze large-scale data. From 36 experiments executed using the machine learning framework design, we obtain good classification performance from the top 5 models with the highest cross-validation score and the smallest standard deviation. One thousand two hundred twenty-four SNPs corresponding to the key features from the top 20 models (cross validation F1 mean >= 0.65) were compared with the GWAS Catalog, finding no intersection with genome-wide significant reported hits. From these, new genetic signatures in MAE, CEP104, PRKCZ and ADRB2 show relevant biological regulatory functionality related to lung physiology. CONCLUSIONS We have defined a machine learning framework using an unbalanced large data-set of SNP-arrays and imputed genotyping data from a pharmacogenomics study in lung cancer patients subjected to first-line platinum-based treatment. This approach found genome signals with no genome-wide significance in the uni-variate regression approach (GWAS Catalog) that are valuable for classifying patients, only a few of them with related biological function. The effects of these variants can be explained by the recently proposed omnigenic model hypothesis, which states that complex traits can be influenced not only by the "core genes", mainly found by the genome-wide significant SNPs, but also by the rest of the genes outside of the "core pathways", with apparently unrelated biological functionality.
Affiliation(s)
- María Gabriela Valdés
  - Eurecat. Technology Centre of Catalonia, Av. Diagonal 177, 9th floor, Barcelona, 08018 Spain
- Iván Galván-Femenía
  - PMPPC-IGTP. Programa de Medicina Predictiva i Personalitzada del Càncer - Institut Germans Trias i Pujol (IGTP). Genomes for Life - GCAT lab Group, Badalona, Spain
- Vicent Ribas Ripoll
  - Eurecat. Technology Centre of Catalonia, Av. Diagonal 177, 9th floor, Barcelona, 08018 Spain
- Xavier Duran
  - PMPPC-IGTP. Programa de Medicina Predictiva i Personalitzada del Càncer - Institut Germans Trias i Pujol (IGTP). Genomes for Life - GCAT lab Group, Badalona, Spain
- Jun Yokota
  - PMPPC-IGTP. Programa de Medicina Predictiva i Personalitzada del Càncer - Institut Germans Trias i Pujol (IGTP). CancerGenome Biology, Badalona, Spain
- Ricard Gavaldà
  - Universitat Politècnica de Catalunya, Barcelona, Spain
  - Barcelona Graduate School of Mathematics, BGSMath, Barcelona, Spain
- Xavier Rafael-Palou
  - Eurecat. Technology Centre of Catalonia, Av. Diagonal 177, 9th floor, Barcelona, 08018 Spain
- Rafael de Cid
  - PMPPC-IGTP. Programa de Medicina Predictiva i Personalitzada del Càncer - Institut Germans Trias i Pujol (IGTP). Genomes for Life - GCAT lab Group, Badalona, Spain
|
13
|
Association of specific gene mutations derived from machine learning with survival in lung adenocarcinoma. PLoS One 2018; 13:e0207204. [PMID: 30419062 PMCID: PMC6231670 DOI: 10.1371/journal.pone.0207204] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/16/2018] [Accepted: 10/27/2018] [Indexed: 12/20/2022] Open
Abstract
Lung cancer is the second most common cancer in the United States and the leading cause of mortality in cancer patients. Biomarkers predicting survival of patients with lung cancer have a profound effect on patient prognosis and treatment. However, predictive biomarkers for survival and their relevance for lung cancer are not yet well known. The objective of this study was to perform machine learning with data from The Cancer Genome Atlas of patients with lung adenocarcinoma (LUAD) to find survival-specific gene mutations that could be used as survival-predicting biomarkers. To identify survival-specific mutations according to various clinical factors, four feature selection methods (information gain, chi-squared test, minimum redundancy maximum relevance, and correlation) were used. Extracted survival-specific mutations of LUAD were applied individually or as a group for Kaplan-Meier survival analysis. Mutations in MMRN2 and GMPPA were significantly associated with patient mortality while those in ZNF560 and SETX were associated with patient survival. Mutations in DNAJC2 and MMRN2 showed significant negative association with overall survival while mutations in ZNF560 showed significant positive association with overall survival. Mutations in MMRN2 showed significant negative association with disease-free survival while mutations in DRD3 and ZNF560 showed positive association with disease-free survival. Mutations in DRD3, SETX, and ZNF560 showed significant positive association with survival in patients with LUAD while the opposite was true for mutations in DNAJC2, GMPPA, and MMRN2. These gene mutations were also found in other cohorts of LUAD, lung squamous cell carcinoma, and small cell lung cancer. In LUAD of the Pan-Lung Cancer cohort, mutations in GMPPA, DNAJC2, and MMRN2 showed significant negative associations with survival of patients while mutations in DRD3 and SETX showed significant positive association with survival.
In this study, machine learning was conducted to obtain information necessary to discover specific gene mutations associated with the survival of patients with LUAD. Mutations in the above six genes could predict survival rate and disease-free survival rate in patients with LUAD. Thus, they are important biomarker candidates for prognosis.
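As a rough illustration of one of the four feature selection methods named in this abstract (the chi-squared test), the sketch below scores binary mutation features against a binary survival label using Pearson's chi-squared statistic on a 2x2 table. It is not the authors' pipeline: the function names, the placeholder feature names `GENE_A` and `GENE_B`, and the toy data are all hypothetical, and no continuity correction or p-value computation is included.

```python
def chi_squared_2x2(feature, label):
    """Pearson chi-squared statistic for two binary (0/1) sequences.

    No continuity correction is applied; this is an illustrative score only.
    """
    if len(feature) != len(label) or not feature:
        raise ValueError("feature and label must be non-empty and equal length")
    n = len(feature)
    # Observed counts: obs[f][y] = samples with feature value f and label y.
    obs = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        obs[f][y] += 1
    row = [sum(obs[0]), sum(obs[1])]
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:  # skip empty margins to avoid division by zero
                stat += (obs[i][j] - expected) ** 2 / expected
    return stat


def rank_features(columns, label):
    """Return (name, score) pairs sorted from most to least associated."""
    scores = {name: chi_squared_2x2(values, label)
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 1 = mutated / deceased, 0 = wild type / alive.
toy_label = [1, 1, 1, 1, 0, 0, 0, 0]
toy_columns = {
    "GENE_A": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly associated with the label
    "GENE_B": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}

if __name__ == "__main__":
    for name, score in rank_features(toy_columns, toy_label):
        print(name, score)
```

A higher score suggests a stronger association between mutation status and outcome; in practice a library routine and a multiple-testing correction would normally be used instead of this hand-rolled version.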
|
14
|
Mousavian M, Chen J, Greening S. Feature Selection and Imbalanced Data Handling for Depression Detection. Brain Inform 2018. [DOI: 10.1007/978-3-030-05587-5_33] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022] Open
|
15
|
Optimal and Novel Hybrid Feature Selection Framework for Effective Data Classification. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/978-981-10-4762-6_48] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 04/12/2023]
|
16
|
Paiva JS, Cardoso J, Pereira T. Supervised learning methods for pathological arterial pulse wave differentiation: A SVM and neural networks approach. Int J Med Inform 2017; 109:30-38. [PMID: 29195703 DOI: 10.1016/j.ijmedinf.2017.10.011] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 03/05/2017] [Revised: 09/29/2017] [Accepted: 10/16/2017] [Indexed: 11/17/2022]
Abstract
OBJECTIVE The main goal of this study was to develop an automatic method, based on supervised learning, able to distinguish healthy from pathologic arterial pulse waves (APW), and those two from noisy waveforms (non-relevant segments of the signal), using data acquired during a clinical examination with a novel optical system. MATERIALS AND METHODS The APW dataset analysed was composed of signals acquired in a clinical environment from a total of 213 subjects, including healthy volunteers and non-healthy patients. The signals were parameterised by means of 39 pulse features: morphologic, time domain statistics, cross-correlation features, and wavelet features. The Multiclass Support Vector Machine Recursive Feature Elimination (SVM RFE) method was used to select the most relevant features. A comparative study was performed in order to evaluate the performance of two classifiers: Support Vector Machine (SVM) and Artificial Neural Network (ANN). RESULTS AND DISCUSSION SVM achieved statistically significantly better performance for this problem, with an average accuracy of 0.9917±0.0024 and an F-Measure of 0.9925±0.0019, in comparison with ANN, which reached values of 0.9847±0.0032 and 0.9852±0.0031 for Accuracy and F-Measure, respectively. A significant difference was observed between the performances obtained with the SVM classifier using different numbers of features from the original set available. CONCLUSION The comparison between SVM and ANN reaffirmed the higher performance of SVM. The results obtained in this study showed the potential of the proposed method to differentiate those three important signal outcomes (healthy, pathologic and noise) and to reduce bias associated with clinical diagnosis of cardiovascular disease using APW.
Affiliation(s)
- Joana S Paiva
  - Institute for Systems and Computer Engineering, Technology and Science (INESC TEC), Rua Dr. Roberto Frias, 4200, Porto, Portugal; Physics and Astronomy Department, Sciences Faculty, University of Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
- João Cardoso
  - LIBPhys-UC, Physics Department, University of Coimbra, Rua Larga, 3004-516 Coimbra, Portugal
- Tânia Pereira
  - LIBPhys-UC, Physics Department, University of Coimbra, Rua Larga, 3004-516 Coimbra, Portugal
|
17
|
Frères P, Wenric S, Boukerroucha M, Fasquelle C, Thiry J, Bovy N, Struman I, Geurts P, Collignon J, Schroeder H, Kridelka F, Lifrange E, Jossa V, Bours V, Josse C, Jerusalem G. Circulating microRNA-based screening tool for breast cancer. Oncotarget 2016; 7:5416-28. [PMID: 26734993 PMCID: PMC4868695 DOI: 10.18632/oncotarget.6786] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 06/24/2015] [Accepted: 12/05/2015] [Indexed: 12/20/2022] Open
Abstract
Circulating microRNAs (miRNAs) are increasingly recognized as powerful biomarkers in several pathologies, including breast cancer. Here, their plasma levels were measured for use as an alternative screening procedure to mammography for breast cancer diagnosis. A plasma miRNA profile was determined by RT-qPCR in a cohort of 378 women. A diagnostic model was designed based on the expression of 8 miRNAs measured first in a profiling cohort composed of 41 primary breast cancers and 45 controls, and further validated in diverse cohorts composed of 108 primary breast cancers, 88 controls, 35 breast cancers in remission, 31 metastatic breast cancers and 30 gynecologic tumors. A receiver operating characteristic curve derived from the 8-miRNA random forest based diagnostic tool exhibited an area under the curve of 0.81. The accuracy of the diagnostic tool remained unchanged when age and tumor stage were taken into account. The miRNA signature correctly identified patients with metastatic breast cancer. The use of the classification model on cohorts of patients with breast cancers in remission and with gynecologic cancers yielded prediction distributions similar to that of the control group. Using a multivariate supervised learning method and a set of 8 circulating miRNAs, we designed an accurate, minimally invasive screening tool for breast cancer.
Affiliation(s)
- Pierre Frères
  - University Hospital (CHU), Department of Medical Oncology, Liège, Belgium; University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Stéphane Wenric
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Meriem Boukerroucha
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Corinne Fasquelle
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Jérôme Thiry
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Nicolas Bovy
  - University of Liège, GIGA-Research, Laboratory of Molecular Angiogenesis, Liège, Belgium
- Ingrid Struman
  - University of Liège, GIGA-Research, Laboratory of Molecular Angiogenesis, Liège, Belgium
- Pierre Geurts
  - University of Liège, GIGA-Research, Department of EE and CS, Liège, Belgium
- Joëlle Collignon
  - University Hospital (CHU), Department of Medical Oncology, Liège, Belgium
- Hélène Schroeder
  - University Hospital (CHU), Department of Medical Oncology, Liège, Belgium
- Eric Lifrange
  - University Hospital (CHU), Department of Senology, Liège, Belgium
- Véronique Jossa
  - Clinique Saint-Vincent (CHC), Department of Pathology, Liège, Belgium
- Vincent Bours
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Claire Josse
  - University of Liège, GIGA-Research, Laboratory of Human Genetics, Liège, Belgium
- Guy Jerusalem
  - University Hospital (CHU), Department of Medical Oncology, Liège, Belgium
|
18
|
Liu Y, Balagurunathan Y, Atwater T, Antic S, Li Q, Walker RC, Smith GT, Massion PP, Schabath MB, Gillies RJ. Radiological Image Traits Predictive of Cancer Status in Pulmonary Nodules. Clin Cancer Res 2016; 23:1442-1449. [PMID: 27663588 DOI: 10.1158/1078-0432.ccr-15-3102] [Citation(s) in RCA: 69] [Impact Index Per Article: 8.6] [Received: 02/15/2016] [Revised: 07/29/2016] [Accepted: 08/17/2016] [Indexed: 12/30/2022]
Abstract
Purpose: We propose a systematic methodology to quantify incidentally identified pulmonary nodules based on observed radiological traits (semantics) quantified on a point scale and a machine-learning method using these data to predict cancer status. Experimental Design: We investigated 172 patients who had low-dose CT images, with 102 and 70 patients grouped into training and validation cohorts, respectively. On the images, 24 radiological traits were systematically scored and a linear classifier was built to relate the traits to malignant status. The model was formed both with and without size descriptors to remove bias due to nodule size. The multivariate pairs formed on the training set were tested on an independent validation data set to evaluate their performance. Results: The best 4-feature set that included a size measurement (set 1) was short axis, contour, concavity, and texture, which had an area under the receiver operator characteristic curve (AUROC) of 0.88 (accuracy = 81%, sensitivity = 76.2%, specificity = 91.7%). If size measures were excluded, the four best features (set 2) were location, fissure attachment, lobulation, and spiculation, which had an AUROC of 0.83 (accuracy = 73.2%, sensitivity = 73.8%, specificity = 81.7%) in predicting malignancy in primary nodules. The validation test AUROC was 0.8 (accuracy = 74.3%, sensitivity = 66.7%, specificity = 75.6%) and 0.74 (accuracy = 71.4%, sensitivity = 61.9%, specificity = 75.5%) for sets 1 and 2, respectively. Conclusions: Radiological image traits are useful in predicting malignancy in lung nodules. These semantic traits can be used in combination with size-based measures to enhance prediction accuracy and reduce false-positives. Clin Cancer Res; 23(6); 1442-9. ©2016 AACR.
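For readers unfamiliar with the metrics reported in this abstract, the following sketch shows how AUROC, accuracy, sensitivity, and specificity can be computed from classifier scores and binary labels. It is an assumption-laden illustration, not the study's evaluation code: the function names, the 0.5 threshold, and the toy scores are hypothetical, and the AUROC is computed with the pairwise (Mann-Whitney) formulation, which suits small examples but scales quadratically.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a random positive scores higher than a random
    negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_metrics(scores, labels, threshold=0.5):
    """Accuracy, sensitivity, and specificity at a fixed decision threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical toy scores (higher = more likely malignant) and labels.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    print("AUROC:", auroc(toy_scores, toy_labels))
    print(confusion_metrics(toy_scores, toy_labels, threshold=0.5))
```

AUROC is threshold-free, whereas accuracy, sensitivity, and specificity depend on the chosen cut-off, which is why an abstract may report all four for the same model.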
Affiliation(s)
- Ying Liu
  - Department of Radiology, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center of Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin, China; Cancer Imaging and Metabolism, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Yoganand Balagurunathan
  - Cancer Imaging and Metabolism, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Thomas Atwater
  - Thoracic Program, Vanderbilt-Ingram Comprehensive Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee
- Sanja Antic
  - Thoracic Program, Vanderbilt-Ingram Comprehensive Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee
- Qian Li
  - Department of Radiology, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center of Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin, China; Cancer Imaging and Metabolism, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Ronald C Walker
  - Thoracic Program, Vanderbilt-Ingram Comprehensive Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee; Department of Radiology, Vanderbilt University School of Medicine, Nashville, Tennessee; Veterans Affairs Medical Center, Nashville, Tennessee
- Gary T Smith
  - Department of Radiology, Vanderbilt University School of Medicine, Nashville, Tennessee; Veterans Affairs Medical Center, Nashville, Tennessee
- Pierre P Massion
  - Thoracic Program, Vanderbilt-Ingram Comprehensive Cancer Center, Vanderbilt University School of Medicine, Nashville, Tennessee; Department of Radiology, Vanderbilt University School of Medicine, Nashville, Tennessee; Veterans Affairs Medical Center, Nashville, Tennessee
- Matthew B Schabath
  - Cancer Epidemiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
- Robert J Gillies
  - Cancer Imaging and Metabolism, H. Lee Moffitt Cancer Center and Research Institute, Tampa, Florida
|
19
|
Nayyeri M, Sharifi Noghabi H. Cancer classification by correntropy-based sparse compact incremental learning machine. GENE REPORTS 2016. [DOI: 10.1016/j.genrep.2016.01.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
20
|
Mohammadi M, Sharifi Noghabi H, Abed Hodtani G, Rajabi Mashhadi H. Robust and stable gene selection via Maximum–Minimum Correntropy Criterion. Genomics 2016; 107:83-87. [DOI: 10.1016/j.ygeno.2015.12.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 09/14/2015] [Revised: 12/13/2015] [Accepted: 12/23/2015] [Indexed: 11/17/2022]
|
21
|
Madahian B, Roy S, Bowman D, Deng LY, Homayouni R. A Bayesian approach for inducing sparsity in generalized linear models with multi-category response. BMC Bioinformatics 2015; 16 Suppl 13:S13. [PMID: 26423345 PMCID: PMC4597416 DOI: 10.1186/1471-2105-16-s13-s13] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND The dimension and complexity of high-throughput gene expression data create many challenges for downstream analysis. Several approaches exist to reduce the number of variables with respect to small sample sizes. In this study, we utilized the Generalized Double Pareto (GDP) prior to induce sparsity in a Bayesian Generalized Linear Model (GLM) setting. The approach was evaluated using a publicly available microarray dataset containing 99 samples corresponding to four different prostate cancer subtypes. RESULTS A hierarchical Sparse Bayesian GLM using GDP prior (SBGG) was developed to take into account the progressive nature of the response variable. We obtained an average overall classification accuracy between 82.5% and 94%, which was higher than Support Vector Machine, Random Forest or a Sparse Bayesian GLM using double exponential priors. Additionally, SBGG outperforms the other 3 methods in correctly identifying pre-metastatic stages of cancer progression, which can prove extremely valuable for therapeutic and diagnostic purposes. Importantly, using Geneset Cohesion Analysis Tool, we found that the top 100 genes produced by SBGG had an average functional cohesion p-value of 2.0E-4 compared to 0.007 to 0.131 produced by the other methods. CONCLUSIONS Using GDP in a Bayesian GLM model applied to cancer progression data results in better subclass prediction. In particular, the method identifies pre-metastatic stages of prostate cancer with substantially better accuracy and produces more functionally relevant gene sets.
|