1
Garbulowski M, Diamanti K, Smolińska K, Baltzer N, Stoll P, Bornelöv S, Øhrn A, Feuk L, Komorowski J. R.ROSETTA: an interpretable machine learning framework. BMC Bioinformatics 2021; 22:110. [PMID: 33676405 PMCID: PMC7937228 DOI: 10.1186/s12859-021-04049-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 11/19/2020] [Accepted: 02/24/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND: Machine learning involves strategies and algorithms that may assist bioinformatics analyses in terms of data mining and knowledge discovery. In several applications, viz. in Life Sciences, it is often more important to understand how a prediction was obtained rather than knowing what prediction was made. To this end so-called interpretable machine learning has been recently advocated. In this study, we implemented an interpretable machine learning package based on the rough set theory. An important aim of our work was provision of statistical properties of the models and their components.
RESULTS: We present the R.ROSETTA package, which is an R wrapper of ROSETTA framework. The original ROSETTA functions have been improved and adapted to the R programming environment. The package allows for building and analyzing non-linear interpretable machine learning models. R.ROSETTA gathers combinatorial statistics via rule-based modelling for accessible and transparent results, well-suited for adoption within the greater scientific community. The package also provides statistics and visualization tools that facilitate minimization of analysis bias and noise. The R.ROSETTA package is freely available at https://github.com/komorowskilab/R.ROSETTA. To illustrate the usage of the package, we applied it to a transcriptome dataset from an autism case-control study. Our tool provided hypotheses for potential co-predictive mechanisms among features that discerned phenotype classes. These co-predictors represented neurodevelopmental and autism-related genes.
CONCLUSIONS: R.ROSETTA provides new insights for interpretable machine learning analyses and knowledge-based systems. We demonstrated that our package facilitated detection of dependencies for autism-related genes. Although the sample application of R.ROSETTA illustrates transcriptome data analysis, the package can be used to analyze any data organized in decision tables.
Affiliation(s)
- Mateusz Garbulowski
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
- Klev Diamanti
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
  - Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Karolina Smolińska
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
- Nicholas Baltzer
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
  - Department of Research, Cancer Registry of Norway, Oslo, Norway
- Patricia Stoll
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
  - Department of Biosystems Science and Engineering, ETH Zurich, Zurich, Switzerland
- Susanne Bornelöv
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
  - Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK
- Lars Feuk
  - Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala, Sweden
- Jan Komorowski
  - Department of Cell and Molecular Biology, Uppsala University, Uppsala, Sweden
  - Swedish Collegium for Advanced Study, Uppsala, Sweden
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
  - Washington National Primate Research Center, Seattle, WA, USA
2
Unveiling new interdependencies between significant DNA methylation sites, gene expression profiles and glioma patients survival. Sci Rep 2018. [PMID: 29535343 PMCID: PMC5849697 DOI: 10.1038/s41598-018-22829-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 12/15/2022] Open
Abstract
In order to find clinically useful prognostic markers for glioma patients’ survival, we employed Monte Carlo Feature Selection and Interdependencies Discovery (MCFS-ID) algorithm on DNA methylation (HumanMethylation450 platform) and RNA-seq datasets from The Cancer Genome Atlas (TCGA) for 88 patients observed until death. The input features were ranked according to their importance in predicting patients’ longer (400+ days) or shorter (≤400 days) survival without prior classification of the patients. Interestingly, out of the 65 most important features found, 63 are methylation sites, and only two mRNAs. Moreover, 61 out of the 63 methylation sites are among those detected by the 450 k array technology, while being absent in the HumanMethylation27. The most important methylation feature (cg15072976) overlaps with the RE1 Silencing Transcription Factor (REST) binding site, and was confirmed to intersect with the REST binding motif in human U87 glioma cells. Six additional methylation sites from the top 63 overlap with REST sites. We found that the methylation status of the cg15072976 site affects transcription factor binding in U87 cells in gel shift assay. The cg15072976 methylation status discriminates ≤400 and 400+ patients in an independent dataset from TCGA and shows positive association with survival time as evidenced by Kaplan-Meier plots.
3
Chen L, Li J, Zhang YH, Feng K, Wang S, Zhang Y, Huang T, Kong X, Cai YD. Identification of gene expression signatures across different types of neural stem cells with the Monte-Carlo feature selection method. J Cell Biochem 2017; 119:3394-3403. [PMID: 29130544 DOI: 10.1002/jcb.26507] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 07/08/2017] [Accepted: 11/09/2017] [Indexed: 02/03/2023]
Abstract
Adult neural stem cells (NSCs) are a group of multi-potent, self-renewing progenitor cells that contribute to the generation of new neurons and oligodendrocytes. Three subtypes of NSCs can be isolated based on the stages of the NSC lineage, including quiescent neural stem cells (qNSCs), activated neural stem cells (aNSCs) and neural progenitor cells (NPCs). Although it is widely accepted that these three groups of NSCs play different roles in the development of the nervous system, their molecular signatures are poorly understood. In this study, we applied the Monte-Carlo Feature Selection (MCFS) method to identify the gene expression signatures, which can yield a Matthews correlation coefficient (MCC) value of 0.918 with a support vector machine evaluated by ten-fold cross-validation. In addition, some classification rules yielded by the MCFS program for distinguishing the above three subtypes were reported. Our results not only demonstrate a high classification capacity and subtype-specific gene expression patterns but also quantitatively reflect the pattern of the gene expression levels across the NSC lineage, providing insight into deciphering the molecular basis of NSC differentiation.
Affiliation(s)
- Lei Chen
  - School of Life Sciences, Shanghai University, Shanghai, P.R. China
  - College of Information Engineering, Shanghai Maritime University, Shanghai, P.R. China
- JiaRui Li
  - School of Life Sciences, Shanghai University, Shanghai, P.R. China
- Yu-Hang Zhang
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- KaiYan Feng
  - Department of Computer Science, Guangdong AIB Polytechnic, Guangzhou, Guangdong, P.R. China
- ShaoPeng Wang
  - School of Life Sciences, Shanghai University, Shanghai, P.R. China
- YunHua Zhang
  - Anhui province key lab of Farmland Ecological Conservation and Pollution Prevention, School of Resources and Environment, Anhui Agricultural University, Hefei, P.R. China
- Tao Huang
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- Xiangyin Kong
  - Institute of Health Sciences, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- Yu-Dong Cai
  - School of Life Sciences, Shanghai University, Shanghai, P.R. China
4
Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data. J Intell Inf Syst 2017. [DOI: 10.1007/s10844-017-0446-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 10/20/2022]