1. Furxhi I, Bengalli R, Motta G, Mantecca P, Kose O, Carriere M, Haq EU, O’Mahony C, Blosi M, Gardini D, Costa A. Data-Driven Quantitative Intrinsic Hazard Criteria for Nanoproduct Development in a Safe-by-Design Paradigm: A Case Study of Silver Nanoforms. ACS Appl Nano Mater 2023; 6:3948-3962. PMID: 36938492; PMCID: PMC10012170; DOI: 10.1021/acsanm.3c00173.
Abstract
The current European (EU) policies, that is, the Green Deal, envisage safe and sustainable practices for chemicals, including nanoforms (NFs), at the earliest stages of innovation. A theoretical safe and sustainable by design (SSbD) framework has been established through EU collaborative efforts toward the definition of quantitative criteria in each SSbD dimension, namely the human and environmental safety dimension and the environmental, social, and economic sustainability dimensions. In this study, we target the safety dimension, and we demonstrate the journey toward quantitative intrinsic hazard criteria derived from findable, accessible, interoperable, and reusable data. Data were curated and merged for the development of new approach methodologies, that is, quantitative structure-activity relationship models based on regression and classification machine learning algorithms, with the intent to predict a hazard class. The models utilize system-dependent (i.e., hydrodynamic size and polydispersity index) and non-system-dependent (i.e., elemental composition and core size) nanoscale features in combination with biological in vitro attributes and experimental conditions for various silver NFs, functional antimicrobial textiles, and cosmetics applications. In a second step, interpretable rules (criteria) followed by a certainty factor were obtained by exploiting a Bayesian network structure crafted by expert reasoning. The probabilistic model shows a predictive capability of ≈78% (average accuracy across all hazard classes). In this work, we show how we shifted from the conceptualization of the SSbD framework toward realistic implementation with pragmatic instances. This study reveals (i) quantitative intrinsic hazard criteria to be considered in the safety aspects during the synthesis stage, (ii) the challenges within, and (iii) future directions for the generation and distillation of such criteria that can feed SSbD paradigms. Specifically, the criteria can guide material engineers to synthesize NFs that are inherently safer than alternative nanoformulations, at the earliest stages of innovation, while the models enable fast and cost-efficient in silico toxicological screening of previously synthesized NFs and of hypothetical, yet-to-be-synthesized NFs.
Affiliation(s)
- Irini Furxhi: Transgero Ltd, Limerick V42V384, Ireland; Department of Accounting and Finance, Kemmy Business School, University of Limerick, Limerick V94T9PX, Ireland
- Rossella Bengalli: Department of Earth and Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, Milano 20126, Italy
- Giulia Motta: Department of Earth and Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, Milano 20126, Italy
- Paride Mantecca: Department of Earth and Environmental Sciences, University of Milano-Bicocca, Piazza della Scienza 1, Milano 20126, Italy
- Ozge Kose: Univ. Grenoble Alpes, CEA, CNRS, Grenoble INP, IRIG, SYMMES, Grenoble 38000, France
- Marie Carriere: Univ. Grenoble Alpes, CEA, CNRS, Grenoble INP, IRIG, SYMMES, Grenoble 38000, France
- Ehtsham Ul Haq: Department of Physics and Bernal Institute, University of Limerick, Limerick V94TC9PX, Ireland
- Charlie O’Mahony: Department of Physics and Bernal Institute, University of Limerick, Limerick V94TC9PX, Ireland
- Magda Blosi: Istituto di Scienza e Tecnologia dei Materiali Ceramici (CNR-ISTEC), Via Granarolo 64, Faenza 48018, Ravenna, Italy
- Davide Gardini: Istituto di Scienza e Tecnologia dei Materiali Ceramici (CNR-ISTEC), Via Granarolo 64, Faenza 48018, Ravenna, Italy
- Anna Costa: Istituto di Scienza e Tecnologia dei Materiali Ceramici (CNR-ISTEC), Via Granarolo 64, Faenza 48018, Ravenna, Italy
2. Evolutionary Algorithm for Improving Decision Tree with Global Discretization in Manufacturing. Sensors 2021; 21(8):2849. PMID: 33919558; PMCID: PMC8074051; DOI: 10.3390/s21082849.
Abstract
Due to recent advances in the industrial Internet of Things (IoT) in manufacturing, the vast amount of data from sensors has triggered the need to leverage such big data for fault detection. In particular, interpretable machine learning techniques, such as tree-based algorithms, have drawn attention as a way to implement reliable manufacturing systems and identify the root causes of faults. However, tree-based models trade off accuracy against interpretability. In order to improve the tree's performance while maintaining its interpretability, an evolutionary algorithm for the discretization of multiple attributes, called Decision tree Improved by Multiple sPLits with Evolutionary algorithm for Discretization (DIMPLED), is proposed. Experimental results with two real-world sensor datasets showed that the decision tree improved by DIMPLED outperformed the single-decision-tree models (C4.5 and CART) that are widely used in practice, and it proved competitive with ensemble methods, which combine multiple decision trees. Even though the ensemble methods could produce slightly better performance, the proposed DIMPLED has a more interpretable structure while maintaining an appropriate performance level.
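The evolutionary-discretization idea can be sketched in a few lines (a toy, not the DIMPLED algorithm: DIMPLED evolves cut points over multiple attributes jointly, while this sketch optimizes a single cut point for a one-split stump, and all names and data are invented):

```python
import random

def stump_accuracy(threshold, xs, ys):
    """Accuracy of a one-cut discretization used by a decision stump:
    predict class 1 whenever x >= threshold."""
    hits = sum((x >= threshold) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)

def evolve_cut(xs, ys, pop_size=20, generations=30, seed=0):
    """Toy evolutionary search for a single global cut point."""
    rng = random.Random(seed)
    s = sorted(xs)
    # Seed the population with midpoints between neighbours, then random fill.
    pop = [(a + b) / 2 for a, b in zip(s, s[1:])]
    pop += [rng.uniform(s[0], s[-1]) for _ in range(pop_size - len(pop))]
    for _ in range(generations):
        pop.sort(key=lambda t: stump_accuracy(t, xs, ys), reverse=True)
        parents = pop[: pop_size // 2]                     # truncation selection
        sigma = (s[-1] - s[0]) * 0.05
        pop = parents + [p + rng.gauss(0, sigma) for p in parents]  # Gaussian mutation
    return max(pop, key=lambda t: stump_accuracy(t, xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]  # separable at any cut in (4, 6]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
best = evolve_cut(xs, ys)
print(stump_accuracy(best, xs, ys))  # 1.0
```

Because the fittest half survives each generation, any perfect cut found early is never lost, which is the elitist behaviour a real implementation would also want.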
3. Esmaeilyfard R, Paknahad M, Dokohaki S. Sex classification of first molar teeth in cone beam computed tomography images using data mining. Forensic Sci Int 2020; 318:110633. PMID: 33279763; DOI: 10.1016/j.forsciint.2020.110633.
Abstract
OBJECTIVE The teeth have been used as a supplementary tool for sex differentiation as they are resistant to post-mortem degradation. The present study aimed to develop a novel informatics framework for predicting sex from linear tooth dimension measurements obtained from cone beam computed tomography (CBCT) images. METHOD AND MATERIALS A clinical workflow using different machine learning methods was employed to predict sex. The CBCT images of 485 subjects (245 men and 240 women) were evaluated for sex differentiation. Nine parameters were measured in both the buccolingual and mesiodistal aspects of the teeth. We applied our dataset to Naïve Bayesian (NB), Random Forest (RF), and Support Vector Machine (SVM) classifiers for prediction. Genetic feature selection was used to discover the features relevant to sex classification. RESULTS The 10-fold cross-validation results indicated that NB had higher accuracy than SVM and RF for sex classification. The genetic algorithm (GA) indicated that the model could fit the data without using enamel thickness and pulp height. The average classification accuracy of our clinical workflow was 92.31%. CONCLUSION The results showed that NB was the best method for sex classification. The first molar teeth showed an acceptable level of accuracy for sex prediction. Therefore, these odontometric parameters can be applied as an additional tool for sex determination in forensic anthropology.
Affiliation(s)
- Rasool Esmaeilyfard: Computer Engineering and Information Technology Department, Shiraz University of Technology, Shiraz, Iran
- Maryam Paknahad: Oral and Dental Disease Research Center, Oral and Maxillofacial Radiology Department, Dental School, Shiraz University of Medical Sciences, Shiraz, Iran
- Sonia Dokohaki: Oral and Maxillofacial Radiology Department, Dental School, Shiraz University of Medical Sciences, Shiraz, Iran
4. Li Y, Jann T, Vera-Licona P. Benchmarking time-series data discretization on inference methods. Bioinformatics 2019; 35:3102-3109. PMID: 30657860; DOI: 10.1093/bioinformatics/btz036.
Abstract
SUMMARY The rapid development in quantitatively measuring DNA, RNA and protein has generated a great interest in the development of reverse-engineering methods, that is, data-driven approaches to infer the network structure or dynamical model of the system. Many reverse-engineering methods require discrete quantitative data as input, while many experimental data are continuous. Some studies have started to reveal the impact that the choice of data discretization has on the performance of reverse-engineering methods. However, more comprehensive studies are still greatly needed to systematically and quantitatively understand the impact that discretization methods have on inference methods. Furthermore, there is an urgent need for systematic comparative methods that can help select between discretization methods. In this work, we consider four published intracellular networks inferred with their respective time-series datasets. We discretized the data using different discretization methods. Across all datasets, changing the data discretization to a more appropriate one improved the reverse-engineering methods' performance. We observed no universal best discretization method across different time-series datasets. Thus, we propose DiscreeTest, a two-step evaluation metric for ranking discretization methods for time-series data. The underlying assumption of DiscreeTest is that an optimal discretization method should preserve the dynamic patterns observed in the original data across all variables. We used the same datasets and networks to show that DiscreeTest is able to identify an appropriate discretization among several candidate methods. To our knowledge, this is the first time that a method for benchmarking and selecting an appropriate discretization method for time-series data has been proposed. 
AVAILABILITY AND IMPLEMENTATION All the datasets, reverse-engineering methods and source code used in this paper are available in the Vera-Licona lab GitHub repository: https://github.com/VeraLiconaResearchGroup/Benchmarking_TSDiscretizations. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
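The paper's core assumption, that a good discretization preserves the dynamic patterns of the raw series, can be sketched with a toy retention score (illustrative only: DiscreeTest's published two-step metric is different, and all names and data here are invented):

```python
def equal_width(series, k):
    """Discretize into k equal-width bins over the observed range."""
    lo, hi = min(series), max(series)
    w = (hi - lo) / k
    return [min(int((v - lo) / w), k - 1) for v in series]

def dynamics_retained(series, labels):
    """Fraction of consecutive raw-value changes still visible as a label change."""
    moves = [(a != b, la != lb)
             for a, b, la, lb in zip(series, series[1:], labels, labels[1:])]
    raw_changes = [label_changed for raw_changed, label_changed in moves if raw_changed]
    return sum(raw_changes) / len(raw_changes)

series = [1, 2, 3, 4, 5, 6]
coarse = equal_width(series, 2)  # [0, 0, 0, 1, 1, 1]
fine = equal_width(series, 3)    # [0, 0, 1, 1, 2, 2]
print(dynamics_retained(series, coarse))  # 0.2
print(dynamics_retained(series, fine))    # 0.4
```

A coarser binning erases most of the series' movement, which is exactly the kind of information loss a discretization-ranking metric needs to penalize.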
Affiliation(s)
- Yuezhe Li: R.D. Berlin Center for Cell Analysis and Modeling, University of Connecticut School of Medicine, Farmington, CT, USA
- Tiffany Jann: Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Paola Vera-Licona: Center for Quantitative Medicine; Department of Cell Biology; Department of Pediatrics; and Institute for Systems Genomics, University of Connecticut School of Medicine, Farmington, CT, USA
5. Ray SS, Misra S. Genetic algorithm for assigning weights to gene expressions using functional annotations. Comput Biol Med 2018; 104:149-162. PMID: 30472497; DOI: 10.1016/j.compbiomed.2018.11.011.
Abstract
A method, named genetic algorithm for assigning weights to gene expressions using functional annotations (GAAWGEFA), is developed to assign proper weights to the gene expressions at each time point. The weights are estimated using functional annotations of the genes in a genetic algorithm framework. The method captures gene similarity better than other existing methods because it takes advantage of the genes' existing functional annotations. The weight combination for the expressions at different time points is determined by maximizing the fitness function of GAAWGEFA in terms of the positive predictive value (PPV) for the top 10,000 gene pairs. The performance of the proposed method is primarily compared with biweight midcorrelation (BICOR) and original expression values for six Saccharomyces cerevisiae datasets and one Bacillus subtilis dataset. The utility of GAAWGEFA is shown in predicting the functions of 48 unclassified genes (using a p-value cutoff of 10^-13) from Saccharomyces cerevisiae microarray data, where the expressions are weighted using GAAWGEFA and clustered using the k-medoids algorithm. The related code along with various parameters is available at http://sampa.droppages.com/GAAWGEFA.html.
Affiliation(s)
- Shubhra Sankar Ray: Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India
- Sampa Misra: Machine Intelligence Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata 700108, India
6. Balasubramanian JB, Gopalakrishnan V. Tunable structure priors for Bayesian rule learning for knowledge integrated biomarker discovery. World J Clin Oncol 2018; 9:98-109. PMID: 30254965; PMCID: PMC6153126; DOI: 10.5306/wjco.v9.i5.98.
Abstract
AIM To develop a framework to incorporate background domain knowledge into classification rule learning for knowledge discovery in biomedicine.
METHODS Bayesian rule learning (BRL) is a rule-based classifier that uses a greedy best-first search over a space of Bayesian belief networks (BN) to find the optimal BN explaining the input dataset, and then infers classification rules from this BN. BRL uses a Bayesian score to evaluate the quality of BNs. In this paper, we extended the Bayesian score to include informative structure priors, which encode our prior domain knowledge about the dataset. We call this extension BRLp. The structure prior has a λ hyperparameter that allows the user to tune the degree of incorporation of prior knowledge into the model learning process. We studied the effect of λ on model learning using a simulated dataset and a real-world lung cancer prognostic biomarker dataset, measuring the degree of incorporation of our specified prior knowledge, and we monitored its effect on model predictive performance. Finally, we compared BRLp to other state-of-the-art classifiers commonly used in biomedicine.
RESULTS We evaluated the degree of incorporation of prior knowledge into BRLp with simulated data by measuring the graph edit distance between the true data-generating model and the model learned by BRLp. We specified the true model using informative structure priors. We observed that increasing the value of λ increased the influence of the specified structure priors on model learning; with a sufficiently large λ, BRLp returned the true model. This also led to a gain in predictive performance measured by area under the receiver operating characteristic curve (AUC). We then obtained a publicly available real-world lung cancer prognostic biomarker dataset and specified a known biomarker from the literature [the epidermal growth factor receptor (EGFR) gene]. We again observed that larger values of λ led to an increased incorporation of EGFR into the final BRLp model. This relevant background knowledge also led to a gain in AUC.
CONCLUSION BRLp enables tunable structure priors to be incorporated during Bayesian classification rule learning, integrating data and knowledge, as demonstrated using lung cancer biomarker data.
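The tunable structure prior can be illustrated with a toy score (a sketch under assumptions: the likelihood values are made up, and BRLp's actual score is a Bayesian marginal likelihood over network parameters, not constants):

```python
def structure_log_prior(edges, prior_edges, lam):
    """Log of an informative structure prior: every edge shared with the
    user-specified prior network adds lam to the log prior.
    lam = 0 recovers a uniform (uninformative) prior; larger lam pulls
    model selection toward the specified structure."""
    return lam * len(set(edges) & set(prior_edges))

def network_score(log_likelihood, edges, prior_edges, lam):
    """Posterior-proportional score: data fit plus structure prior."""
    return log_likelihood + structure_log_prior(edges, prior_edges, lam)

# Hypothetical candidates: A fits the data slightly better,
# B contains the edge an expert expects (e.g. EGFR -> outcome).
prior = {("EGFR", "outcome")}
a = {("geneX", "outcome")}
b = {("EGFR", "outcome")}

print(network_score(-100.0, a, prior, lam=0.0) > network_score(-101.0, b, prior, lam=0.0))  # True
print(network_score(-101.0, b, prior, lam=2.0) > network_score(-100.0, a, prior, lam=2.0))  # True
```

At λ = 0 the data alone decides; at λ = 2 the prior knowledge tips the selection toward the expert-specified edge, mirroring the paper's observation that larger λ increases incorporation of EGFR.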
Affiliation(s)
- Jeya Balaji Balasubramanian: Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Vanathi Gopalakrishnan: Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15206, United States
7. Sriwanna K, Boongoen T, Iam-On N. Graph clustering-based discretization approach to microarray data. Knowl Inf Syst 2018. DOI: 10.1007/s10115-018-1249-z.
8. Finding optimum width of discretization for gene expressions using functional annotations. Comput Biol Med 2017; 90:59-67. DOI: 10.1016/j.compbiomed.2017.09.010.
9. Rajappan S, Rangasamy D. Estimation of incomplete values in heterogeneous attribute large datasets using discretized Bayesian max–min ant colony optimization. Knowl Inf Syst 2017. DOI: 10.1007/s10115-017-1123-4.
10. Liu Y, Gopalakrishnan V. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data. Data 2017; 2(1):8. PMID: 28243594; PMCID: PMC5325161; DOI: 10.3390/data2010008.
Abstract
Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
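Two of the four compared methods, mean imputation and k-nearest-neighbors imputation, can be sketched in a few lines of stdlib Python (illustrative only: names and data are invented, and the paper's experiments use full implementations, not this toy):

```python
import math

def mean_impute(X, j):
    """Replace every missing value (None) in column j with the column's observed mean."""
    observed = [row[j] for row in X if row[j] is not None]
    m = sum(observed) / len(observed)
    return [[m if (c == j and v is None) else v for c, v in enumerate(row)] for row in X]

def knn_impute(X, i, j, k=2):
    """Fill X[i][j] with the mean of feature j over the k complete rows nearest
    to row i (Euclidean distance on row i's observed columns)."""
    target = X[i]
    cols = [c for c, v in enumerate(target) if v is not None]
    complete = [row for row in X if row is not target and None not in row]
    complete.sort(key=lambda row: math.dist([target[c] for c in cols],
                                            [row[c] for c in cols]))
    return sum(row[j] for row in complete[:k]) / k

X = [[1.0, 10.0],
     [2.0, None],   # value to impute
     [3.0, 30.0],
     [9.0, 90.0]]
print(mean_impute(X, 1)[1][1])   # ~43.33: global mean, ignores locality
print(knn_impute(X, 1, 1, k=2))  # 20.0: the two nearest rows give a local estimate
```

The contrast between the global and the local estimate is the usual argument for kNN over mean imputation; the paper's finding is that, on its dataset, this difference did not translate into better downstream models.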
Affiliation(s)
- Yuzhe Liu: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, USA; Medical Scientist Training Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Vanathi Gopalakrishnan: Department of Biomedical Informatics; Medical Scientist Training Program; Department of Computational and Systems Biology; and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
11. Learning Parsimonious Classification Rules from Gene Expression Data Using Bayesian Networks with Local Structure. Data 2017; 2(1):5. PMID: 28331847; PMCID: PMC5358670; DOI: 10.3390/data2010005.
Abstract
The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, grows combinatorially with the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using area under the ROC curve (AUC) and accuracy, and model parsimony by the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS, and the state-of-the-art C4.5 decision tree algorithm across 10-fold cross-validation on ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. We also conduct a feasibility study to demonstrate the general applicability of our BRL methods to the newer RNA-sequencing gene-expression data.
12. Gallo CA, Cecchini RL, Carballido JA, Micheletto S, Ponzoni I. Discretization of gene expression data revised. Brief Bioinform 2015; 17:758-70. PMID: 26438418; DOI: 10.1093/bib/bbv074.
Abstract
Gene expression measurements represent the most important source of biological data used to unveil the interaction and functionality of genes. In this regard, several data mining and machine learning algorithms have been proposed that require, in a number of cases, some kind of data discretization to perform the inference. Selection of an appropriate discretization process has a major impact on the design and outcome of the inference algorithms, as there are a number of relevant issues that need to be considered. This study presents a revision of the current state-of-the-art discretization techniques, together with the key subjects that need to be considered when designing or selecting a discretization approach for gene expression data.
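Two of the most common unsupervised techniques such reviews cover, equal-width and equal-frequency binning, behave very differently on skewed expression values (an illustrative stdlib sketch; the data are invented):

```python
def equal_width_bins(values, k):
    """Split the observed range [min, max] into k bins of equal width."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    return [min(int((v - lo) / w), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Rank-based binning: each bin receives roughly len(values)/k observations."""
    order = sorted(range(len(values)), key=values.__getitem__)
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(values)
    return labels

# Skewed 'expression' vector: a single outlier dominates the range.
expr = [0.1, 0.2, 0.3, 0.4, 10.0]
print(equal_width_bins(expr, 2))      # [0, 0, 0, 0, 1] -- the outlier claims a bin alone
print(equal_frequency_bins(expr, 2))  # [0, 0, 0, 1, 1] -- bins balanced by count
```

This sensitivity to outliers and skew is one of the design issues the review flags when selecting a discretization approach for expression data.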
13. Gopalakrishnan V, Menon PG, Madan S. cMRI-BED: A novel informatics framework for cardiac MRI biomarker extraction and discovery applied to pediatric cardiomyopathy classification. Biomed Eng Online 2015; 14 Suppl 2:S7. PMID: 26329721; PMCID: PMC4547147; DOI: 10.1186/1475-925x-14-s2-s7.
Abstract
Background Pediatric cardiomyopathies are a rare, yet heterogeneous group of pathologies of the myocardium that are routinely examined clinically using Cardiovascular Magnetic Resonance Imaging (cMRI). This powerful, non-invasive gold-standard tool yields high-resolution temporal images that characterize myocardial tissue. The complexities associated with the annotation of images and extraction of markers necessitate the development of efficient workflows to acquire, manage, and transform this data into actionable knowledge for patient care, to reduce mortality and morbidity. Methods We develop and test a novel informatics framework called cMRI-BED for biomarker extraction and discovery from such complex pediatric cMRI data, which includes the use of a suite of tools for image processing, marker extraction, and predictive modeling. We applied our workflow to obtain and analyze a dataset of 83 de-identified cases and controls containing cMRI-derived biomarkers for classifying positive versus negative findings of cardiomyopathy in children. Bayesian rule learning (BRL) methods were applied to derive understandable models in the form of propositional rules with posterior probabilities pertaining to their validity. Popular machine learning methods in the WEKA data mining toolkit were applied using default parameters to assess the cross-validation performance of this dataset using accuracy and percentage area under the ROC curve (AUC) measures. Results The best 10-fold cross-validation predictive performance obtained on this cMRI-derived biomarker dataset was 80.72% accuracy and 79.6% AUC, by a BRL decision tree model, which is promising for this type of rare data. Moreover, we were able to verify that myocardial delayed enhancement (MDE) status, which is known to be an important qualitative factor in the classification of cardiomyopathies, is picked up by our rule models as an important variable for prediction. Conclusions Preliminary results show the feasibility of our framework for processing such data while also yielding actionable predictive classification rules that can augment knowledge conveyed in cardiac radiology outcome reports. Interactions between MDE status and other cMRI parameters that are depicted in our rules warrant further investigation and validation. Predictive rules learned from cMRI data to classify positive and negative findings of cardiomyopathy can enhance scientific understanding of the underlying interactions among imaging-derived parameters.
14. Ogoe HA, Visweswaran S, Lu X, Gopalakrishnan V. Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data. BMC Bioinformatics 2015. PMID: 26202217; PMCID: PMC4512094; DOI: 10.1186/s12859-015-0643-8.
Abstract
Background Most ‘transcriptomic’ data from microarrays are generated from small sample sizes compared to the large number of measured biomarkers, making it very difficult to build accurate and generalizable disease state classification models. Integrating information from different, but related, ‘transcriptomic’ data may help build better classification models. However, most proposed methods for integrative analysis of ‘transcriptomic’ data cannot incorporate domain knowledge, which can improve model performance. To this end, we have developed a methodology that leverages transfer rule learning and functional modules, which we call TRL-FM, to capture and abstract domain knowledge in the form of classification rules to facilitate integrative modeling of multiple gene expression data. TRL-FM is an extension of the transfer rule learner (TRL) that we developed previously. The goal of this study was to test our hypothesis that “an integrative model obtained via the TRL-FM approach outperforms traditional models based on single gene expression data sources”. Results To evaluate the feasibility of the TRL-FM framework, we compared the area under the ROC curve (AUC) of models developed with TRL-FM and other traditional methods, using 21 microarray datasets generated from three studies on brain cancer, prostate cancer, and lung disease, respectively. The results show that TRL-FM statistically significantly outperforms TRL as well as traditional models based on single source data. In addition, TRL-FM performed better than other integrative models driven by meta-analysis and cross-platform data merging. Conclusions The capability of utilizing transferred abstract knowledge derived from source data using feature mapping enables the TRL-FM framework to mimic the human process of learning and adaptation when performing related tasks. The novel TRL-FM methodology for integrative modeling of multiple ‘transcriptomic’ datasets is able to intelligently incorporate domain knowledge that traditional methods might disregard, to boost predictive power and generalization performance. In this study, TRL-FM’s abstraction of knowledge is achieved in the form of functional modules, but the overall framework is generalizable in that different approaches of acquiring abstract knowledge can be integrated into this framework. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0643-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Henry A Ogoe: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, USA
- Shyam Visweswaran: Department of Biomedical Informatics; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, USA
- Xinghua Lu: Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, USA
- Vanathi Gopalakrishnan: Department of Biomedical Informatics; Intelligent Systems Program; Department of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, USA
15. Naeini MP, Cooper GF, Hauskrecht M. Binary Classifier Calibration Using a Bayesian Non-Parametric Approach. Proceedings of the SIAM International Conference on Data Mining 2015; 2015:208-216. PMID: 26613068; DOI: 10.1137/1.9781611974010.24.
Abstract
Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in data mining. This paper presents two new non-parametric methods for calibrating the outputs of binary classification models: one based on Bayes optimal selection and one based on Bayesian model averaging. The advantage of these methods is that they are independent of the algorithm used to learn a predictive model and can be applied in a post-processing step, after the model is learned. This makes them applicable to a wide variety of machine learning models and methods. These calibration methods, along with other methods, are tested on a variety of datasets in terms of both discrimination and calibration performance. The results show the new methods either outperform or are comparable to the state-of-the-art calibration methods.
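The classic non-parametric baseline these Bayesian methods build on, histogram binning, fits in a few lines (a sketch with invented data; the paper's contribution is to average over many binnings rather than fix one):

```python
def histogram_binning(scores, labels, n_bins=2):
    """Fit a calibration map: each score bin [i/n, (i+1)/n) is mapped to the
    empirical positive rate observed in that bin."""
    sums = [0] * n_bins
    counts = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    rates = [sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
             for b in range(n_bins)]
    return lambda s: rates[min(int(s * n_bins), n_bins - 1)]

# Overconfident raw scores: the high-score group is right only 75% of the time.
scores = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
calibrate = histogram_binning(scores, labels, n_bins=2)
print(calibrate(0.9))  # 0.75
print(calibrate(0.1))  # 0.25
```

Because the fitted map only reads the model's scores, it is learner-agnostic and runs as a post-processing step, which is the property the abstract highlights.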
16
17
Zaidi AH, Gopalakrishnan V, Kasi PM, Zeng X, Malhotra U, Balasubramanian J, Visweswaran S, Sun M, Flint MS, Davison JM, Hood BL, Conrads TP, Bergman JJ, Bigbee WL, Jobe BA. Evaluation of a 4-protein serum biomarker panel-biglycan, annexin-A6, myeloperoxidase, and protein S100-A9 (B-AMP)-for the detection of esophageal adenocarcinoma. Cancer 2014; 120:3902-13. [PMID: 25100294] [DOI: 10.1002/cncr.28963]
Abstract
BACKGROUND: Esophageal adenocarcinoma (EAC) is associated with a dismal prognosis. The identification of cancer biomarkers can advance the possibility for early detection and better monitoring of tumor progression and/or response to therapy. The authors present results from the development of a serum-based, 4-protein (biglycan, myeloperoxidase, annexin-A6, and protein S100-A9) biomarker panel for EAC.
METHODS: A vertically integrated, proteomics-based biomarker discovery approach was used to identify candidate serum biomarkers for the detection of EAC. Liquid chromatography-tandem mass spectrometry analysis was performed on formalin-fixed, paraffin-embedded tissue samples collected from across the Barrett esophagus (BE)-EAC disease spectrum. The mass spectrometry-based spectral count data were used to guide the selection of candidate serum biomarkers. The serum enzyme-linked immunosorbent assay data were then validated in an independent cohort and used to develop a multiparametric risk-assessment model to predict the presence of disease.
RESULTS: With a minimum threshold of 10 spectral counts, 351 proteins were identified as differentially abundant along the spectrum of BE, high-grade dysplasia, and EAC (P<.05). Eleven proteins from this data set were then tested using enzyme-linked immunosorbent assays in serum samples, of which 5 were significantly elevated in abundance among patients with EAC compared with normal controls, mirroring the trends across the disease spectrum present in the tissue data. Using the serum data, a Bayesian rule-learning predictive model with 4 biomarkers was developed to accurately classify disease class; the cross-validation results for the merged data set yielded an accuracy of 87% and an area under the receiver operating characteristic curve of 93%.
CONCLUSIONS: Serum biomarkers hold significant promise for the early, noninvasive detection of EAC.
Affiliation(s)
- Ali H Zaidi
- Institute for the Treatment of Esophageal and Thoracic Disease, Allegheny Health Network, Pittsburgh, Pennsylvania
18
AbdelRahman SE, Zhang M, Bray BE, Kawamoto K. A three-step approach for the derivation and validation of high-performing predictive models using an operational dataset: congestive heart failure readmission case study. BMC Med Inform Decis Mak 2014; 14:41. [PMID: 24886637] [PMCID: PMC4074427] [DOI: 10.1186/1472-6947-14-41]
Abstract
BACKGROUND: The aim of this study was to propose an analytical approach to developing high-performing predictive models for congestive heart failure (CHF) readmission using an operational dataset with incomplete records and changing data over time.
METHODS: Our analytical approach involves three steps: pre-processing, systematic model development, and risk factor analysis. For pre-processing, variables that were absent in >50% of records were removed, and the dataset was divided into a validation dataset and derivation datasets, which were separated into three temporal subsets based on changes to the data over time. For systematic model development, using the different temporal datasets and the remaining explanatory variables, models were developed by combining (i) statistical analyses to explore the relationships between the validation and derivation datasets; (ii) adjustment methods for handling missing values; (iii) classifiers; (iv) feature selection methods; and (v) discretization methods. We then selected the best derivation dataset and the models with the highest predictive performance. For risk factor analysis, factors in the highest-performing predictive models were analyzed and ranked using (i) statistical analyses of the best derivation dataset, (ii) feature rankers, and (iii) a newly developed algorithm to categorize risk factors as strong, regular, or weak.
RESULTS: The analysis dataset consisted of 2,787 CHF hospitalizations at University of Utah Health Care from January 2003 to June 2013. In this study, we used the complete-case analysis and mean-based imputation adjustment methods; the wrapper subset feature selection method; and four ranking strategies based on information gain, gain ratio, symmetrical uncertainty, and wrapper subset feature evaluators. The best-performing models resulted from a complete-case analysis derivation dataset combined with the Class-Attribute Contingency Coefficient discretization method and a voting classifier that averaged the results of multinomial logistic regression and voting feature intervals classifiers. Of 42 final model risk factors, discharge disposition, discretized age, and indicators of anemia were the most significant. This model achieved a c-statistic of 86.8%.
CONCLUSION: The proposed three-step analytical approach enhanced predictive model performance for CHF readmissions. It could potentially be leveraged to improve predictive model performance in other areas of clinical medicine.
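The pre-processing step described in the abstract (removing variables absent in more than 50% of records) can be sketched as follows; the record structure and field names are invented for illustration.

```python
# Sketch of the abstract's first pre-processing step: drop any
# variable missing in more than half of the records.

def drop_sparse_variables(records, max_missing=0.5):
    """Remove variables absent (None) in more than max_missing of records."""
    n = len(records)
    variables = {k for r in records for k in r}
    keep = {
        v for v in variables
        if sum(1 for r in records if r.get(v) is None) / n <= max_missing
    }
    return [{k: r.get(k) for k in keep} for r in records]

# Hypothetical CHF records: 'bnp' is missing in 3 of 4 rows (75%),
# so it is dropped; 'ef' is missing in only 1 of 4 (25%), so it stays.
records = [
    {"age": 71, "bnp": None, "ef": 35},
    {"age": 64, "bnp": None, "ef": None},
    {"age": 80, "bnp": 900, "ef": 50},
    {"age": 59, "bnp": None, "ef": 45},
]
cleaned = drop_sparse_variables(records)
print(sorted(cleaned[0]))  # ['age', 'ef']
```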
Affiliation(s)
- Samir E AbdelRahman
- Department of Biomedical Informatics, University of Utah, 615 Arapeen Way, Suite 208, Salt Lake City, UT 84092, USA.
19
Balasubramanian JB, Visweswaran S, Cooper GF, Gopalakrishnan V. Selective model averaging with Bayesian rule learning for predictive biomedicine. AMIA Jt Summits Transl Sci Proc 2014; 2014:17-22. [PMID: 25717394] [PMCID: PMC4333697]
Abstract
Accurate disease classification and biomarker discovery remain challenging tasks in biomedicine. In this paper, we develop and test a practical approach to combining evidence from multiple models when making predictions using selective Bayesian model averaging of probabilistic rules. This method is implemented within a Bayesian Rule Learning system and compared to model selection when applied to twelve biomedical datasets using the area under the ROC curve measure of performance. Cross-validation results indicate that selective Bayesian model averaging statistically significantly outperforms model selection on average in these experiments, suggesting that combining predictions from multiple models may lead to more accurate quantification of classifier uncertainty. This approach would directly impact the generation of robust predictions on unseen test data, while also increasing knowledge for biomarker discovery and mechanisms that underlie disease.
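The contrast between model selection and model averaging drawn in this abstract can be illustrated with a minimal sketch. This is not the paper's selective Bayesian model averaging over rule models; it only shows how weighted averaging of several models' probabilities differs from trusting the single highest-scoring model. The weights and predictions below are invented.

```python
# Sketch: model selection vs. weighted model averaging for one
# binary prediction P(class = 1).

def select_best(predictions, weights):
    """Model selection: keep only the highest-weight model's prediction."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return predictions[best]

def model_average(predictions, weights):
    """Model averaging: combine all predictions, weighted by model score."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

# Three hypothetical models' probabilities for one test case, with
# unnormalized posterior-like weights.
preds = [0.9, 0.4, 0.3]
weights = [0.5, 0.3, 0.2]
print(select_best(preds, weights))              # 0.9
print(round(model_average(preds, weights), 2))  # 0.63
```

The averaged prediction is pulled toward the dissenting models, which is one way combining models can yield the less over-confident uncertainty estimates the abstract describes.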
Affiliation(s)
- Jeya B. Balasubramanian
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA
- Shyam Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA; Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA
- Gregory F. Cooper
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA; Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA
- Vanathi Gopalakrishnan
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA; Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA
20
Maslove DM, Podchiyska T, Lowe HJ. Discretization of continuous features in clinical datasets. J Am Med Inform Assoc 2012; 20:544-53. [PMID: 23059731] [DOI: 10.1136/amiajnl-2012-000929]
Abstract
BACKGROUND: The increasing availability of clinical data from electronic medical records (EMRs) has created opportunities for secondary uses of health information. When used in machine learning classification, many data features must first be transformed by discretization.
OBJECTIVE: To evaluate six discretization strategies, both supervised and unsupervised, using EMR data.
MATERIALS AND METHODS: We classified laboratory data (arterial blood gas (ABG) measurements) and physiologic data (cardiac output (CO) measurements) derived from adult patients in the intensive care unit using decision trees and naïve Bayes classifiers. Continuous features were partitioned using two supervised and four unsupervised discretization strategies. The resulting classification accuracy was compared with that obtained with the original, continuous data.
RESULTS: Supervised methods were more accurate and consistent than unsupervised methods, but tended to produce larger decision trees. Among the unsupervised methods, equal frequency and k-means performed well overall, while equal width was significantly less accurate.
DISCUSSION: This is, we believe, the first dedicated evaluation of discretization strategies using EMR data. It is unlikely that any one discretization method applies universally to EMR data. Performance was influenced by the choice of class labels and, in the case of unsupervised methods, the number of intervals. In selecting the number of intervals there is generally a trade-off between greater accuracy and greater consistency.
CONCLUSIONS: In general, supervised methods yield higher accuracy but are constrained to a single specific application. Unsupervised methods do not require class labels and can produce discretized data that can be used for multiple purposes.
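Two of the unsupervised strategies compared in this abstract, equal-width and equal-frequency binning, can be sketched in plain Python. The toy data below (skewed, like many laboratory values) suggests one reason equal width can perform worse: outliers dominate the bin boundaries. All values are invented.

```python
# Sketch of two unsupervised discretization strategies from the abstract.

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign roughly len(values)/k points to each interval."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

values = [1, 2, 3, 4, 100, 101]  # skewed distribution
print(equal_width_bins(values, 2))      # [0, 0, 0, 0, 1, 1]
print(equal_frequency_bins(values, 2))  # [0, 0, 0, 1, 1, 1]
```

With equal width, the two outliers compress most points into one bin; equal frequency keeps the bin counts balanced, consistent with the abstract's finding that it performed well overall.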
Affiliation(s)
- David M Maslove
- Center for Clinical Informatics, Stanford University School of Medicine, Stanford, CA 94305, USA.
21
Lustgarten JL, Gopalakrishnan V, Grover H, Visweswaran S. Improving classification performance with discretization on biomedical datasets. AMIA Annu Symp Proc 2008; 2008:445-9. [PMID: 18999186] [PMCID: PMC2656082]
Abstract
Discretization acts as a variable selection method in addition to transforming the continuous values of a variable to discrete ones. Machine learning algorithms such as Support Vector Machines and Random Forests have been used for classification in high-dimensional genomic and proteomic data because of their robustness to the dimensionality of the data. We show that discretization can significantly improve the classification performance of these algorithms, as well as of algorithms such as Naïve Bayes that are sensitive to the dimensionality of the data.