26
|
Cai C, Cooper GF, Lu KN, Ma X, Xu S, Zhao Z, Chen X, Xue Y, Lee AV, Clark N, Chen V, Lu S, Chen L, Yu L, Hochheiser HS, Jiang X, Wang QJ, Lu X. Systematic discovery of the functional impact of somatic genome alterations in individual tumors through tumor-specific causal inference. PLoS Comput Biol 2019; 15:e1007088. [PMID: 31276486 PMCID: PMC6650088 DOI: 10.1371/journal.pcbi.1007088] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/25/2019] [Revised: 07/23/2019] [Accepted: 05/09/2019] [Indexed: 02/07/2023] Open
Abstract
Cancer is mainly caused by somatic genome alterations (SGAs). Precision oncology involves identifying and targeting tumor-specific aberrations resulting from causative SGAs. We developed a novel tumor-specific computational framework that finds the likely causative SGAs in an individual tumor and estimates their impact on oncogenic processes, which suggests the disease mechanisms that are acting in that tumor. This information can be used to guide precision oncology. We report a tumor-specific causal inference (TCI) framework, which estimates causative SGAs by modeling causal relationships between SGAs and molecular phenotypes (e.g., transcriptomic, proteomic, or metabolomic changes) within an individual tumor. We applied the TCI algorithm to tumors from The Cancer Genome Atlas (TCGA) and estimated for each tumor the SGAs that causally regulate the differentially expressed genes (DEGs) in that tumor. Overall, TCI identified 634 SGAs that are predicted to cause cancer-related DEGs in a significant number of tumors, including most of the previously known drivers and many novel candidate cancer drivers. The inferred causal relationships are statistically robust and biologically sensible, and multiple lines of experimental evidence support the predicted functional impact of both the well-known and the novel candidate drivers that are predicted by TCI. TCI provides a unified framework that integrates multiple types of SGAs and molecular phenotypes to estimate which genome perturbations are causally influencing one or more molecular/cellular phenotypes in an individual tumor. By identifying major candidate drivers and revealing their functional impact in an individual tumor, TCI sheds light on the disease mechanisms of that tumor, which can serve to advance our basic knowledge of cancer biology and to support precision oncology that provides tailored treatment of individual tumors.
|
27
|
King AJ, Cooper GF, Hochheiser H, Clermont G, Hauskrecht M, Visweswaran S. Using Machine Learning to Predict the Information Seeking Behavior of Clinicians Using an Electronic Medical Record System. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:673-682. [PMID: 30815109 PMCID: PMC6371238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Poor electronic medical record (EMR) usability is detrimental to both clinicians and patients. A better EMR would provide concise, context sensitive patient data, but doing so entails the difficult task of knowing which data are relevant. To determine the relevance of patient data in different contexts, we collect and model the information seeking behavior of clinicians using a learning EMR (LEMR) system. Sufficient data were collected to train predictive models for 80 different targets (e.g., glucose level, heparin administration) and 27 of them had AUROC values of greater than 0.7. These results are encouraging considering the high variation in information seeking behavior (intraclass correlation 0.40). We plan to apply these models to a new set of patient cases and adapt the LEMR interface to highlight relevant patient data, and thus provide concise, context sensitive data.
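As a reading aid for the AUROC values quoted in these abstracts, the following is a minimal sketch of how an AUROC can be computed from classifier scores: it is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the toy data are assumptions for illustration; this is not the evaluation code used in the study.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs (higher means more likely positive).
    labels: 1 for positive cases, 0 for negative cases.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four cases, two positive and two negative.
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of cases; it is chosen here for clarity rather than speed.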
|
28
|
Jabbari F, Visweswaran S, Cooper GF. Instance-Specific Bayesian Network Structure Learning. PROCEEDINGS OF MACHINE LEARNING RESEARCH 2018; 72:169-180. [PMID: 30775723 PMCID: PMC6376975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Bayesian network (BN) structure learning algorithms are almost always designed to recover the structure that models the relationships that are shared by the instances in a population. While accurately learning such population-wide Bayesian networks is useful, learning Bayesian networks that are specific to each instance is often important as well. For example, to understand and treat a patient (instance), it is critical to understand the specific causal mechanisms that are operating in that particular patient. We introduce an instance-specific BN structure learning method that searches the space of Bayesian networks to build a model that is specific to an instance by guiding the search based on attributes of the given instance (e.g., patient symptoms, signs, lab results, and genotype). The structure discovery performance of the proposed method is compared to an existing state-of-the-art BN structure learning method, namely an implementation of the Greedy Equivalence Search algorithm called FGES, using both simulated and real data. The results show that the proposed method improves the precision of the model structure that is output, when compared to GES, especially for those variables that exhibit context-specific independence.
|
29
|
Ding MQ, Chen L, Cooper GF, Young JD, Lu X. Precision Oncology beyond Targeted Therapy: Combining Omics Data with Machine Learning Matches the Majority of Cancer Cells to Effective Therapeutics. Mol Cancer Res 2017; 16:269-278. [PMID: 29133589 DOI: 10.1158/1541-7786.mcr-17-0378] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Received: 07/14/2017] [Revised: 10/02/2017] [Accepted: 11/02/2017] [Indexed: 02/06/2023]
Abstract
Precision oncology involves identifying drugs that will effectively treat a tumor and then prescribing an optimal clinical treatment regimen. However, most first-line chemotherapy drugs do not have biomarkers to guide their application. For molecularly targeted drugs, using the genomic status of a drug target as a therapeutic indicator has limitations. In this study, machine learning methods (e.g., deep learning) were used to identify informative features from genome-scale omics data and to train classifiers for predicting the effectiveness of drugs in cancer cell lines. The methodology introduced here can accurately predict the efficacy of drugs, regardless of whether they are molecularly targeted or nonspecific chemotherapy drugs. This approach, on a per-drug basis, can identify sensitive cancer cells with an average sensitivity of 0.82 and specificity of 0.82; on a per-cell line basis, it can identify effective drugs with an average sensitivity of 0.80 and specificity of 0.82. This report describes a data-driven precision medicine approach that is not only generalizable but also optimizes therapeutic efficacy. The framework detailed herein, when successfully translated to clinical environments, could significantly broaden the scope of precision oncology beyond targeted therapies, benefiting an expanded proportion of cancer patients. Mol Cancer Res; 16(2); 269-78. ©2017 AACR.
|
30
|
Aronis JM, Millett NE, Wagner MM, Tsui F, Ye Y, Ferraro JP, Haug PJ, Gesteland PH, Cooper GF. A Bayesian system to detect and characterize overlapping outbreaks. J Biomed Inform 2017; 73:171-181. [PMID: 28797710 PMCID: PMC5604259 DOI: 10.1016/j.jbi.2017.08.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/28/2016] [Revised: 07/04/2017] [Accepted: 08/04/2017] [Indexed: 10/19/2022]
Abstract
Outbreaks of infectious diseases such as influenza are a significant threat to human health. Because there are different strains of influenza which can cause independent outbreaks, and influenza can affect demographic groups at different rates and times, there is a need to recognize and characterize multiple outbreaks of influenza. This paper describes a Bayesian system that uses data from emergency department patient care reports to create epidemiological models of overlapping outbreaks of influenza. Clinical findings are extracted from patient care reports using natural language processing. These findings are analyzed by a case detection system to create disease likelihoods that are passed to a multiple outbreak detection system. We evaluated the system using real and simulated outbreaks. The results show that this approach can recognize and characterize overlapping outbreaks of influenza. We describe several extensions that appear promising.
|
31
|
King AJ, Hochheiser H, Visweswaran S, Clermont G, Cooper GF. Eye-tracking for clinical decision support: A method to capture automatically what physicians are viewing in the EMR. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2017; 2017:512-521. [PMID: 28815151 PMCID: PMC5543363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Eye-tracking is a valuable research tool that is used in laboratory and limited field environments. We take steps toward developing methods that enable widespread adoption of eye-tracking and its real-time application in clinical decision support. Eye-tracking will enhance awareness and enable intelligent views, more precise alerts, and other forms of decision support in the Electronic Medical Record (EMR). We evaluated a low-cost eye-tracking device and found the device's accuracy to be non-inferior to a more expensive device. We also developed and evaluated an automatic method for mapping eye-tracking data to interface elements in the EMR (e.g., a displayed laboratory test value). Mapping was 88% accurate across the six participants in our experiment. Finally, we piloted the use of the low-cost device and the automatic mapping method to label training data for a Learning EMR (LEMR) which is a system that highlights the EMR elements a physician is predicted to use.
|
32
|
Ferraro JP, Ye Y, Gesteland PH, Haug PJ, Tsui FR, Cooper GF, Van Bree R, Ginter T, Nowalk AJ, Wagner M. The effects of natural language processing on cross-institutional portability of influenza case detection for disease surveillance. Appl Clin Inform 2017; 8:560-580. [PMID: 28561130 PMCID: PMC6241736 DOI: 10.4338/aci-2016-12-ra-0211] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/31/2016] [Accepted: 03/11/2017] [Indexed: 11/23/2022] Open
Abstract
OBJECTIVES This study evaluates the accuracy and portability of a natural language processing (NLP) tool for extracting clinical findings of influenza from clinical notes across two large healthcare systems. Effectiveness is evaluated on how well NLP supports downstream influenza case-detection for disease surveillance. METHODS We independently developed two NLP parsers, one at Intermountain Healthcare (IH) in Utah and the other at University of Pittsburgh Medical Center (UPMC) using local clinical notes from emergency department (ED) encounters of influenza. We measured NLP parser performance for the presence and absence of 70 clinical findings indicative of influenza. We then developed Bayesian network models from NLP processed reports and tested their ability to discriminate among cases of (1) influenza, (2) non-influenza influenza-like illness (NI-ILI), and (3) 'other' diagnosis. RESULTS On Intermountain Healthcare reports, recall and precision of the IH NLP parser were 0.71 and 0.75, respectively, and UPMC NLP parser, 0.67 and 0.79. On University of Pittsburgh Medical Center reports, recall and precision of the UPMC NLP parser were 0.73 and 0.80, respectively, and IH NLP parser, 0.53 and 0.80. Bayesian case-detection performance measured by AUROC for influenza versus non-influenza on Intermountain Healthcare cases was 0.93 (using IH NLP parser) and 0.93 (using UPMC NLP parser). Case-detection on University of Pittsburgh Medical Center cases was 0.95 (using UPMC NLP parser) and 0.83 (using IH NLP parser). For influenza versus NI-ILI on Intermountain Healthcare cases performance was 0.70 (using IH NLP parser) and 0.76 (using UPMC NLP parser). On University of Pittsburgh Medical Center cases, 0.76 (using UPMC NLP parser) and 0.65 (using IH NLP parser). CONCLUSION In all but one instance (influenza versus NI-ILI using IH cases), local parsers were more effective at supporting case-detection although performances of non-local parsers were reasonable.
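The recall and precision figures reported above follow the standard definitions, which can be sketched as below. The function name and the example finding labels are hypothetical and serve only to illustrate the arithmetic; the study's own scoring procedure is not described in this listing.

```python
def precision_recall(predicted, actual):
    """Precision and recall of a set of extracted findings.

    predicted: set of findings the parser reported.
    actual:    set of findings in the reference annotation.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical findings for a single emergency department note.
parser_output = {"fever", "cough", "myalgia", "headache"}
reference = {"fever", "cough", "chills"}
print(precision_recall(parser_output, reference))
```

Precision falls when the parser reports findings that are absent from the reference, and recall falls when it misses findings that are present.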
|
33
|
Ye Y, Wagner MM, Cooper GF, Ferraro JP, Su H, Gesteland PH, Haug PJ, Millett NE, Aronis JM, Nowalk AJ, Ruiz VM, López Pineda A, Shi L, Van Bree R, Ginter T, Tsui F. A study of the transferability of influenza case detection systems between two large healthcare systems. PLoS One 2017; 12:e0174970. [PMID: 28380048 PMCID: PMC5381795 DOI: 10.1371/journal.pone.0174970] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 09/27/2016] [Accepted: 03/17/2017] [Indexed: 01/16/2023] Open
Abstract
Objectives This study evaluates the accuracy and transferability of Bayesian case detection systems (BCD) that use clinical notes from emergency department (ED) to detect influenza cases. Methods A BCD uses natural language processing (NLP) to infer the presence or absence of clinical findings from ED notes, which are fed into a Bayesian network classifier (BN) to infer patients’ diagnoses. We developed BCDs at the University of Pittsburgh Medical Center (BCDUPMC) and Intermountain Healthcare in Utah (BCDIH). At each site, we manually built a rule-based NLP and trained a Bayesian network classifier from over 40,000 ED encounters between Jan. 2008 and May 2010 using feature selection, machine learning, and an expert debiasing approach. Transferability of a BCD in this study may be impacted by seven factors: development (source) institution, development parser, application (target) institution, application parser, NLP transfer, BN transfer, and classification task. We employed an ANOVA analysis to study their impacts on BCD performance. Results Both BCDs discriminated well between influenza and non-influenza on local test cases (AUCs > 0.92). When tested for transferability using the other institution’s cases, BCDUPMC discriminations declined minimally (AUC decreased from 0.95 to 0.94, p<0.01), and BCDIH discriminations declined more (from 0.93 to 0.87, p<0.0001). We attributed the BCDIH decline to the lower recall of the IH parser on UPMC notes. The ANOVA analysis showed five significant factors: development parser, application institution, application parser, BN transfer, and classification task. Conclusion We demonstrated high influenza case detection performance in two large healthcare systems in two geographically separated regions, providing evidentiary support for the use of automated case detection from routinely collected electronic clinical notes in national influenza surveillance. The transferability could be improved by training the Bayesian network classifier locally and increasing the accuracy of the NLP parser.
|
34
|
Naeini MP, Cooper GF. Binary Classifier Calibration Using an Ensemble of Linear Trend Estimation. PROCEEDINGS OF THE ... SIAM INTERNATIONAL CONFERENCE ON DATA MINING. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2017; 2016:261-269. [PMID: 28357158 DOI: 10.1137/1.9781611974348.30] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
Abstract
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of linear trend estimation (ELiTE). ELiTE utilizes the recently proposed ℓ1 trend filtering signal approximation method [22] to find the mapping from uncalibrated classification scores to the calibrated probability estimates. ELiTE is designed to address the key limitations of the histogram binning-based calibration methods which are (1) the use of a piecewise constant form of the calibration mapping using bins, and (2) the assumption of independence of predicted probabilities for the instances that are located in different bins. The method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus, it can be applied with many existing classification models. We demonstrate the performance of ELiTE on real datasets for commonly used binary classification models. Experimental results show that the method outperforms several common binary-classifier calibration methods. In particular, ELiTE commonly performs statistically significantly better than the other methods, and never worse. Moreover, it is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is practically O(N log N) time, where N is the number of samples.
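To make limitation (1) concrete, here is a minimal sketch of plain histogram-binning calibration, the baseline family that the abstract contrasts ELiTE with. It is not ELiTE itself. The equal-width bins, the midpoint fallback for empty bins, and all names are assumptions chosen for illustration.

```python
def fit_histogram_binning(scores, labels, n_bins=10):
    """Learn one calibrated probability per equal-width score bin on [0, 1]."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[b] += 1
        positives[b] += y
    # Calibrated value is the observed positive rate in the bin;
    # empty bins fall back to the bin midpoint (an arbitrary assumption).
    return [
        positives[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]


def calibrate(bin_probs, score):
    """Map an uncalibrated score to the probability stored for its bin."""
    n_bins = len(bin_probs)
    return bin_probs[min(int(score * n_bins), n_bins - 1)]


# Hypothetical uncalibrated scores and outcomes, calibrated with two bins.
bin_probs = fit_histogram_binning(
    [0.1, 0.2, 0.3, 0.7, 0.8, 0.9], [0, 0, 1, 1, 1, 0], n_bins=2
)
print(calibrate(bin_probs, 0.25), calibrate(bin_probs, 0.75))
```

Every score that falls in the same bin receives the same calibrated value, so the learned mapping is a step function. That is the piecewise constant form named in the abstract.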
|
35
|
Naeini MP, Cooper GF. Binary Classifier Calibration using an Ensemble of Near Isotonic Regression Models. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON DATA MINING 2017; 2016:360-369. [PMID: 28316511 PMCID: PMC5351887 DOI: 10.1109/icdm.2016.0047] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 11/07/2022]
Abstract
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of near isotonic regression (ENIR). The method can be considered as an extension of BBQ [20], a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression (IsoRegC) [27]. ENIR is designed to address the key limitation of IsoRegC which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be used with many existing classification models to generate accurate probabilistic predictions. We demonstrate the performance of ENIR on synthetic and real datasets for commonly applied binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is O(N log N) time, where N is the number of samples.
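For context on the monotonicity assumption mentioned above, the following is a minimal sketch of ordinary isotonic-regression calibration using the pool-adjacent-violators algorithm. It illustrates the baseline that ENIR relaxes, not ENIR itself. Function names are assumptions, and tied scores are not pooled, which a fuller implementation would handle.

```python
def pav(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every member of a block gets the block mean
    return fitted


def isotonic_calibration(scores, labels):
    """Return (score, calibrated probability) pairs sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pav([labels[i] for i in order])
    return [(scores[i], p) for i, p in zip(order, fitted)]


# Hypothetical classifier scores with binary outcomes.
print(isotonic_calibration([0.8, 0.1, 0.4, 0.6], [1, 0, 1, 0]))
```

The fitted probabilities never decrease as the score increases. That constraint is what the abstract describes as the key limitation of IsoRegC.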
|
36
|
Lu S, Cai C, Yan G, Zhou Z, Wan Y, Chen V, Chen L, Cooper GF, Obeid LM, Hannun YA, Lee AV, Lu X. Signal-Oriented Pathway Analyses Reveal a Signaling Complex as a Synthetic Lethal Target for p53 Mutations. Cancer Res 2016; 76:6785-6794. [PMID: 27758891 DOI: 10.1158/0008-5472.can-16-1740] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/27/2016] [Revised: 08/31/2016] [Accepted: 09/18/2016] [Indexed: 11/16/2022]
Abstract
Defining processes that are synthetic lethal with p53 mutations in cancer cells may reveal possible therapeutic strategies. In this study, we report the development of a signal-oriented computational framework for cancer pathway discovery in this context. We applied our bipartite graph-based functional module discovery algorithm to identify transcriptomic modules abnormally expressed in multiple tumors, such that the genes in a module were likely regulated by a common, perturbed signal. For each transcriptomic module, we applied our weighted k-path merge algorithm to search for a set of somatic genome alterations (SGA) that likely perturbed the signal, that is, the candidate members of the pathway that regulate the transcriptomic module. Computational evaluations indicated that our methods-identified pathways were perturbed by SGA. In particular, our analyses revealed that SGA affecting TP53, PTK2, YWHAZ, and MED1 perturbed a set of signals that promote cell proliferation, anchor-free colony formation, and epithelial-mesenchymal transition (EMT). These proteins formed a signaling complex that mediates these oncogenic processes in a coordinated fashion. Disruption of this signaling complex by knocking down PTK2, YWHAZ, or MED1 attenuated and reversed oncogenic phenotypes caused by mutant p53 in a synthetic lethal manner. This signal-oriented framework for searching pathways and therapeutic targets is applicable to all cancer types, thus potentially impacting precision medicine in cancer. Cancer Res; 76(23); 6785-94. ©2016 AACR.
|
37
|
López Pineda A, Ye Y, Visweswaran S, Cooper GF, Wagner MM, Tsui FR. Comparison of machine learning classifiers for influenza detection from emergency department free-text reports. J Biomed Inform 2015; 58:60-69. [PMID: 26385375 PMCID: PMC4684714 DOI: 10.1016/j.jbi.2015.08.019] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/02/2015] [Revised: 05/28/2015] [Accepted: 08/21/2015] [Indexed: 12/31/2022]
Abstract
Influenza is a yearly recurrent disease that has the potential to become a pandemic. An effective biosurveillance system is required for early detection of the disease. In our previous studies, we have shown that electronic Emergency Department (ED) free-text reports can be of value to improve influenza detection in real time. This paper studies seven machine learning (ML) classifiers for influenza detection, compares their diagnostic capabilities against an expert-built influenza Bayesian classifier, and evaluates different ways of handling missing clinical information from the free-text reports. We identified 31,268 ED reports from 4 hospitals between 2008 and 2011 to form two different datasets: training (468 cases, 29,004 controls), and test (176 cases and 1620 controls). We employed Topaz, a natural language processing (NLP) tool, to extract influenza-related findings and to encode them into one of three values: Acute, Non-acute, and Missing. Results show that all ML classifiers had areas under ROCs (AUC) ranging from 0.88 to 0.93, and performed significantly better than the expert-built Bayesian model. Missing clinical information marked as a value of missing (not missing at random) had a consistently improved performance among 3 (out of 4) ML classifiers when it was compared with the configuration of not assigning a value of missing (missing completely at random). The case/control ratios did not affect the classification performance given the large number of training cases. Our study demonstrates ED reports in conjunction with the use of ML and NLP with the handling of missing value information have a great potential for the detection of infectious diseases.
|
38
|
King AJ, Cooper GF, Hochheiser H, Clermont G, Visweswaran S. Development and Preliminary Evaluation of a Prototype of a Learning Electronic Medical Record System. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:1967-1975. [PMID: 26958296 PMCID: PMC4765593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Electronic medical records (EMRs) are capturing increasing amounts of data per patient. For clinicians to efficiently and accurately understand a patient's clinical state, better ways are needed to determine when and how to display EMR data. We built a prototype system that records how physicians view EMR data, which we used to train models that predict which EMR data will be relevant in a given patient. We call this approach a Learning EMR (LEMR). A physician used the prototype to review 59 intensive care unit (ICU) patient cases. We used the data-access patterns from these cases to train logistic regression models that, when evaluated, had AUROC values as high as 0.92 and that averaged 0.73, supporting that the approach is promising. A preliminary usability study identified advantages of the system and a few concerns about implementation. Overall, 3 of 4 ICU physicians were enthusiastic about features of the prototype.
|
39
|
Cooper GF, Bahar I, Becich MJ, Benos PV, Berg J, Espino JU, Glymour C, Jacobson RC, Kienholz M, Lee AV, Lu X, Scheines R. The center for causal discovery of biomedical knowledge from big data. J Am Med Inform Assoc 2015; 22:1132-6. [PMID: 26138794 PMCID: PMC5009908 DOI: 10.1093/jamia/ocv059] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 02/18/2015] [Revised: 04/27/2015] [Accepted: 05/02/2015] [Indexed: 01/12/2023] Open
Abstract
The Big Data to Knowledge (BD2K) Center for Causal Discovery is developing and disseminating an integrated set of open source tools that support causal modeling and discovery of biomedical knowledge from large and complex biomedical datasets. The Center integrates teams of biomedical and data scientists focused on the refinement of existing and the development of new constraint-based and Bayesian algorithms based on causal Bayesian networks, the optimization of software for efficient operation in a supercomputing environment, and the testing of algorithms and software developed using real data from 3 representative driving biomedical projects: cancer driver mutations, lung disease, and the functional connectome of the human brain. Associated training activities provide both biomedical and data scientists with the knowledge and skills needed to apply and extend these tools. Collaborative activities with the BD2K Consortium further advance causal discovery tools and integrate tools and resources developed by other centers.
|
40
|
Naeini MP, Cooper GF, Hauskrecht M. Binary Classifier Calibration Using a Bayesian Non-Parametric Approach. PROCEEDINGS OF THE ... SIAM INTERNATIONAL CONFERENCE ON DATA MINING. SIAM INTERNATIONAL CONFERENCE ON DATA MINING 2015; 2015:208-216. [PMID: 26613068 DOI: 10.1137/1.9781611974010.24] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/20/2022]
Abstract
Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in data mining. This paper presents two new non-parametric methods for calibrating outputs of binary classification models: a method based on the Bayes optimal selection and a method based on the Bayesian model averaging. The advantage of these methods is that they are independent of the algorithm used to learn a predictive model, and they can be applied in a post-processing step, after the model is learned. This makes them applicable to a wide variety of machine learning models and methods. These calibration methods, as well as other methods, are tested on a variety of datasets in terms of both discrimination and calibration performance. The results show the methods either outperform or are comparable in performance to the state-of-the-art calibration methods.
|
41
|
Visweswaran S, Ferreira A, Ribeiro GA, Oliveira AC, Cooper GF. Personalized Modeling for Prediction with Decision-Path Models. PLoS One 2015; 10:e0131022. [PMID: 26098570 PMCID: PMC4476684 DOI: 10.1371/journal.pone.0131022] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/04/2015] [Accepted: 05/26/2015] [Indexed: 11/25/2022] Open
Abstract
Deriving predictive models in medicine typically relies on a population approach where a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model, called a decision-path model, that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods to that of Classification And Regression Tree (CART), a population decision tree, in predicting seven different outcomes in five medical datasets. Two of the three personalized methods performed statistically significantly better on area under the ROC curve (AUC) and Brier skill score compared to CART. The personalized approach of learning decision-path models is a new approach for predictive modeling that can perform better than a population approach.
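The Brier skill score mentioned above compares a model's Brier score with that of a reference forecast. The sketch below uses the common convention of taking the observed base rate as the reference; whether the paper used this reference is an assumption, and the function names are illustrative only.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def brier_skill_score(probs, outcomes):
    """1 - BS_model / BS_reference, with the base rate as the reference forecast.

    A value of 1 is a perfect forecast, 0 matches the reference,
    and negative values are worse than the reference.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0.0:
        raise ValueError("skill score is undefined when all outcomes are identical")
    return 1.0 - brier_score(probs, outcomes) / reference


# Hypothetical predicted probabilities and observed outcomes.
print(brier_skill_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0]))
```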
|
42
|
Naeini MP, Cooper GF, Hauskrecht M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. PROCEEDINGS OF THE ... AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE. AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE 2015; 2015:2901-2907. [PMID: 25927013 PMCID: PMC4410090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Learning probabilistic predictive models that are well calibrated is critical for many prediction and decision-making tasks in artificial intelligence. In this paper we present a new non-parametric calibration method called Bayesian Binning into Quantiles (BBQ) which addresses key limitations of existing calibration methods. The method post-processes the output of a binary classification algorithm; thus, it can be readily combined with many existing classification algorithms. The method is computationally tractable, and empirically accurate, as evidenced by the set of experiments reported here on both real and simulated datasets.
|
43
|
Avali VR, Cooper GF, Gopalakrishnan V. Application of Bayesian logistic regression to mining biomedical data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2014; 2014:266-273. [PMID: 25954328 PMCID: PMC4419893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Mining high-dimensional biomedical data with existing classifiers is challenging and the predictions are often inaccurate. We investigated the use of Bayesian Logistic Regression (B-LR) for mining such data to predict and classify various disease conditions. The analysis was done on twelve biomedical datasets with binary class variables, and the performance of B-LR was compared to that of other popular classifiers on these datasets with 10-fold cross-validation using the WEKA data mining toolkit. The statistical significance of the results was analyzed by paired two-tailed t-tests and non-parametric Wilcoxon signed-rank tests. We observed overall that B-LR with non-informative Gaussian priors performed on par with other classifiers in terms of accuracy, balanced accuracy and AUC. These results suggest that it is worthwhile to explore the application of B-LR to predictive modeling tasks in bioinformatics using informative biological prior probabilities. With informative prior probabilities, we conjecture that the performance of B-LR will improve.
|
44
|
Jiang X, Cai B, Xue D, Lu X, Cooper GF, Neapolitan RE. A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets. J Am Med Inform Assoc 2014; 21:e312-9. [PMID: 24737607 PMCID: PMC4173174 DOI: 10.1136/amiajnl-2013-002358] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 09/14/2013] [Revised: 02/20/2014] [Accepted: 03/14/2014] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVE: The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. METHOD: We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use one hundred simulated datasets of 1000 single nucleotide polymorphisms (SNPs) each, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. RESULTS: In fivefold cross-validation studies, the SVM performed best on the 1000-SNP datasets, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. DISCUSSION: EBMC performed better than NB when there were several strong predictors, whereas NB performed better when there were many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. CONCLUSIONS: Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.
|
45
|
Cooper GF, Villamarin R, Rich Tsui FC, Millett N, Espino JU, Wagner MM. A method for detecting and characterizing outbreaks of infectious disease from clinical reports. J Biomed Inform 2014; 53:15-26. [PMID: 25181466 DOI: 10.1016/j.jbi.2014.08.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 04/05/2014] [Revised: 08/04/2014] [Accepted: 08/22/2014] [Indexed: 11/30/2022]
Abstract
Outbreaks of infectious disease can pose a significant threat to human health. Thus, detecting and characterizing outbreaks quickly and accurately remains an important problem. This paper describes a Bayesian framework that links clinical diagnosis of individuals in a population to epidemiological modeling of disease outbreaks in the population. Computer-based diagnosis of individuals who seek healthcare is used to guide the search for epidemiological models of population disease that explain the pattern of diagnoses well. We applied this framework to develop a system that detects influenza outbreaks from emergency department (ED) reports. The system diagnoses influenza in individuals probabilistically from evidence that is extracted from ED reports using natural language processing. These diagnoses guide the search for epidemiological models of influenza that explain the pattern of diagnoses well. Those epidemiological models with a high posterior probability determine the most likely outbreaks of specific diseases; the models are also used to characterize properties of an outbreak, such as its expected peak day and estimated size. We evaluated the method using both simulated data and data from a real influenza outbreak. The results provide support that the approach can detect and characterize outbreaks early and well enough to be valuable. We describe several extensions to the approach that appear promising.
|
46
|
Balasubramanian JB, Visweswaran S, Cooper GF, Gopalakrishnan V. Selective model averaging with Bayesian rule learning for predictive biomedicine. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2014; 2014:17-22. [PMID: 25717394 PMCID: PMC4333697] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Accurate disease classification and biomarker discovery remain challenging tasks in biomedicine. In this paper, we develop and test a practical approach to combining evidence from multiple models when making predictions using selective Bayesian model averaging of probabilistic rules. This method is implemented within a Bayesian Rule Learning system and compared to model selection when applied to twelve biomedical datasets using the area under the ROC curve measure of performance. Cross-validation results indicate that selective Bayesian model averaging statistically significantly outperforms model selection on average in these experiments, suggesting that combining predictions from multiple models may lead to more accurate quantification of classifier uncertainty. This approach would directly impact the generation of robust predictions on unseen test data, while also increasing knowledge for biomarker discovery and mechanisms that underlie disease.
|
47
|
Ferreira A, Cooper GF, Visweswaran S. Decision path models for patient-specific modeling of patient outcomes. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2013; 2013:413-421. [PMID: 24551347 PMCID: PMC3900188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
Patient-specific models are constructed to take advantage of the particular features of the patient case of interest, in contrast to commonly used population-wide models, which are constructed to perform well on average on all cases. We introduce two patient-specific algorithms that are based on the decision tree paradigm. These algorithms construct a decision path specific to each patient of interest, whereas standard algorithms construct a single population-wide decision tree with many paths that is applied to all patients of interest. We applied the patient-specific algorithms to predict five different outcomes in clinical datasets. Compared to the population-wide CART decision tree, the patient-specific decision path models had superior performance on area under the ROC curve (AUC) and comparable performance on balanced accuracy. Our results provide support for patient-specific algorithms being a promising approach for predicting clinical outcomes.
|
48
|
Montefusco DJ, Chen L, Matmati N, Lu S, Newcomb B, Cooper GF, Hannun YA, Lu X. Distinct signaling roles of ceramide species in yeast revealed through systematic perturbation and systems biology analyses. Sci Signal 2013; 6:rs14. [PMID: 24170935 PMCID: PMC3974757 DOI: 10.1126/scisignal.2004515] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 12/14/2022]
Abstract
Ceramide, the central molecule of sphingolipid metabolism, is an important bioactive molecule that participates in various cellular regulatory events and that has been implicated in disease. Deciphering ceramide signaling is challenging because multiple ceramide species exist, and many of them may have distinct functions. We applied systems biology and molecular approaches to perturb ceramide metabolism in the yeast Saccharomyces cerevisiae and inferred causal relationships between ceramide species and their potential targets by combining lipidomic, genomic, and transcriptomic analyses. We found that during heat stress, distinct metabolic mechanisms controlled the abundance of different groups of ceramide species and provided experimental support for the importance of the dihydroceramidase Ydc1 in mediating the decrease in dihydroceramides during heat stress. Additionally, distinct groups of ceramide species, with different N-acyl chains and hydroxylations, regulated different sets of functionally related genes, indicating that the structural complexity of these lipids produces functional diversity. The transcriptional modules that we identified provide a resource to begin to dissect the specific functions of ceramides.
|
49
|
Batal I, Valizadegan H, Cooper GF, Hauskrecht M. A Temporal Pattern Mining Approach for Classifying Electronic Health Record Data. ACM T INTEL SYST TEC 2013; 4:10.1145/2508037.2508044. [PMID: 25309815 PMCID: PMC4192602 DOI: 10.1145/2508037.2508044] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Received: 12/01/2011] [Accepted: 08/01/2013] [Indexed: 10/26/2022]
Abstract
We study the problem of learning classification models from complex multivariate temporal data encountered in electronic health record systems. The challenge is to define a good set of features that represent the temporal aspect of the data well. Our method relies on temporal abstractions and temporal pattern mining to extract the classification features. Temporal pattern mining usually returns a large number of temporal patterns, most of which may be irrelevant to the classification task. To address this problem, we present the Minimal Predictive Temporal Patterns framework to generate a small set of predictive and non-spurious patterns. We apply our approach to the real-world clinical task of predicting patients who are at risk of developing heparin-induced thrombocytopenia. The results demonstrate the benefit of our approach in efficiently learning accurate classifiers, which is a key step for developing intelligent clinical monitoring systems.
|
50
|
Pineda AL, Tsui FC, Visweswaran S, Cooper GF. Detection of Patients with Influenza Syndrome Using Machine-Learning Models Learned from Emergency Department Reports. Online J Public Health Inform 2013. [PMCID: PMC3692886] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022] Open
Abstract
Objective: To compare seven machine-learning algorithms with an expert-constructed Bayesian network on the detection of patients with influenza syndrome. Introduction: Early detection of influenza outbreaks is critical to public health officials. Case detection is the foundation for outbreak detection. A previous study by Elkin et al. demonstrated that using individual emergency department (ED) reports can detect influenza cases better than using chief complaints [1]. Our recent study using ED reports processed by Bayesian networks (with an expert-constructed network structure) showed high accuracy in detecting influenza cases [2]. Methods: The dataset used in this study includes 182 ED reports with confirmed PCR influenza tests (Jan 1, 2007–Dec 31, 2009) and 40853 ED reports as control cases from 8 EDs in UPMC (Jul 1, 2010–Aug 31, 2010). All ED reports were deidentified by De-ID software with IRB approval. An NLP system, Topaz, was used to extract relevant findings and symptoms from the reports and encode them with UMLS concept unique identifier codes [2]. Two subsets were created: DS1-train (67% of cases) and DS1-test (remaining 33%). The algorithms used for training the models are: Naïve Bayes Classifier, Efficient Bayesian Multivariate Classification (EBMC) [3], Bayesian Network with K2 algorithm, Logistic Regression (LR), Support Vector Machine (SVM), Artificial Neural Networks (ANN) and Random Forest (RF). The predictive performance of each method was evaluated using the area under the receiver operating characteristic curve (AUROC) and the Hosmer-Lemeshow (HL) statistical significance test, which describes the lack of fit of the model to the dataset. Results: The evaluation results of all the models using DS1-test, including the AUROC, its confidence interval, p-value (between each algorithm and the expert) and the calibration with HL are shown in Table 1. Conclusions: All models achieved high AUROC values. The pairwise comparison of p-values in Table 1 demonstrates that the AUROCs of all the machine-learning models and the expert model were not significantly different. Nevertheless, EBMC is the best fitted. The model created by EBMC is shown in Figure 1. One limitation of the study is that the test dataset has low influenza prevalence, which may bias the detection algorithm performance. We are in the process of testing the algorithms using a higher prevalence rate. The same process could also be applied to other diseases to further research the generalizability of our method.
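The two evaluation measures named in the Methods can be sketched as follows. This is an illustrative computation under stated assumptions, not the code used in the study: AUROC is computed as the fraction of positive-negative pairs that are ranked correctly (ties count one half), and the Hosmer-Lemeshow statistic is computed over equal-size groups of cases sorted by predicted probability. The function names `auroc` and `hosmer_lemeshow` and the default of 10 groups are assumptions made for this example.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking of positives vs. negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(probs, labels, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic (larger means worse calibration)."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)  # expected number of positives
        observed = sum(y for _, y in chunk)  # observed number of positives
        mean_p = expected / size
        denom = size * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat
```

The statistic is conventionally compared against a chi-square distribution with `n_groups - 2` degrees of freedom to obtain a p-value; that step is omitted here to keep the sketch free of external dependencies.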
|