26
|
Cooper GF. Current research directions in the development of expert systems based on belief networks. Appl Stoch Models Data Anal 1989. [DOI: 10.1002/asm.3150050106] [Citation(s) in RCA: 38] [Impact Index Per Article: 1.1] [Indexed: 11/05/2022]
|
|
36 |
38 |
27
|
Chapman WW, Cooper GF, Hanbury P, Chapman BE, Harrison LH, Wagner MM. Creating a text classifier to detect radiology reports describing mediastinal findings associated with inhalational anthrax and other disorders. J Am Med Inform Assoc 2003; 10:494-503. [PMID: 12807805 PMCID: PMC212787 DOI: 10.1197/jamia.m1330] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Received: 01/21/2003] [Accepted: 05/13/2003] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE The aim of this study was to create a classifier for automatic detection of chest radiograph reports consistent with the mediastinal findings of inhalational anthrax. DESIGN The authors used the Identify Patient Sets (IPS) system to create a key word classifier for detecting reports describing mediastinal findings consistent with anthrax and compared their performances on a test set of 79,032 chest radiograph reports. MEASUREMENTS Area under the ROC curve was the main outcome measure of the IPS classifier. Sensitivity and specificity of an initial IPS model were calculated based on an existing key word search and were compared against a Boolean version of the IPS classifier. RESULTS The IPS classifier received an area under the ROC curve of 0.677 (90% CI = 0.628 to 0.772) with a specificity of 0.99 and maximum sensitivity of 0.35. The initial IPS model attained a specificity of 1.0 and a sensitivity of 0.04. CONCLUSION The IPS system is a useful tool for helping domain experts create a statistical key word classifier for textual reports that is a potentially useful component in surveillance of radiographic findings suspicious for anthrax.
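The evaluation described in this abstract (flagging reports by key words, then scoring the flags against reference labels) can be illustrated with a minimal Python sketch. It is not the IPS system: the keyword list, the function names, and the substring-matching rule are all assumptions made for illustration, since the abstract does not give the actual terms or model.

```python
# Hypothetical keyword list; the abstract does not state the terms used by IPS.
KEYWORDS = ("widened mediastinum", "mediastinal widening", "mediastinal lymphadenopathy")


def keyword_flag(report, keywords=KEYWORDS):
    """Return True if any keyword occurs in the report text (case-insensitive)."""
    text = report.lower()
    return any(keyword in text for keyword in keywords)


def sensitivity_specificity(predictions, labels):
    """Compute (sensitivity, specificity) from parallel lists of booleans."""
    tp = fn = tn = fp = 0
    for predicted, actual in zip(predictions, labels):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    # Guard against empty classes so the sketch never divides by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy reports and reference labels, invented for this example only.
reports = [
    "Widened mediastinum noted.",
    "Lungs are clear.",
    "Mediastinal lymphadenopathy present.",
    "Possible hilar fullness.",
]
labels = [True, False, True, True]
predictions = [keyword_flag(r) for r in reports]
sens, spec = sensitivity_specificity(predictions, labels)
```

A Boolean key word search of this kind tends to trade sensitivity for specificity, which is consistent with the pattern of high specificity and low sensitivity reported in the abstract.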
|
Evaluation Study |
22 |
36 |
28
|
Gopalakrishnan V, Lustgarten JL, Visweswaran S, Cooper GF. Bayesian rule learning for biomedical data mining. Bioinformatics 2010; 26:668-75. [PMID: 20080512 DOI: 10.1093/bioinformatics/btq005] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 12/16/2022]
Abstract
MOTIVATION Disease state prediction from biomarker profiling studies is an important problem because more accurate classification models will potentially lead to the discovery of better, more discriminative markers. Data mining methods are routinely applied to such analyses of biomedical datasets generated from high-throughput 'omic' technologies applied to clinical samples from tissues or bodily fluids. Past work has demonstrated that rule models can be successfully applied to this problem, since they can produce understandable models that facilitate review of discriminative biomarkers by biomedical scientists. While many rule-based methods produce rules that make predictions under uncertainty, they typically do not quantify the uncertainty in the validity of the rule itself. This article describes an approach that uses a Bayesian score to evaluate rule models. RESULTS We have combined the expressiveness of rules with the mathematical rigor of Bayesian networks (BNs) to develop and evaluate a Bayesian rule learning (BRL) system. This system utilizes a novel variant of the K2 algorithm for building BNs from the training data to provide probabilistic scores for IF-antecedent-THEN-consequent rules using heuristic best-first search. We then apply rule-based inference to evaluate the learned models during 10-fold cross-validation performed two times. The BRL system is evaluated on 24 published 'omic' datasets, and on average it performs on par or better than other readily available rule learning methods. Moreover, BRL produces models that contain on average 70% fewer variables, which means that the biomarker panels for disease prediction contain fewer markers for further verification and validation by bench scientists.
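For orientation, the standard K2 scoring metric that BRL is said to build on can be written as below. This is the textbook form of the score, not the novel variant used in the paper, whose details the abstract does not give. The symbols are the usual ones and are assumptions here: n is the number of variables, r_i the number of values of variable x_i, q_i the number of configurations of its parents, N_{ijk} the number of training cases in which x_i takes its k-th value while its parents take their j-th configuration, and N_{ij} = \sum_k N_{ijk}.

```latex
P(B_S, D) \;=\; P(B_S)\,\prod_{i=1}^{n}\prod_{j=1}^{q_i}
  \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!}\,\prod_{k=1}^{r_i} N_{ijk}!
```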
|
Research Support, U.S. Gov't, Non-P.H.S. |
15 |
32 |
29
|
Cooper GF, Miller RA. An experiment comparing lexical and statistical methods for extracting MeSH terms from clinical free text. J Am Med Inform Assoc 1998; 5:62-75. [PMID: 9452986 PMCID: PMC61276 DOI: 10.1136/jamia.1998.0050062] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.2] [Received: 08/06/1997] [Accepted: 09/17/1997] [Indexed: 02/06/2023] Open
Abstract
OBJECTIVE A primary goal of the University of Pittsburgh's 1990-94 UMLS-sponsored effort was to develop and evaluate PostDoc (a lexical indexing system) and Pindex (a statistical indexing system) comparatively, and then in combination as a hybrid system. Each system takes as input a portion of the free text from a narrative part of a patient's electronic medical record and returns a list of suggested MeSH terms to use in formulating a Medline search that includes concepts in the text. This paper describes the systems and reports an evaluation. The intent is for this evaluation to serve as a step toward the eventual realization of systems that assist healthcare personnel in using the electronic medical record to construct patient-specific searches of Medline. DESIGN The authors tested the performances of PostDoc, Pindex, and a hybrid system, using text taken from randomly selected clinical records, which were stratified to include six radiology reports, six pathology reports, and six discharge summaries. They identified concepts in the clinical records that might conceivably be used in performing a patient-specific Medline search. Each system was given the free text of each record as an input. The extent to which a system-derived list of MeSH terms captured the relevant concepts in these documents was determined based on blinded assessments by the authors. RESULTS PostDoc output a mean of approximately 19 MeSH terms per report, which included about 40% of the relevant report concepts. Pindex output a mean of approximately 57 terms per report and captured about 45% of the relevant report concepts. A hybrid system captured approximately 66% of the relevant concepts and output about 71 terms per report. CONCLUSION The outputs of PostDoc and Pindex are complementary in capturing MeSH terms from clinical free text. 
The results suggest possible approaches to reduce the number of terms output while maintaining the percentage of terms captured, including the use of UMLS semantic types to constrain the output list to contain only clinically relevant MeSH terms.
|
research-article |
27 |
32 |
30
|
Jiang X, Cooper GF. A Bayesian spatio-temporal method for disease outbreak detection. J Am Med Inform Assoc 2010; 17:462-71. [PMID: 20595315 PMCID: PMC2995651 DOI: 10.1136/jamia.2009.000356] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Received: 05/04/2009] [Accepted: 04/27/2010] [Indexed: 11/04/2022] Open
Abstract
A system that monitors a region for a disease outbreak is called a disease outbreak surveillance system. A spatial surveillance system searches for patterns of disease outbreak in spatial subregions of the monitored region. A temporal surveillance system looks for emerging patterns of outbreak disease by analyzing how patterns have changed during recent periods of time. If a non-spatial, non-temporal system could be converted to a spatio-temporal one, the performance of the system might be improved in terms of early detection, accuracy, and reliability. A Bayesian network framework is proposed for a class of space-time surveillance systems called BNST. The framework is applied to a non-spatial, non-temporal disease outbreak detection system called PC in order to create the spatio-temporal system called PCTS. Differences in the detection performance of PC and PCTS are examined. The results show that the spatio-temporal Bayesian approach performs well, relative to the non-spatial, non-temporal approach.
|
other |
15 |
30 |
31
|
Lustgarten JL, Visweswaran S, Gopalakrishnan V, Cooper GF. Application of an efficient Bayesian discretization method to biomedical data. BMC Bioinformatics 2011; 12:309. [PMID: 21798039 PMCID: PMC3162539 DOI: 10.1186/1471-2105-12-309] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 03/28/2011] [Accepted: 07/28/2011] [Indexed: 12/16/2022] Open
Abstract
Background Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
|
Research Support, U.S. Gov't, Non-P.H.S. |
14 |
29 |
32
|
Cooper GF, Fried J. Carbon-13 nuclear magnetic resonance spectra of prostaglandins and some prostaglandin analogs. Proc Natl Acad Sci U S A 1973; 70:1579-84. [PMID: 4514326 PMCID: PMC433546 DOI: 10.1073/pnas.70.5.1579] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.5] [Indexed: 01/11/2023] Open
Abstract
High-resolution pulsed Fourier-transform nuclear magnetic resonance spectroscopy at 22.63 MHz was used to observe the proton-decoupled natural-abundance ¹³C nuclear magnetic resonance spectra of CDCl₃ solutions of the methyl esters of prostaglandins F₁α, 15-epi-F₁α, F₂α, F₂β, E₂, A₂, 13-dehydro-F₂α, 13-dehydro-F₃α, two intermediates on the synthetic pathway to 13-dehydro-PGF₃α, and of 7-oxa-PGF₁α. All resonances were assigned by chemical shift comparisons and single-frequency off-resonance proton decoupling. With two exceptions, all the lines of the spectra are well-resolved single-carbon resonances. Those due to the cyclopentane and vinyl carbons are most sensitive to structural changes. Some of these effects can be rationalized in terms of the preferred conformations of the molecules.
|
research-article |
52 |
28 |
33
|
Jiang X, Wallstrom G, Cooper GF, Wagner MM. Bayesian prediction of an epidemic curve. J Biomed Inform 2008; 42:90-9. [PMID: 18593605 DOI: 10.1016/j.jbi.2008.05.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 02/01/2008] [Revised: 05/23/2008] [Accepted: 05/30/2008] [Indexed: 11/17/2022]
Abstract
An epidemic curve is a graph in which the number of new cases of an outbreak disease is plotted against time. Epidemic curves are ordinarily constructed after the disease outbreak is over. However, a good estimate of the epidemic curve early in an outbreak would be invaluable to health care officials. Currently, techniques for predicting the severity of an outbreak are very limited. As far as predicting the number of future cases, ordinarily epidemiologists simply make an educated guess as to how many people might become affected. We develop a model for estimating an epidemic curve early in an outbreak, and we show results of experiments testing its accuracy.
|
Research Support, U.S. Gov't, Non-P.H.S. |
17 |
27 |
34
|
Montefusco DJ, Chen L, Matmati N, Lu S, Newcomb B, Cooper GF, Hannun YA, Lu X. Distinct signaling roles of ceramide species in yeast revealed through systematic perturbation and systems biology analyses. Sci Signal 2013; 6:rs14. [PMID: 24170935 PMCID: PMC3974757 DOI: 10.1126/scisignal.2004515] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Indexed: 12/14/2022]
Abstract
Ceramide, the central molecule of sphingolipid metabolism, is an important bioactive molecule that participates in various cellular regulatory events and that has been implicated in disease. Deciphering ceramide signaling is challenging because multiple ceramide species exist, and many of them may have distinct functions. We applied systems biology and molecular approaches to perturb ceramide metabolism in the yeast Saccharomyces cerevisiae and inferred causal relationships between ceramide species and their potential targets by combining lipidomic, genomic, and transcriptomic analyses. We found that during heat stress, distinct metabolic mechanisms controlled the abundance of different groups of ceramide species and provided experimental support for the importance of the dihydroceramidase Ydc1 in mediating the decrease in dihydroceramides during heat stress. Additionally, distinct groups of ceramide species, with different N-acyl chains and hydroxylations, regulated different sets of functionally related genes, indicating that the structural complexity of these lipids produces functional diversity. The transcriptional modules that we identified provide a resource to begin to dissect the specific functions of ceramides.
|
Research Support, N.I.H., Extramural |
12 |
27 |
35
|
Dara J, Dowling JN, Travers D, Cooper GF, Chapman WW. Evaluation of preprocessing techniques for chief complaint classification. J Biomed Inform 2007; 41:613-23. [PMID: 18166502 DOI: 10.1016/j.jbi.2007.11.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 06/27/2007] [Revised: 11/08/2007] [Accepted: 11/19/2007] [Indexed: 11/28/2022]
Abstract
OBJECTIVE To determine whether preprocessing chief complaints before automatically classifying them into syndromic categories improves classification performance. METHODS We preprocessed chief complaints using two preprocessors (CCP and EMT-P) and evaluated whether classification performance increased for a probabilistic classifier (CoCo) or for a keyword-based classifier (modification of the NYC Department of Health and Mental Hygiene chief complaint coder (KC)). RESULTS CCP exhibited high accuracy (85%) in preprocessing chief complaints but only slightly improved CoCo's classification performance for a few syndromes. EMT-P, which splits chief complaints into multiple problems, substantially increased CoCo's sensitivity for all syndromes. Preprocessing with CCP or EMT-P only improved KC's sensitivity for the Constitutional syndrome. CONCLUSION Evaluation of preprocessing systems should not be limited to accuracy of the preprocessor but should include the effect of preprocessing on syndromic classification. Splitting chief complaints into multiple problems before classification is important for CoCo, but other preprocessing steps only slightly improved classification performance for CoCo and a keyword-based classifier.
|
Research Support, U.S. Gov't, Non-P.H.S. |
18 |
24 |
36
|
Jiang X, Barmada MM, Cooper GF, Becich MJ. A bayesian method for evaluating and discovering disease loci associations. PLoS One 2011; 6:e22075. [PMID: 21853025 PMCID: PMC3154195 DOI: 10.1371/journal.pone.0022075] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 04/17/2011] [Accepted: 06/14/2011] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND A genome-wide association study (GWAS) typically involves examining representative SNPs in individuals from some population. A GWAS data set can concern a million SNPs and may soon concern billions. Researchers investigate the association of each SNP individually with a disease, and it is becoming increasingly commonplace to also analyze multi-SNP associations. Techniques for handling so many hypotheses include the Bonferroni correction and recently developed bayesian methods. These methods can encounter problems. Most importantly, they are not applicable to a complex multi-locus hypothesis which has several competing hypotheses rather than only a null hypothesis. A method that computes the posterior probability of complex hypotheses is a pressing need. METHODOLOGY/FINDINGS We introduce the bayesian network posterior probability (BNPP) method which addresses the difficulties. The method represents the relationship between a disease and SNPs using a directed acyclic graph (DAG) model, and computes the likelihood of such models using a bayesian network scoring criterion. The posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses. The BNPP can not only be used to evaluate a hypothesis that has previously been discovered or suspected, but also to discover new disease loci associations. The results of experiments using simulated and real data sets are presented. Our results concerning simulated data sets indicate that the BNPP exhibits both better evaluation and discovery performance than does a p-value based method. For the real data sets, previous findings in the literature are confirmed and additional findings are found. CONCLUSIONS/SIGNIFICANCE We conclude that the BNPP resolves a pressing problem by providing a way to compute the posterior probability of complex multi-locus hypotheses. A researcher can use the BNPP to determine the expected utility of investigating a hypothesis further. 
Furthermore, we conclude that the BNPP is a promising method for discovering disease loci associations.
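The step this abstract describes, computing the posterior probability of one hypothesis from the likelihoods of all competing hypotheses, is Bayes' rule normalized over the hypothesis set. The Python sketch below shows only that normalization. The function name, the uniform default prior, and the toy log-likelihood values are assumptions, and the Bayesian network scoring criterion that BNPP uses to obtain the likelihoods is not reproduced.

```python
import math


def posterior_probabilities(log_likelihoods, priors=None):
    """Normalize prior x likelihood over a set of competing hypotheses.

    log_likelihoods: dict mapping hypothesis name -> log P(data | hypothesis).
    priors: optional dict mapping hypothesis name -> prior probability;
            a uniform prior is assumed when omitted.
    """
    names = list(log_likelihoods)
    if priors is None:
        priors = {name: 1.0 / len(names) for name in names}
    log_joint = {name: log_likelihoods[name] + math.log(priors[name]) for name in names}
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    peak = max(log_joint.values())
    total = sum(math.exp(value - peak) for value in log_joint.values())
    return {name: math.exp(value - peak) / total for name, value in log_joint.items()}


# Toy example: a null model and one association model whose likelihood is 3x larger.
posteriors = posterior_probabilities({"null": 0.0, "snp_associated": math.log(3.0)})
```

Working in log space matters in practice because likelihoods of models scored on large data sets are often too small to represent directly as floating-point numbers.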
|
Research Support, N.I.H., Extramural |
14 |
24 |
37
|
Cai C, Cooper GF, Lu KN, Ma X, Xu S, Zhao Z, Chen X, Xue Y, Lee AV, Clark N, Chen V, Lu S, Chen L, Yu L, Hochheiser HS, Jiang X, Wang QJ, Lu X. Systematic discovery of the functional impact of somatic genome alterations in individual tumors through tumor-specific causal inference. PLoS Comput Biol 2019; 15:e1007088. [PMID: 31276486 PMCID: PMC6650088 DOI: 10.1371/journal.pcbi.1007088] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/25/2019] [Revised: 07/23/2019] [Accepted: 05/09/2019] [Indexed: 02/07/2023] Open
Abstract
Cancer is mainly caused by somatic genome alterations (SGAs). Precision oncology involves identifying and targeting tumor-specific aberrations resulting from causative SGAs. We developed a novel tumor-specific computational framework that finds the likely causative SGAs in an individual tumor and estimates their impact on oncogenic processes, which suggests the disease mechanisms that are acting in that tumor. This information can be used to guide precision oncology. We report a tumor-specific causal inference (TCI) framework, which estimates causative SGAs by modeling causal relationships between SGAs and molecular phenotypes (e.g., transcriptomic, proteomic, or metabolomic changes) within an individual tumor. We applied the TCI algorithm to tumors from The Cancer Genome Atlas (TCGA) and estimated for each tumor the SGAs that causally regulate the differentially expressed genes (DEGs) in that tumor. Overall, TCI identified 634 SGAs that are predicted to cause cancer-related DEGs in a significant number of tumors, including most of the previously known drivers and many novel candidate cancer drivers. The inferred causal relationships are statistically robust and biologically sensible, and multiple lines of experimental evidence support the predicted functional impact of both the well-known and the novel candidate drivers that are predicted by TCI. TCI provides a unified framework that integrates multiple types of SGAs and molecular phenotypes to estimate which genome perturbations are causally influencing one or more molecular/cellular phenotypes in an individual tumor. By identifying major candidate drivers and revealing their functional impact in an individual tumor, TCI sheds light on the disease mechanisms of that tumor, which can serve to advance our basic knowledge of cancer biology and to support precision oncology that provides tailored treatment of individual tumors.
|
Research Support, U.S. Gov't, Non-P.H.S. |
6 |
21 |
38
|
Hogan WR, Cooper GF, Wallstrom GL, Wagner MM, Depinay JM. The Bayesian aerosol release detector: An algorithm for detecting and characterizing outbreaks caused by an atmospheric release of Bacillus anthracis. Stat Med 2007; 26:5225-52. [DOI: 10.1002/sim.3093] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 01/12/2023]
|
|
18 |
20 |
39
|
Suermondt HJ, Cooper GF. An evaluation of explanations of probabilistic inference. Comput Biomed Res 1993; 26:242-54. [PMID: 8325004 DOI: 10.1006/cbmr.1993.1017] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.6] [Indexed: 01/29/2023]
Abstract
Providing explanations of the conclusions of decision-support systems can be viewed as presenting inference results in a manner that enhances the user's insight into how these results were obtained. The ability to explain inferences has been demonstrated to be an important factor in making medical decision-support systems acceptable for clinical use. Although many researchers in artificial intelligence have explored the automatic generation of explanations for decision-support systems based on symbolic reasoning, research in automated explanation of probabilistic results has been limited. We present the results of an evaluation study of INSITE, a program that explains the reasoning of decision-support systems based on Bayesian belief networks. In the domain of anesthesia, we compared subjects who had access to a belief network with explanations of the inference results to control subjects who used the same belief network without explanations. We show that, compared to control subjects, the explanation subjects demonstrated greater diagnostic accuracy, were more confident about their conclusions, were more critical of the belief network, and found the presentation of the inference results more clear.
|
|
32 |
19 |
40
|
Suermondt H, Cooper GF. Probabilistic inference in multiply connected belief networks using loop cutsets. Int J Approx Reason 1990. [DOI: 10.1016/0888-613x(90)90003-k] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
|
|
35 |
19 |
41
|
Cooper GF, Villamarin R, Rich Tsui FC, Millett N, Espino JU, Wagner MM. A method for detecting and characterizing outbreaks of infectious disease from clinical reports. J Biomed Inform 2014; 53:15-26. [PMID: 25181466 DOI: 10.1016/j.jbi.2014.08.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 04/05/2014] [Revised: 08/04/2014] [Accepted: 08/22/2014] [Indexed: 11/30/2022]
Abstract
Outbreaks of infectious disease can pose a significant threat to human health. Thus, detecting and characterizing outbreaks quickly and accurately remains an important problem. This paper describes a Bayesian framework that links clinical diagnosis of individuals in a population to epidemiological modeling of disease outbreaks in the population. Computer-based diagnosis of individuals who seek healthcare is used to guide the search for epidemiological models of population disease that explain the pattern of diagnoses well. We applied this framework to develop a system that detects influenza outbreaks from emergency department (ED) reports. The system diagnoses influenza in individuals probabilistically from evidence in ED reports that are extracted using natural language processing. These diagnoses guide the search for epidemiological models of influenza that explain the pattern of diagnoses well. Those epidemiological models with a high posterior probability determine the most likely outbreaks of specific diseases; the models are also used to characterize properties of an outbreak, such as its expected peak day and estimated size. We evaluated the method using both simulated data and data from a real influenza outbreak. The results provide support that the approach can detect and characterize outbreaks early and well enough to be valuable. We describe several extensions to the approach that appear promising.
|
Research Support, U.S. Gov't, Non-P.H.S. |
11 |
19 |
42
|
Cooper GF, Bahar I, Becich MJ, Benos PV, Berg J, Espino JU, Glymour C, Jacobson RC, Kienholz M, Lee AV, Lu X, Scheines R. The center for causal discovery of biomedical knowledge from big data. J Am Med Inform Assoc 2015; 22:1132-6. [PMID: 26138794 PMCID: PMC5009908 DOI: 10.1093/jamia/ocv059] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 02/18/2015] [Revised: 04/27/2015] [Accepted: 05/02/2015] [Indexed: 01/12/2023] Open
Abstract
The Big Data to Knowledge (BD2K) Center for Causal Discovery is developing and disseminating an integrated set of open source tools that support causal modeling and discovery of biomedical knowledge from large and complex biomedical datasets. The Center integrates teams of biomedical and data scientists focused on the refinement of existing and the development of new constraint-based and Bayesian algorithms based on causal Bayesian networks, the optimization of software for efficient operation in a supercomputing environment, and the testing of algorithms and software developed using real data from 3 representative driving biomedical projects: cancer driver mutations, lung disease, and the functional connectome of the human brain. Associated training activities provide both biomedical and data scientists with the knowledge and skills needed to apply and extend these tools. Collaborative activities with the BD2K Consortium further advance causal discovery tools and integrate tools and resources developed by other centers.
|
Research Support, N.I.H., Extramural |
10 |
19 |
43
|
Yoo C, Cooper GF. An evaluation of a system that recommends microarray experiments to perform to discover gene-regulation pathways. Artif Intell Med 2004; 31:169-82. [PMID: 15219293 DOI: 10.1016/j.artmed.2004.01.018] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.8] [Received: 03/10/2003] [Revised: 04/14/2003] [Accepted: 01/16/2004] [Indexed: 11/23/2022]
Abstract
The main topic of this paper is modeling the expected value of experimentation (EVE) for discovering causal pathways in gene expression data. By experimentation we mean both interventions (e.g., a gene knockout experiment) and observations (e.g., passively observing the expression level of a "wild-type" gene). We introduce a system called GEEVE (causal discovery in Gene Expression data using Expected Value of Experimentation), which implements expected value of experimentation in discovering causal pathways using gene expression data. GEEVE provides the following assistance, which is intended to help biologists in their quest to discover gene-regulation pathways: Recommending which experiments to perform (with a focus on "knockout" experiments) using an expected value of experimentation method. Recommending the number of measurements (observational and experimental) to include in the experimental design, again using an EVE method. Providing a Bayesian analysis that combines prior knowledge with the results of recent microarray experimental results to derive posterior probabilities of gene regulation relationships. In recommending which experiments to perform (and how many times to repeat them) the EVE approach considers the biologist's preferences for which genes to focus the discovery process. Also, since exact EVE calculations are exponential in time, GEEVE incorporates approximation methods. GEEVE is able to combine data from knockout experiments with data from wild-type experiments to suggest additional experiments to perform and then to analyze the results of those microarray experimental results. It models the possibility that unmeasured (latent) variables may be responsible for some of the statistical associations among the expression levels of the genes under study. To evaluate the GEEVE system, we used a gene expression simulator to generate data from specified models of gene regulation. 
The results show that the GEEVE system gives better results than two recently published approaches (1) in learning the generating models of gene regulation and (2) in recommending experiments to perform.
|
Research Support, U.S. Gov't, P.H.S. |
21 |
17 |
44
|
King AJ, Cooper GF, Clermont G, Hochheiser H, Hauskrecht M, Sittig DF, Visweswaran S. Using machine learning to selectively highlight patient information. J Biomed Inform 2019; 100:103327. [PMID: 31676461 PMCID: PMC6932869 DOI: 10.1016/j.jbi.2019.103327] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/23/2019] [Revised: 08/20/2019] [Accepted: 10/28/2019] [Indexed: 02/05/2023]
Abstract
BACKGROUND Electronic medical record (EMR) systems need functionality that decreases cognitive overload by drawing the clinician's attention to the right data, at the right time. We developed a Learning EMR (LEMR) system that learns statistical models of clinician information-seeking behavior and applies those models to direct the display of data in future patients. We evaluated the performance of the system in identifying relevant patient data in intensive care unit (ICU) patient cases. METHODS To capture information-seeking behavior, we enlisted critical care medicine physicians who reviewed a set of patient cases and selected data items relevant to the task of presenting at morning rounds. Using patient EMR data as predictors, we built machine learning models to predict their relevancy. We prospectively evaluated the predictions of a set of high performing models. RESULTS On an independent evaluation data set, 25 models achieved precision of 0.52, 95% CI [0.49, 0.54] and recall of 0.77, 95% CI [0.75, 0.80] in identifying relevant patient data items. For data items missed by the system, the reviewers rated the effect of not seeing those data from no impact to minor impact on patient care in about 82% of the cases. CONCLUSION Data-driven approaches for adaptively displaying data in EMR systems, like the LEMR system, show promise in using information-seeking behavior of clinicians to identify and highlight relevant patient data.
|
Research Support, N.I.H., Extramural |
6 |
16 |
45
|
Suermondt H, Cooper GF. Initialization for the method of conditioning in Bayesian belief networks. ARTIF INTELL 1991. [DOI: 10.1016/0004-3702(91)90091-w] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
|
34 |
15 |
46
|
Yoo C, Thorsson V, Cooper GF. Discovery of causal relationships in a gene-regulation pathway from a mixture of experimental and observational DNA microarray data. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2002:498-509. [PMID: 11928502 DOI: 10.1142/9789812799623_0046] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/18/2023]
Abstract
This paper reports the methods and results of a computer-based search for causal relationships in the gene-regulation pathway of galactose metabolism in the yeast Saccharomyces cerevisiae. The search uses recently published data from cDNA microarray experiments. A Bayesian method was applied to learn causal networks from a mixture of observational and experimental gene-expression data. The observational data were gene-expression levels obtained from unmanipulated "wild-type" cells. The experimental data were produced by deleting ("knocking out") genes and observing the expression levels of other genes. Causal relations predicted from the analysis on 36 galactose gene pairs are reported and compared with the known galactose pathway. Additional exploratory analyses are also reported.
|
|
23 |
13 |
47
|
Carpio H, Cooper GF, Edwards JA, Fried JH, Garay GL, Guzman A, Mendez JA, Muchowski JM, Roszkowski AP, Van Horn AR. Synthesis and gastric antisecretory properties of allenic 16-phenoxy-omega-tetranor prostaglandin E analogs. PROSTAGLANDINS 1987; 33:169-80. [PMID: 3588969 DOI: 10.1016/0090-6980(87)90004-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
In order to improve the modest oral activity of PGE2 as an inhibitor of gastric acid secretion, analogs were prepared and tested orally in histamine-challenged rats. Insertion of a double bond at C-4, resulting in the 4,5-allene analog of PGE1, gave a small increase in activity. Introduction of the omega-tetranor-16-phenoxy lower sidechain, a modification known to enhance activity in the PGF series, gave an eight-fold increase in activity. The analog having both modifications (enprostil, 2) showed a six hundred-fold increase in oral antisecretory activity over PGE2, which may reflect a potentiation effect. Modification of enprostil at C-1 (various esters) and at C-11 (11-methyl, 11-deoxy) generally resulted in compounds of high activity while modifications at other sites generally resulted in significant reductions in activity.
|
|
38 |
13 |
48
|
|
|
54 |
13 |
49
|
Naeini MP, Cooper GF. Binary Classifier Calibration using an Ensemble of Near Isotonic Regression Models. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON DATA MINING 2017; 2016:360-369. [PMID: 28316511 PMCID: PMC5351887 DOI: 10.1109/icdm.2016.0047] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Learning accurate probabilistic models from data is crucial in many practical tasks in data mining. In this paper we present a new non-parametric calibration method called ensemble of near isotonic regression (ENIR). The method can be considered as an extension of BBQ [20], a recently proposed calibration method, as well as the commonly used calibration method based on isotonic regression (IsoRegC) [27]. ENIR is designed to address the key limitation of IsoRegC which is the monotonicity assumption of the predictions. Similar to BBQ, the method post-processes the output of a binary classifier to obtain calibrated probabilities. Thus it can be used with many existing classification models to generate accurate probabilistic predictions. We demonstrate the performance of ENIR on synthetic and real datasets for commonly applied binary classification models. Experimental results show that the method outperforms several common binary classifier calibration methods. In particular on the real data, ENIR commonly performs statistically significantly better than the other methods, and never worse. It is able to improve the calibration power of classifiers, while retaining their discrimination power. The method is also computationally tractable for large scale datasets, as it is O(N log N) time, where N is the number of samples.
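For orientation, the isotonic-regression baseline (IsoRegC) that the abstract contrasts ENIR with can be sketched using the pool-adjacent-violators algorithm. This is a minimal illustration only: it is not the authors' code and it is not ENIR itself. The function names `fit_isotonic` and `calibrate`, the pooling of tied scores, and the step-function lookup are assumptions made for the example.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to probabilities.

    Uses pool-adjacent-violators: sort by score, then merge neighbouring
    blocks whenever their mean labels would decrease.
    """
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for score, label in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == score:
            # Tied scores are pooled into the same block (illustrative choice).
            blocks[-1][0] += label
            blocks[-1][1] += 1
        else:
            blocks.append([float(label), 1, score])
        # Merge backwards while the block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score, uppers, values):
    """Return the calibrated probability of the block that covers `score`."""
    index = min(bisect_left(uppers, score), len(values) - 1)
    return values[index]
```

The monotonicity that this sketch enforces on the calibrated outputs is the assumption the abstract describes ENIR as relaxing.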
|
|
8 |
13 |
50
|
Ye Y, Wagner MM, Cooper GF, Ferraro JP, Su H, Gesteland PH, Haug PJ, Millett NE, Aronis JM, Nowalk AJ, Ruiz VM, López Pineda A, Shi L, Van Bree R, Ginter T, Tsui F. A study of the transferability of influenza case detection systems between two large healthcare systems. PLoS One 2017; 12:e0174970. [PMID: 28380048 PMCID: PMC5381795 DOI: 10.1371/journal.pone.0174970] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2016] [Accepted: 03/17/2017] [Indexed: 01/16/2023] Open
Abstract
Objectives This study evaluates the accuracy and transferability of Bayesian case detection systems (BCD) that use clinical notes from the emergency department (ED) to detect influenza cases. Methods A BCD uses natural language processing (NLP) to infer the presence or absence of clinical findings from ED notes, which are fed into a Bayesian network classifier (BN) to infer patients’ diagnoses. We developed BCDs at the University of Pittsburgh Medical Center (BCDUPMC) and Intermountain Healthcare in Utah (BCDIH). At each site, we manually built a rule-based NLP and trained a Bayesian network classifier from over 40,000 ED encounters between Jan. 2008 and May 2010 using feature selection, machine learning, and expert debiasing approaches. Transferability of a BCD in this study may be impacted by seven factors: development (source) institution, development parser, application (target) institution, application parser, NLP transfer, BN transfer, and classification task. We employed an ANOVA analysis to study their impacts on BCD performance. Results Both BCDs discriminated well between influenza and non-influenza on local test cases (AUCs > 0.92). When tested for transferability using the other institution’s cases, BCDUPMC discriminations declined minimally (AUC decreased from 0.95 to 0.94, p<0.01), and BCDIH discriminations declined more (from 0.93 to 0.87, p<0.0001). We attributed the BCDIH decline to the lower recall of the IH parser on UPMC notes. The ANOVA analysis showed five significant factors: development parser, application institution, application parser, BN transfer, and classification task. Conclusion We demonstrated high influenza case detection performance in two large healthcare systems in two geographically separated regions, providing evidentiary support for the use of automated case detection from routinely collected electronic clinical notes in national influenza surveillance. The transferability could be improved by training the Bayesian network classifier locally and increasing the accuracy of the NLP parser.
|
Journal Article |
8 |
13 |