201
Beuther DA, Krishnan JA. Finding Asthma: Building a Foundation for Care and Discovery. Am J Respir Crit Care Med 2017; 196:401-402. [PMID: 28475356 DOI: 10.1164/rccm.201704-0840ed]
Affiliation(s)
- Jerry A Krishnan
- University of Illinois Hospital & Health Sciences System, Chicago, Illinois
202
Owusu Adjah ES, Montvida O, Agbeve J, Paul SK. Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus. ACTA ACUST UNITED AC 2017. [DOI: 10.2174/1875036201710010016]
Abstract
Background: Identification of diseased patients from primary care-based electronic medical records (EMRs) has methodological challenges that may impact epidemiologic inferences. Objective: To compare deterministic, clinically guided selection algorithms with probabilistic machine learning (ML) methodologies for their ability to identify patients with type 2 diabetes mellitus (T2DM) from large population-based EMRs in a nationally representative primary care database. Methods: Four cohorts of patients with T2DM were defined by a deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM, and the performance of six ML algorithms was compared based on cross-validated true positive rate, true negative rate, and area under the receiver operating characteristic curve. Results: In the database of 11,018,025 research-suitable individuals, 379,657 (3.4%) were coded as having T2DM. A logistic regression classifier was selected as the best ML algorithm and resulted in a cohort of 383,330 patients with potential T2DM. Eighty-three percent (83%) of this cohort had a T2DM code, and 16% of the patients with a T2DM code were not included in this ML cohort. Of those in the ML cohort without a disease code, 52% had at least one measure of elevated glucose level and 22% had received at least one prescription for antidiabetic medication. Conclusion: Deterministic cohort selection based on disease coding potentially introduces a significant misclassification problem. ML techniques allow testing for potential disease predictors and, given meaningful data input, are able to identify diseased cohorts in a holistic way.
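The contrast this abstract draws between deterministic code-based selection and a probabilistic classifier can be sketched in a few lines. The record fields, feature weights, and bias below are illustrative assumptions, not values fitted on the study's data:

```python
import math

# Illustrative feature weights and bias; not from the study.
WEIGHTS = {"has_t2dm_code": 3.0, "elevated_glucose": 2.0, "antidiabetic_rx": 2.5}
BIAS = -4.0

def deterministic_t2dm(record):
    """Deterministic selection: include a record only if it carries a T2DM code."""
    return bool(record["has_t2dm_code"])

def ml_t2dm_probability(record):
    """Probabilistic selection: logistic score over coded and uncoded evidence."""
    z = BIAS + sum(w * float(record[k]) for k, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

# A patient with no diagnosis code but elevated glucose and antidiabetic
# prescriptions is missed by the rule yet flagged by the classifier.
uncoded = {"has_t2dm_code": False, "elevated_glucose": True, "antidiabetic_rx": True}
print(deterministic_t2dm(uncoded))          # False
print(ml_t2dm_probability(uncoded) > 0.5)   # True
```

This is the mechanism behind the abstract's finding that the ML cohort recovered uncoded patients with elevated glucose or antidiabetic prescriptions.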
203
Wang Z, Li L, Glicksberg BS, Israel A, Dudley JT, Ma'ayan A. Predicting age by mining electronic medical records with deep learning characterizes differences between chronological and physiological age. J Biomed Inform 2017; 76:59-68. [PMID: 29113935 PMCID: PMC5716867 DOI: 10.1016/j.jbi.2017.11.003]
Abstract
Determining the discrepancy between the chronological and physiological age of patients is central to preventative and personalized care. Electronic medical records (EMRs) provide rich information about a patient's physiological state, but it is unclear whether such information can be predictive of chronological age. Here we present a deep learning model that uses vital signs and lab tests contained within the EMRs of the Mount Sinai Health System (MSHS) to predict chronological age. The model is trained on 377,686 EMRs from patients aged 18-85 years. The discrepancy between the predicted and real chronological age is then used as a proxy to estimate physiological age. Overall, the model can predict the chronological age of patients with a standard deviation error of ∼7 years. The ages of the youngest and oldest patients were predicted most accurately, while the ages of patients between 40 and 60 years were predicted least accurately. Patients with the largest discrepancy between their physiological and chronological age were further inspected. Patients predicted to be significantly older than their chronological age have higher systolic blood pressure, higher cholesterol, liver damage, and anemia. In contrast, patients predicted to be younger than their chronological age have lower blood pressure and shorter stature, among other indicators; both groups display lower weight than the population average. Using information from ∼10,000 patients from the entire cohort who were also profiled with SNP arrays, a genome-wide association study (GWAS) uncovered several novel genetic variants associated with aging. In particular, significant variants were mapped to genes known to be associated with inflammation, hypertension, lipid metabolism, height, and increased lifespan in mice. Several genes with missense mutations were identified as novel candidate aging genes.
In conclusion, we demonstrate how EMR data can be used to assess overall health via a scale based on deviation from the patient's predicted chronological age.
Affiliation(s)
- Zichen Wang
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA
- Li Li
- Department of Genetics and Genomic Sciences, Institute of Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA
- Benjamin S Glicksberg
- Department of Genetics and Genomic Sciences, Institute of Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA
- Ariel Israel
- Department of Family Medicine, Clalit Health Services, Jerusalem 90258, Israel
- Joel T Dudley
- Department of Genetics and Genomic Sciences, Institute of Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA
- Avi Ma'ayan
- Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, New York, NY 10029, USA
204
Chen J, Wei W, Guo C, Tang L, Sun L. Textual analysis and visualization of research trends in data mining for electronic health records. Health Policy Technol 2017. [DOI: 10.1016/j.hlpt.2017.10.003]
205
Mikalsen KØ, Soguero-Ruiz C, Jensen K, Hindberg K, Gran M, Revhaug A, Lindsetmo RO, Skrøvseth SO, Godtliebsen F, Jenssen R. Using anchors from free text in electronic health records to diagnose postoperative delirium. Comput Methods Programs Biomed 2017; 152:105-114. [PMID: 29054250 DOI: 10.1016/j.cmpb.2017.09.014]
Abstract
OBJECTIVES Postoperative delirium is a common complication after major surgery among the elderly. Despite its potentially serious consequences, the complication often goes undetected and undiagnosed. To provide diagnosis support, one could potentially exploit the information hidden in free-text documents from electronic health records using data-driven clinical decision support tools. However, these tools depend on labeled training data, which can be both time-consuming and expensive to create. METHODS The recent learning-with-anchors framework resolves this problem by transforming key observations (anchors) into labels. This is a promising framework, but it relies heavily on clinicians' knowledge to specify good anchor choices in order to perform well. In this paper we propose a novel method for specifying anchors from free-text documents, following an exploratory data analysis approach based on clustering and data visualization techniques. We investigate the use of the new framework as a way to detect postoperative delirium. RESULTS By applying the proposed method to medical data gathered from a Norwegian university hospital, we increase the area under the precision-recall curve from 0.51 to 0.96 compared to baselines. CONCLUSIONS The proposed approach can be used as a framework for clinical decision support for postoperative delirium.
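The headline metric here, area under the precision-recall curve, is commonly approximated by average precision: the mean of the precision values taken at the rank of each true positive. A self-contained sketch with toy labels and scores (not the paper's data):

```python
def average_precision(y_true, scores):
    """Approximate the area under the precision-recall curve as average
    precision: mean of precision@k at the rank k of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    positives = sum(y_true)
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank
    return ap / positives

# Toy example: two positives, one ranked first and one ranked third.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(y_true, scores))  # (1/1 + 2/3) / 2 = 0.8333...
```

A jump from 0.51 to 0.96 on this metric means the anchor-derived labels rank true delirium cases far higher than the baselines do.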
Affiliation(s)
- Karl Øyvind Mikalsen
- Department of Mathematics and Statistics, UiT The Arctic University of Norway, Tromsø, Norway; UiT Machine Learning Group, Norway
- Cristina Soguero-Ruiz
- UiT Machine Learning Group, Norway; Department of Signal Theory and Communications, Telematics and Computing, Universidad Rey Juan Carlos, Fuenlabrada, Spain
- Kasper Jensen
- Norwegian Centre for E-health Research, University Hospital of North Norway (UNN), Tromsø, Norway
- Kristian Hindberg
- Department of Mathematics and Statistics, UiT The Arctic University of Norway, Tromsø, Norway
- Mads Gran
- Department of Gastrointestinal Surgery, UNN, Tromsø, Norway
- Arthur Revhaug
- Department of Gastrointestinal Surgery, UNN, Tromsø, Norway; Clinic for Surgery, Cancer and Women's Health, UNN, Tromsø, Norway; Institute of Clinical Medicine, UiT, Tromsø, Norway
- Rolv-Ole Lindsetmo
- Department of Gastrointestinal Surgery, UNN, Tromsø, Norway; Institute of Clinical Medicine, UiT, Tromsø, Norway
- Stein Olav Skrøvseth
- Department of Mathematics and Statistics, UiT The Arctic University of Norway, Tromsø, Norway; Norwegian Centre for E-health Research, University Hospital of North Norway (UNN), Tromsø, Norway
- Fred Godtliebsen
- Department of Mathematics and Statistics, UiT The Arctic University of Norway, Tromsø, Norway
- Robert Jenssen
- Department of Physics and Technology, UiT, Tromsø, Norway; Norwegian Centre for E-health Research, University Hospital of North Norway (UNN), Tromsø, Norway; UiT Machine Learning Group, Norway
206
Esteban S, Rodríguez Tablado M, Peper FE, Mahumud YS, Ricci RI, Kopitowski KS, Terrasa SA. Development and validation of various phenotyping algorithms for Diabetes Mellitus using data from electronic health records. Comput Methods Programs Biomed 2017; 152:53-70. [PMID: 29054261 DOI: 10.1016/j.cmpb.2017.09.009]
Abstract
BACKGROUND AND OBJECTIVE Recent progress towards precision medicine has encouraged the use of electronic health records (EHRs) as a source for the large amounts of data required to study the effect of treatments or risk factors in more specific subpopulations. Phenotyping algorithms make it possible to automatically classify patients according to their particular electronic phenotype, thus facilitating the setup of retrospective cohorts. Our objective is to compare the performance of different classification strategies (using standardized problems only, rule-based algorithms, statistical learning algorithms (six learners), and stacked generalization (five versions)) for categorizing patients according to their diabetic status (diabetic, non-diabetic, and inconclusive; diabetes of any type) using information extracted from EHRs. METHODS Patient information was extracted from the EHR at Hospital Italiano de Buenos Aires, Buenos Aires, Argentina. For the derivation and validation datasets, two probabilistic samples of patients from different years (2005: n = 1663; 2015: n = 800) were extracted. The only inclusion criterion was age (≥40 and <80 years). Four researchers manually reviewed all records and classified patients according to their diabetic status (diabetic: diabetes registered as a health problem or fulfilling the ADA criteria; non-diabetic: not fulfilling the ADA criteria and having at least one fasting glycemia below 126 mg/dL; inconclusive: no data regarding their diabetic status or only one abnormal value). The best performing algorithms within each strategy were tested on the validation set. RESULTS The standardized codes algorithm achieved a Kappa coefficient of 0.59 (95% CI 0.49, 0.59) in the validation set. The Boolean logic algorithm reached 0.82 (95% CI 0.76, 0.88). A slightly higher value was achieved by the feedforward neural network (0.90, 95% CI 0.85, 0.94). The best performing learner was the stacked generalization meta-learner, which reached a Kappa coefficient of 0.95 (95% CI 0.91, 0.98). CONCLUSIONS The stacked generalization strategy and the feedforward neural network showed the best classification metrics in the validation set. Implementing these algorithms makes it possible to exploit the data of thousands of patients accurately.
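Stacked generalization, the best performer above, can be sketched minimally: base learners are fit first and a meta-learner is then fit on their outputs. Everything below (the synthetic data, the sigmoid base learners, the least-squares meta-step) is an illustrative assumption, not the paper's five-version setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic task: a latent signal s determines the label; each base learner
# sees only one noisy view of s, so neither alone is very accurate.
n = 2000
s = rng.normal(size=n)
y = (s > 0).astype(float)
view1 = s + rng.normal(size=n)
view2 = s + rng.normal(size=n)

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
score1, score2 = sigmoid(view1), sigmoid(view2)  # base learners' soft outputs
base1 = (score1 > 0.5).astype(float)             # hard predictions of base 1
base2 = (score2 > 0.5).astype(float)             # hard predictions of base 2

# Meta-learner (level one): least-squares fit on the base scores, thresholded.
Z = np.column_stack([np.ones(n), score1, score2])
w, *_ = np.linalg.lstsq(Z, y, rcond=None)
stacked = ((Z @ w) > 0.5).astype(float)

acc = lambda pred: float((pred == y).mean())
print(acc(base1), acc(base2), acc(stacked))  # stacking should not underperform
```

Because each base learner carries independent noise, the meta-learner's combination recovers accuracy that neither view provides alone, which mirrors why the stacked meta-learner topped the Kappa table.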
Affiliation(s)
- Santiago Esteban
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Research Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
- Francisco E Peper
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
- Yamila S Mahumud
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
- Ricardo I Ricci
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
- Karin S Kopitowski
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Research Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
- Sergio A Terrasa
- Family and Community Division, Hospital Italiano de Buenos Aires, Buenos Aires, Argentina; Public Health Department, Instituto Universitario Hospital Italiano de Buenos Aires, Buenos Aires, Argentina
207
Escudié JB, Rance B, Malamut G, Khater S, Burgun A, Cellier C, Jannot AS. A novel data-driven workflow combining literature and electronic health records to estimate comorbidities burden for a specific disease: a case study on autoimmune comorbidities in patients with celiac disease. BMC Med Inform Decis Mak 2017; 17:140. [PMID: 28962565 PMCID: PMC5622531 DOI: 10.1186/s12911-017-0537-y]
Abstract
BACKGROUND Data collected in EHRs have been widely used to identify specific conditions; however, there is still a need for methods to define comorbidities and for sources to identify comorbidity burden. We propose an approach to assess the comorbidity burden of a specific disease using literature and EHR data sources, applied here to autoimmune diseases in celiac disease (CD). METHODS We generated a restricted set of comorbidities using the literature (via the MeSH® co-occurrence file) and extracted the 15 autoimmune diseases most frequently co-occurring with CD. We used mappings of the comorbidities to EHR terminologies: ICD-10 (billing codes), ATC (drugs), and UMLS (clinical reports). Finally, we extracted the concepts from the different data sources. We evaluated our approach using the correlation between prevalence estimates in our cohort and co-occurrence ranking in the literature. RESULTS We retrieved the comorbidities for 741 patients with CD; 18.1% of patients had at least one of the 15 studied autoimmune disorders. Overall, 79.3% of the mapped concepts were detected only in text, 5.3% only in ICD codes and/or drug prescriptions, and 15.4% could be found in both sources. Prevalence estimates in our cohort were correlated with the literature (Spearman's coefficient 0.789, p = 0.0005). The three most prevalent comorbidities were thyroiditis 12.6% (95% CI 10.1-14.9), type 1 diabetes 2.3% (95% CI 1.2-3.4), and dermatitis herpetiformis 2.0% (95% CI 1.0-3.0). CONCLUSION We introduced a process that leveraged the MeSH terminology to identify relevant autoimmune comorbidities of CD and several EHR data sources to phenotype a large population of CD patients. We achieved prevalence estimates comparable to the literature.
Affiliation(s)
- Jean-Baptiste Escudié
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
- INSERM UMRS 1138, Paris Descartes University, Paris, France
- Pôle Informatique Médicale et Santé Publique, Hôpital Européen Georges Pompidou, 20 rue Leblanc, 75015 Paris, France
| | - Bastien Rance
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
- INSERM UMRS 1138, Paris Descartes University, Paris, France
| | - Georgia Malamut
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
| | - Sherine Khater
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
| | - Anita Burgun
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
- INSERM UMRS 1138, Paris Descartes University, Paris, France
| | | | - Anne-Sophie Jannot
- Georges Pompidou European Hospital (HEGP), AP-HP, Paris, France
- INSERM UMRS 1138, Paris Descartes University, Paris, France
| |
Collapse
|
208
Gustafson E, Pacheco J, Wehbe F, Silverberg J, Thompson W. A Machine Learning Algorithm for Identifying Atopic Dermatitis in Adults from Electronic Health Records. IEEE Int Conf Healthc Inform 2017; 2017:83-90. [PMID: 29104964 DOI: 10.1109/ichi.2017.31]
Abstract
The current work aims to identify patients with atopic dermatitis for inclusion in genome-wide association studies (GWAS). Here we describe a machine learning-based phenotype algorithm. Using the electronic health record (EHR), we combined coded information with information extracted from encounter notes as features in a lasso logistic regression. Our algorithm achieves high positive predictive value (PPV) and sensitivity, improving on previous algorithms with low sensitivity. These results demonstrate the utility of natural language processing (NLP) and machine learning for EHR-based phenotyping.
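The two metrics this abstract optimizes, positive predictive value (PPV) and sensitivity, reduce to simple counts over the confusion matrix. A minimal sketch with made-up labels:

```python
def ppv_sensitivity(y_true, y_pred):
    """Positive predictive value and sensitivity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    return ppv, sens

# 4 true cases; the algorithm flags 3 of them plus 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(ppv_sensitivity(y_true, y_pred))  # (0.75, 0.75)
```

For GWAS cohort selection, PPV matters most (mislabeled cases dilute the association signal), which is why the abstract stresses improving sensitivity without giving up PPV.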
Affiliation(s)
- Erin Gustafson
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611
- Jennifer Pacheco
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611
- Firas Wehbe
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611
- Jonathan Silverberg
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611
- William Thompson
- Feinberg School of Medicine, Northwestern University, Chicago, Illinois 60611
209
Schlegel DR, Ficheur G. Secondary Use of Patient Data: Review of the Literature Published in 2016. Yearb Med Inform 2017; 26:68-71. [PMID: 29063536 DOI: 10.15265/iy-2017-032]
Abstract
Objectives: To summarize recent research and emerging trends in the area of secondary use of healthcare data, and to present the best papers published in this field, selected to appear in the 2017 edition of the IMIA Yearbook. Methods: A literature review of articles published in 2016 and related to secondary use of healthcare data was performed using two bibliographic databases. From this search, 941 papers were identified. The section editors independently reviewed the papers for relevance and impact, resulting in a consensus list of 14 candidate best papers. External reviewers examined each of the candidate best papers, and the final selection was made by the editorial board of the Yearbook. Results: From the 941 retrieved papers, the selection process resulted in four best papers. These papers discuss data quality concerns, issues in preserving the privacy of patients in shared datasets, and methods of decision support when consuming large amounts of raw electronic health record (EHR) data. Conclusion: In 2016, significant effort was put into the development of new systems that aim to avoid the need for extensive human interpretation and pre-processing of healthcare data, though this is still an emerging area of research. The value of temporal relationships between data received significant study, as did effective information sharing while preserving patient privacy.
210
Blecker S, Sontag D, Horwitz LI, Kuperman G, Park H, Reyentovich A, Katz SD. Early Identification of Patients With Acute Decompensated Heart Failure. J Card Fail 2017; 24:357-362. [PMID: 28887109 DOI: 10.1016/j.cardfail.2017.08.458]
Abstract
BACKGROUND Interventions to reduce readmissions after acute heart failure hospitalization require early identification of patients. The purpose of this study was to develop and test the accuracy of various approaches to identify patients with acute decompensated heart failure (ADHF) using data derived from the electronic health record. METHODS AND RESULTS We included 37,229 hospitalizations of adult patients at a single hospital during 2013-2015. We developed 4 algorithms to identify hospitalizations with a principal discharge diagnosis of ADHF: 1) presence of 1 of 3 clinical characteristics; 2) logistic regression of 31 structured data elements; 3) machine learning with unstructured data; and 4) machine learning using both structured and unstructured data. In data validation, algorithm 1 had a sensitivity of 0.98 and a positive predictive value (PPV) of 0.14 for ADHF. Algorithm 2 had an area under the receiver operating characteristic curve (AUC) of 0.96, and both machine learning algorithms had AUCs of 0.99. Based on a brief survey of 3 providers who perform chart review for ADHF, we estimated that providers spent 8.6 minutes per chart review; using this parameter, we estimated that providers would spend 61.4, 57.3, 28.7, and 25.3 minutes on secondary chart review for each case of ADHF if initial screening were done with algorithms 1, 2, 3, and 4, respectively. CONCLUSIONS Machine learning algorithms with unstructured notes had the best performance for identification of ADHF and can improve provider efficiency for delivery of quality improvement interventions.
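The review-burden figures quoted above are consistent with dividing the 8.6-minute chart-review time by each algorithm's positive predictive value, since 1/PPV charts must be reviewed per confirmed case. A quick check for algorithm 1, the only one whose PPV the abstract reports:

```python
# Charts reviewed per confirmed ADHF case = 1 / PPV, so the expected review
# time per case is minutes_per_chart / PPV.
def review_minutes_per_case(minutes_per_chart, ppv):
    return minutes_per_chart / ppv

# Algorithm 1: 8.6 minutes per chart and PPV 0.14, as reported.
print(round(review_minutes_per_case(8.6, 0.14), 1))  # 61.4, matching the abstract
```

The same formula back-solves implied PPVs of roughly 0.15, 0.30, and 0.34 for algorithms 2-4, though those values are inferred here, not reported.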
Affiliation(s)
- Saul Blecker
- Department of Population Health, New York University School of Medicine, New York, New York; Department of Medicine, New York University School of Medicine, New York, New York
- David Sontag
- Department of Computer Science, New York University, New York, New York
- Leora I Horwitz
- Department of Population Health, New York University School of Medicine, New York, New York; Department of Medicine, New York University School of Medicine, New York, New York
- Hannah Park
- Department of Population Health, New York University School of Medicine, New York, New York
- Alex Reyentovich
- Department of Medicine, New York University School of Medicine, New York, New York
- Stuart D Katz
- Department of Medicine, New York University School of Medicine, New York, New York
211
Chakrabarti S, Sen A, Huser V, Hruby GW, Rusanov A, Albers DJ, Weng C. An Interoperable Similarity-based Cohort Identification Method Using the OMOP Common Data Model version 5.0. J Healthc Inform Res 2017; 1:1-18. [PMID: 28776047 DOI: 10.1007/s41666-017-0005-6]
Abstract
Cohort identification for clinical studies tends to be laborious, time-consuming, and expensive. Developing automated or semi-automated methods for cohort identification is one of the "holy grails" of biomedical informatics. We propose a high-throughput similarity-based cohort identification algorithm that applies numerical abstractions to Electronic Health Record (EHR) data. We implement this algorithm using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), which enables sites using this standardized EHR data representation to adopt the algorithm with minimal local implementation effort. We validate its performance on a retrospective cohort identification task for six clinical trials conducted at the Columbia University Medical Center. Our algorithm achieves an average Area Under the Curve (AUC) of 0.966 and an average Precision at 5 of 0.983. This interoperable method promises to enable efficient cohort identification in EHR databases. We discuss suitable applications of the method, note its limitations, and propose future work.
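The general shape of similarity-based cohort identification can be sketched as ranking candidates by similarity to known cohort members. The vectors, the centroid representation, and the use of cosine similarity below are illustrative assumptions, not the paper's exact numerical abstractions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Each patient is a vector of numerical abstractions (e.g., concept counts);
# candidates are ranked against the mean vector of known cohort members.
seed_members = [[5, 0, 2], [4, 1, 3]]
centroid = [sum(col) / len(seed_members) for col in zip(*seed_members)]

candidates = {"A": [5, 0, 3], "B": [0, 6, 0], "C": [3, 1, 2]}
ranked = sorted(candidates, key=lambda pid: cosine(candidates[pid], centroid),
                reverse=True)
print(ranked)  # most cohort-like candidates first
```

Metrics like Precision at 5 then simply ask how many of the top-5 ranked candidates truly belong to the cohort.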
Affiliation(s)
- Shreya Chakrabarti
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
- Anando Sen
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
- Vojtech Huser
- National Institutes of Health, National Library of Medicine, Bethesda, MD 20892
- Gregory W Hruby
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
- Alexander Rusanov
- Department of Anesthesiology, Columbia University, New York, NY 10032
- David J Albers
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032
212
El Naqa I, Kerns SL, Coates J, Luo Y, Speers C, West CML, Rosenstein BS, Ten Haken RK. Radiogenomics and radiotherapy response modeling. Phys Med Biol 2017; 62:R179-R206. [PMID: 28657906 PMCID: PMC5557376 DOI: 10.1088/1361-6560/aa7c55]
Abstract
Advances in patient-specific information and biotechnology have contributed to a new era of computational medicine. Radiogenomics has emerged as a new field that investigates the role of genetics in treatment response to radiation therapy. Radiation oncology is currently attempting to embrace these recent advances and add to its rich history by maintaining its prominent role as a quantitative leader in oncologic response modeling. Here, we provide an overview of radiogenomics starting with genotyping, data aggregation, and application of different modeling approaches based on modifying traditional radiobiological methods or application of advanced machine learning techniques. We highlight the current status and potential for this new field to reshape the landscape of outcome modeling in radiotherapy and drive future advances in computational oncology.
Affiliation(s)
- Issam El Naqa
- Department of Radiation Oncology, University of Michigan, Ann Arbor, MI, United States of America
213
Abstract
OBJECTIVES To summarize significant developments in Clinical Research Informatics (CRI) over the past two years and discuss future directions. METHODS Survey of advances, open problems, and opportunities in this field, based on exploration of the current literature. RESULTS Recent advances are structured according to three use cases of clinical research: protocol feasibility, patient identification/recruitment, and clinical trial execution. DISCUSSION CRI is an evolving, dynamic field of research. Global collaboration, open metadata, content standards with semantics, and computable eligibility criteria are key success factors for future developments in CRI.
Affiliation(s)
- M Dugas
- Institute of Medical Informatics, University of Münster, Albert-Schweitzer-Campus 1, A11, D-48149 Münster, Germany
214
Clifton DA, Niehaus KE, Charlton P, Colopy GW. Health Informatics via Machine Learning for the Clinical Management of Patients. Yearb Med Inform 2015; 10:38-43. [PMID: 26293849 DOI: 10.15265/iy-2015-014]
Abstract
OBJECTIVES To review how health informatics systems based on machine learning methods have impacted the clinical management of patients by affecting clinical practice. METHODS We reviewed literature from 2010-2015 from databases such as PubMed, IEEE Xplore, and INSPEC, in which methods based on machine learning are likely to be reported. We bring together a broad body of literature, aiming to identify the leading examples of health informatics that have advanced the methodology of machine learning. While individual methods may have further examples that might be added, we have chosen some of the most representative, informative exemplars in each case. RESULTS Our survey highlights that, while much research is taking place in this high-profile field, examples that affect the clinical management of patients are seldom found. We show that substantial progress is being made in terms of methodology, often by data scientists working in close collaboration with clinical groups. CONCLUSIONS Health informatics systems based on machine learning are in their infancy, and the translation of such systems into clinical management has yet to be performed at scale.
Affiliation(s)
- David A Clifton
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, UK
215
Wang M, Cyhaniuk A, Cooper DL, Iyer NN. Identification of people with acquired hemophilia in a large electronic health record database. J Blood Med 2017; 8:89-97. [PMID: 28769599 PMCID: PMC5529096 DOI: 10.2147/jbm.s136060]
Abstract
Background Electronic health records (EHRs) can provide insights into diagnoses, treatment patterns, and clinical outcomes. Acquired hemophilia (AH) is an ultrarare bleeding disorder characterized by factor VIII-inhibiting autoantibodies. Aim To identify patients with AH using an EHR database. Methods Records were accessed from a large EHR database (Humedica) between January 1, 2007 and July 31, 2013. Broad selection criteria were applied using the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) code for intrinsic circulating anticoagulants (286.5 and all subcodes) and confirmation of records 6 months before and 12 months after the first diagnosis. Additional selection criteria included mention of "bleeding" within physician notes identified via natural language processing output, together with a normal prothrombin time and a prolonged activated partial thromboplastin time. Results Of 6,348 patients with a diagnosis code of 286.5 or any of its subcodes, 16 males and 15 females met the selection criteria. The most commonly reported bleeding locations were gastrointestinal (23%), vaginal (16%), and endocrine (13%). A wide range of comorbidities was reported. Natural language processing identified chart-note mention of "hemophilia" in 3 patients (10%), "bruise" in 15 patients (48%), and "pain" in all 31 patients. No patients received a prescription for approved/recommended AH treatments. Four patient cases were reviewed to validate whether the identified cohort had AH; each patient had bleeding symptoms, a normal prothrombin time, and a prolonged activated partial thromboplastin time, although none received hemostatic treatments. Conclusion In ultrarare disorders, ICD-9-CM coding alone may be insufficient to identify patient cohorts; multimodal analysis combined with in-depth reviews of physician notes may be more effective.
Affiliation(s)
- Michael Wang
- Hemophilia and Thrombosis Center, University of Colorado School of Medicine, Aurora, CO
- David L Cooper
- Clinical Development, Medical and Regulatory Affairs, Novo Nordisk Inc., Plainsboro, NJ, USA
- Neeraj N Iyer
- Clinical Development, Medical and Regulatory Affairs, Novo Nordisk Inc., Plainsboro, NJ, USA
216
Clark C, Wellner B, Davis R, Aberdeen J, Hirschman L. Automatic classification of RDoC positive valence severity with a neural network. J Biomed Inform 2017; 75S:S120-S128. [PMID: 28694118 DOI: 10.1016/j.jbi.2017.07.005]
Abstract
OBJECTIVE Our objective was to develop a machine learning-based system to determine the severity of Positive Valence symptoms for a patient, based on information included in their initial psychiatric evaluation. Severity was rated by experts on an ordinal scale of 0-3 as follows: 0 (absent=no symptoms), 1 (mild=modest significance), 2 (moderate=requires treatment), and 3 (severe=causes substantial impairment). MATERIALS AND METHODS We treated the task of assigning Positive Valence severity as a text classification problem. During development, we experimented with regularized multinomial logistic regression classifiers, gradient boosted trees, and feedforward, fully connected neural networks. We found both regularization and feature selection via mutual information to be very important in preventing models from overfitting the data. Our best configuration was a neural network with three fully connected hidden layers and rectified linear unit activations. RESULTS Our best performing system achieved a score of 77.86%. The evaluation metric is an inverse normalization of the mean absolute error, presented as a percentage between 0 and 100, where 100 means the highest performance. Error analysis showed that 90% of the system errors involved neighboring severity categories. CONCLUSION Machine learning text classification techniques with feature selection can be trained to recognize broad differences in Positive Valence symptom severity with a modest amount of training data (in this case 600 documents, 167 of which were unannotated). An increase in the amount of annotated data can increase accuracy of symptom severity classification by several percentage points. Additional features and/or a larger training corpus may further improve accuracy.
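The evaluation metric mentioned above, an inverse normalization of mean absolute error onto a 0-100 scale, admits a simple reading. The sketch below assumes the score is 100·(1 − MAE/3) for the 0-3 ordinal scale; the exact shared-task formula may differ, so treat this as an interpretation rather than the published definition.

```python
def inverse_normalized_mae(y_true, y_pred, max_severity=3):
    """Map mean absolute error on a 0..max_severity ordinal scale to a
    0-100 score, where 100 means perfect prediction (MAE = 0) and 0
    means the worst possible error (MAE = max_severity)."""
    assert len(y_true) == len(y_pred) and len(y_true) > 0
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return 100.0 * (1.0 - mae / max_severity)
```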
217
Kagawa R, Kawazoe Y, Ida Y, Shinohara E, Tanaka K, Imai T, Ohe K. Development of Type 2 Diabetes Mellitus Phenotyping Framework Using Expert Knowledge and Machine Learning Approach. J Diabetes Sci Technol 2017; 11:791-799. [PMID: 27932531 PMCID: PMC5588819 DOI: 10.1177/1932296816681584]
Abstract
BACKGROUND Phenotyping is an automated technique that can be used to distinguish patients based on electronic health records. To improve the quality of medical care and advance type 2 diabetes mellitus (T2DM) research, the demand for T2DM phenotyping has been increasing. Some existing phenotyping algorithms are not sufficiently accurate for screening or identifying clinical research subjects. OBJECTIVE We propose a practical phenotyping framework using both expert knowledge and a machine learning approach to develop 2 phenotyping algorithms: one is for screening; the other is for identifying research subjects. METHODS We employ expert knowledge as rules to exclude obvious control patients and machine learning to increase accuracy for complicated patients. We developed phenotyping algorithms on the basis of our framework and performed binary classification to determine whether a patient has T2DM. To facilitate development of practical phenotyping algorithms, this study introduces new evaluation metrics: area under the precision-sensitivity curve (AUPS) with a high sensitivity and AUPS with a high positive predictive value. RESULTS The proposed phenotyping algorithms based on our framework show higher performance than baseline algorithms. Our proposed framework can be used to develop 2 types of phenotyping algorithms depending on the tuning approach: one for screening, the other for identifying research subjects. CONCLUSIONS We develop a novel phenotyping framework that can be easily implemented on the basis of proper evaluation metrics, which are in accordance with users' objectives. The phenotyping algorithms based on our framework are useful for extraction of T2DM patients in retrospective studies.
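The area under the precision-sensitivity curve (AUPS) introduced above is, in effect, the area under a precision-recall curve. A minimal trapezoidal sketch follows; it is a generic illustration, not the authors' implementation, and the tie-handling and anchoring choices are assumptions.

```python
def aups(scores, labels):
    """Trapezoidal area under the precision-sensitivity (recall) curve.

    Sweeps every distinct classifier score as a decision threshold; for
    tied sensitivity values the best precision is kept, and the curve is
    anchored at sensitivity 0 with its first precision value.
    """
    best = {}
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp + fp == 0 or tp + fn == 0:
            continue  # precision or sensitivity undefined at this threshold
        sens, prec = tp / (tp + fn), tp / (tp + fp)
        best[sens] = max(best.get(sens, 0.0), prec)
    if not best:
        return 0.0
    pts = sorted(best.items())
    pts.insert(0, (0.0, pts[0][1]))  # anchor the curve at sensitivity 0
    return sum((s1 - s0) * (p0 + p1) / 2.0
               for (s0, p0), (s1, p1) in zip(pts, pts[1:]))
```

Tuning the operating region toward high sensitivity versus high precision is what yields the two algorithm variants (screening versus research-subject identification) the abstract describes.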
Affiliation(s)
- Rina Kagawa
- Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Yoshimasa Kawazoe
- Department of Healthcare Information Management, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Yusuke Ida
- Department of Healthcare Information Management, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Emiko Shinohara
- Department of Healthcare Information Management, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Katsuya Tanaka
- Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Takeshi Imai
- Center for Disease Biology and Integrative Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Kazuhiko Ohe
- Department of Biomedical Informatics, Graduate School of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
- Department of Healthcare Information Management, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
218
Al Sallakh MA, Vasileiou E, Rodgers SE, Lyons RA, Sheikh A, Davies GA. Defining asthma and assessing asthma outcomes using electronic health record data: a systematic scoping review. Eur Respir J 2017; 49:1700204. [DOI: 10.1183/13993003.00204-2017]
Abstract
There is currently no consensus on approaches to defining asthma or assessing asthma outcomes using electronic health record-derived data. We explored these approaches in the recent literature and examined the clarity of reporting. We systematically searched for asthma-related articles published between January 1, 2014 and December 31, 2015, extracted the algorithms used to identify asthma patients and assess severity, control and exacerbations, and examined how the validity of these outcomes was justified. From 113 eligible articles, we found significant heterogeneity in the algorithms used to define asthma (n=66 different algorithms), severity (n=18), control (n=9) and exacerbations (n=24). For the majority of algorithms (n=106), validity was not justified. In the remaining cases, approaches ranged from using algorithms validated in the same databases to using nonvalidated algorithms that were based on clinical judgement or clinical guidelines. The implementation of these algorithms was suboptimally described overall. Although electronic health record-derived data are now widely used to study asthma, the approaches being used are significantly varied and are often underdescribed, rendering it difficult to assess the validity of studies and compare their findings. Given the substantial growth in this body of literature, it is crucial that scientific consensus is reached on the underlying definitions and algorithms.
219
Jonnalagadda SR, Adupa AK, Garg RP, Corona-Cox J, Shah SJ. Text Mining of the Electronic Health Record: An Information Extraction Approach for Automated Identification and Subphenotyping of HFpEF Patients for Clinical Trials. J Cardiovasc Transl Res 2017; 10:313-321. [DOI: 10.1007/s12265-017-9752-2]
220
Williams R, Kontopantelis E, Buchan I, Peek N. Clinical code set engineering for reusing EHR data for research: A review. J Biomed Inform 2017; 70:1-13. [PMID: 28442434 DOI: 10.1016/j.jbi.2017.04.010]
Abstract
INTRODUCTION The construction of reliable, reusable clinical code sets is essential when re-using Electronic Health Record (EHR) data for research. Yet code set definitions are rarely transparent and their sharing is almost non-existent. There is a lack of methodological standards for the management (construction, sharing, revision and reuse) of clinical code sets which needs to be addressed to ensure the reliability and credibility of studies which use code sets. OBJECTIVE To review methodological literature on the management of sets of clinical codes used in research on clinical databases and to provide a list of best practice recommendations for future studies and software tools. METHODS We performed an exhaustive search for methodological papers about clinical code set engineering for re-using EHR data in research. This was supplemented with papers identified by snowball sampling. In addition, a list of e-phenotyping systems was constructed by merging references from several systematic reviews on this topic, and the processes adopted by those systems for code set management were reviewed. RESULTS Thirty methodological papers were reviewed. Common approaches included: creating an initial list of synonyms for the condition of interest (n=20); making use of the hierarchical nature of coding terminologies during searching (n=23); reviewing sets with clinician input (n=20); and reusing and updating an existing code set (n=20). Several open source software tools (n=3) were discovered. DISCUSSION There is a need for software tools that enable users to easily and quickly create, revise, extend, review and share code sets and we provide a list of recommendations for their design and implementation. CONCLUSION Research re-using EHR data could be improved through the further development, more widespread use and routine reporting of the methods by which clinical codes were selected.
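Two of the common approaches the review identifies, seeding a code set from a synonym list and then exploiting the terminology's hierarchy, can be sketched together. The terminology fragment below is invented for illustration; real code sets would be built against a full coding terminology and reviewed with clinician input.

```python
# Toy terminology: each code maps to (description, parent code or None).
# This fragment is invented for illustration only.
TERMINOLOGY = {
    "C10..": ("Diabetes mellitus", None),
    "C108.": ("Insulin dependent diabetes mellitus", "C10.."),
    "C109.": ("Non-insulin dependent diabetes mellitus", "C10.."),
    "C10E.": ("Type 1 diabetes", "C108."),
    "H33..": ("Asthma", None),
}

def match_synonyms(synonyms):
    """Seed step: codes whose description mentions any synonym."""
    return {code for code, (desc, _) in TERMINOLOGY.items()
            if any(s.lower() in desc.lower() for s in synonyms)}

def with_descendants(codes):
    """Hierarchy step: close the seed set under the child relation, so
    codes missed by the text search are still captured via their parents."""
    result = set(codes)
    changed = True
    while changed:
        changed = False
        for code, (_, parent) in TERMINOLOGY.items():
            if parent in result and code not in result:
                result.add(code)
                changed = True
    return result
```

The hierarchy step matters because child codes do not always repeat the parent's wording ("Type 1 diabetes" above never mentions "mellitus"), which is exactly why purely lexical seeding under-recalls.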
Affiliation(s)
- Richard Williams
- MRC Health eResearch Centre, University of Manchester, Manchester, UK; NIHR Greater Manchester Primary Care Patient Safety Translational Research Centre, University of Manchester, Manchester, UK
- Evangelos Kontopantelis
- MRC Health eResearch Centre, University of Manchester, Manchester, UK; NIHR School for Primary Care Research, University of Manchester, Manchester, UK
- Iain Buchan
- MRC Health eResearch Centre, University of Manchester, Manchester, UK; NIHR Greater Manchester Primary Care Patient Safety Translational Research Centre, University of Manchester, Manchester, UK; NIHR Manchester Biomedical Research Centre, University of Manchester, Manchester, UK
- Niels Peek
- MRC Health eResearch Centre, University of Manchester, Manchester, UK; NIHR Greater Manchester Primary Care Patient Safety Translational Research Centre, University of Manchester, Manchester, UK
221
Vaduganathan M, Patel RB, Butler J, Metra M. Integrating electronic health records into the study of heart failure: promises and pitfalls. Eur J Heart Fail 2017; 19:1128-1130. [PMID: 28544192 DOI: 10.1002/ejhf.878]
Affiliation(s)
- Muthiah Vaduganathan
- Brigham and Women's Hospital Heart and Vascular Center and Harvard Medical School, Boston, MA, USA
- Ravi B Patel
- Division of Cardiology, Bluhm Cardiovascular Institute at Northwestern Memorial Hospital, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Javed Butler
- Division of Cardiology, Department of Medicine, Stony Brook University, Stony Brook, NY, USA
- Marco Metra
- Division of Cardiology, Department of Experimental and Applied Medicine, University of Brescia, Brescia, Italy
222
Cox ZL, Lai P, Lewis CM, Lenihan DJ. Centers for Medicare and Medicaid Services' readmission reports inaccurately describe an institution's decompensated heart failure admissions. Clin Cardiol 2017; 40:620-625. [PMID: 28471510 DOI: 10.1002/clc.22711]
Abstract
Hospitals typically use the Centers for Medicare and Medicaid Services' (CMS) Hospital Readmission Reduction Program (HRRP) administrative reports as the standard quantification of heart failure (HF) admissions. We aimed to evaluate the HF admission population identified by the CMS HRRP definition of HF hospital admissions compared with a clinically based HF definition. We evaluated all hospital admissions at an academic medical center over 16 months in patients with Medicare fee-for-service benefits and age ≥65 years. We compared the CMS HRRP HF definition against an electronic HF identification algorithm. Admissions identified solely by the CMS HF definition were manually reviewed by HF providers. Admissions confirmed as having decompensated HF as the primary problem by manual review or by the HF ID algorithm were deemed "HF positive," whereas those refuted were "HF negative." Of the 1672 all-cause admissions evaluated, 708 (42%) were HF positive. The CMS HF definition identified 440 admissions: sensitivity (54%), specificity (94%), positive predictive value (87%), negative predictive value (74%). The CMS HF definition missed 324 HF admissions because of inclusion/exclusion criteria (15%) and decompensated HF being a secondary diagnosis (85%). The CMS HF definition falsely identified 56 admissions as HF. The most common admission reasons in this cohort included elective pacemaker or defibrillator implantations (n = 13), noncardiac dyspnea (n = 9), left ventricular assist device complications (n = 8), and acute coronary syndrome (n = 6). The CMS HRRP HF report is a poor representation of an institution's HF admissions because of limitations in administrative coding and the HRRP HF report inclusion/exclusion criteria.
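The performance figures above follow from the standard confusion-matrix definitions. Reconstructing the counts implied by the abstract (440 flagged admissions of which 56 were false positives, against 708 true HF admissions among 1672 total) reproduces the reported percentages:

```python
def classifier_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Counts implied by the abstract: 440 CMS-flagged admissions with 56
# false positives (tp = 440 - 56 = 384), 708 HF-positive admissions
# (fn = 708 - 384 = 324), and 1672 total (tn = 1672 - 384 - 56 - 324 = 908).
cms_hf = classifier_metrics(tp=384, fp=56, fn=324, tn=908)
```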
Affiliation(s)
- Zachary L Cox
- Department of Pharmacy Practice, Lipscomb University College of Pharmacy, Nashville, Tennessee; Department of Pharmacy, Vanderbilt University Medical Center, Nashville, Tennessee
- Pikki Lai
- Division of Cardiology, Vanderbilt University Medical Center, Nashville, Tennessee
- Connie M Lewis
- Division of Cardiology, Vanderbilt University Medical Center, Nashville, Tennessee
- Daniel J Lenihan
- Division of Cardiology, Vanderbilt University Medical Center, Nashville, Tennessee
223
Development and Prospective Validation of Tools to Accurately Identify Neurosurgical and Critical Care Events in Children With Traumatic Brain Injury. Pediatr Crit Care Med 2017; 18:442-451. [PMID: 28252524 PMCID: PMC5419849 DOI: 10.1097/pcc.0000000000001120]
Abstract
OBJECTIVE To develop and validate case definitions (computable phenotypes) to accurately identify neurosurgical and critical care events in children with traumatic brain injury. DESIGN Prospective observational cohort study, May 2013 to September 2015. SETTING Two large U.S. children's hospitals with level 1 Pediatric Trauma Centers. PATIENTS One hundred seventy-four children less than 18 years old admitted to an ICU after traumatic brain injury. MEASUREMENTS AND MAIN RESULTS Prospective data were linked to database codes for each patient. The outcomes were prospectively identified acute traumatic brain injury, intracranial pressure monitor placement, craniotomy or craniectomy, vascular catheter placement, invasive mechanical ventilation, and new gastrostomy tube or tracheostomy placement. Candidate predictors were database codes present in administrative, billing, or trauma registry data. For each clinical event, we developed and validated penalized regression and Boolean classifiers (models to identify clinical events that take database codes as predictors). We externally validated the best model for each clinical event. The primary model performance measure was accuracy, the percent of test patients correctly classified. The cohort included 174 children who required ICU admission after traumatic brain injury. Simple Boolean classifiers were greater than or equal to 94% accurate for seven of nine clinical diagnoses and events. For central venous catheter placement, no classifier achieved 90% accuracy. Classifier accuracy was dependent on available data fields. Five of nine classifiers were acceptably accurate using only administrative data but three required trauma registry fields and two required billing data. CONCLUSIONS In children with traumatic brain injury, computable phenotypes based on simple Boolean classifiers were highly accurate for most neurosurgical and critical care diagnoses and events. The computable phenotypes we developed and validated can be used in any observational study of children with traumatic brain injury and can reasonably be applied in studies of these interventions in other patient populations.
224
Upadhyaya SG, Murphree DH, Ngufor CG, Knight AM, Cronk DJ, Cima RR, Curry TB, Pathak J, Carter RE, Kor DJ. Automated Diabetes Case Identification Using Electronic Health Record Data at a Tertiary Care Facility. Mayo Clin Proc Innov Qual Outcomes 2017; 1:100-110. [PMID: 30225406 PMCID: PMC6135013 DOI: 10.1016/j.mayocpiqo.2017.04.005]
Abstract
Objective To develop and validate a phenotyping algorithm for the identification of patients with type 1 and type 2 diabetes mellitus (DM) preoperatively using routinely available clinical data from electronic health records. Patients and Methods We used first-order logic rules (if-then-else rules) to imply the presence or absence of DM types 1 and 2. The “if” clause of each rule is a conjunction of logical AND/OR predicates that provides evidence for or against the presence of DM. The rules include International Classification of Diseases, Ninth Revision, Clinical Modification diagnostic codes, outpatient prescription information, laboratory values, and positive annotation of DM in patients’ clinical notes. This study was conducted from March 2, 2015, through February 10, 2016. The performance of our rule-based approach and similar approaches proposed by other institutions was evaluated against a reference standard created by an expert reviewer and implemented for routine clinical care at an academic medical center. Results A total of 4208 surgical patients (mean age, 52 years; males, 48%) were analyzed to develop the phenotyping algorithm. Expert review identified 685 patients (16.28% of the full cohort) as having DM. Our proposed method identified 684 patients (16.25%) as having DM. The algorithm performed well (99.70% sensitivity, 99.97% specificity) and compared favorably with previous approaches. Conclusion Among patients undergoing surgery, determination of DM can be made with high accuracy using simple, computationally efficient rules. Knowledge of patients’ DM status before surgery may alter physicians’ care plans and reduce postsurgical complications. Nevertheless, future efforts are necessary to determine the effect of first-order logic rules on clinical processes and patient outcomes.
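A first-order if-then-else rule of the kind described, conjoining diagnosis codes, prescriptions, laboratory values, and note annotations, can be sketched as follows. The specific predicates, the two-source evidence threshold, and the HbA1c cutoff are illustrative assumptions, not the published rule set.

```python
# Illustrative DM inference rule; thresholds and predicates are assumptions.
HBA1C_DM_THRESHOLD = 6.5  # %, a common diagnostic cutoff (assumed here)

def evidence_for_dm(patient: dict) -> bool:
    """If two or more independent evidence sources agree, infer DM;
    if exactly one source fires, require a confirmatory HbA1c value."""
    code = any(c.startswith("250") for c in patient["icd9_codes"])  # ICD-9-CM DM codes
    rx = bool(patient["antidiabetic_rx"])                           # outpatient prescriptions
    note = patient["note_mentions_dm"]                              # positive note annotation
    lab = (patient["hba1c"] is not None
           and patient["hba1c"] >= HBA1C_DM_THRESHOLD)              # laboratory evidence
    n_sources = sum([code, rx, note])
    if n_sources >= 2:
        return True
    if n_sources == 1:
        return lab
    return False
```

Rules of this shape are cheap to evaluate per patient, which is why the abstract can claim computational efficiency alongside high sensitivity and specificity.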
Key Words
- CCW, Chronic Condition Data Warehouse
- DDC, Durham Diabetes Coalition
- DM, diabetes mellitus
- EHR, electronic health record
- HbA1c of NYC, Hemoglobin A1c of New York City
- HbA1c, hemoglobin A1c
- ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification
- MICS, Mayo Integrated Clinical Systems
- NLP, natural language processing
- SUPREME-DM, Surveillance, Prevention, and Management of Diabetes Mellitus
- T1DM, type 1 diabetes mellitus
- T2DM, type 2 diabetes mellitus
- eMERGE, Electronic Medical Records and Genomics
Affiliation(s)
- Che G Ngufor
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Alison M Knight
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN
- Daniel J Cronk
- Department of Information Technology, Mayo Clinic, Rochester, MN
- Robert R Cima
- Division of Colon and Rectal Surgery, Mayo Clinic, Rochester, MN; Robert D. and Patricia E. Kern Center for Science of Health Care Delivery, Mayo Clinic, Rochester, MN
- Timothy B Curry
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN; Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN
- Rickey E Carter
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Daryl J Kor
- Department of Anesthesiology and Perioperative Medicine, Mayo Clinic, Rochester, MN
225
Marshall EA, Oates JC, Shoaibi A, Obeid JS, Habrat ML, Warren RW, Brady KT, Lenert LA. A population-based approach for implementing change from opt-out to opt-in research permissions. PLoS One 2017; 12:e0168223. [PMID: 28441388 PMCID: PMC5404843 DOI: 10.1371/journal.pone.0168223]
Abstract
Due to recently proposed changes in the Common Rule regarding the collection of research preferences, there is an increased need for efficient methods to document opt-in research preferences at a population level. Previously, our institution developed an opt-out paper-based workflow that could not be utilized for research in a scalable fashion. This project was designed to demonstrate the feasibility of implementing an electronic health record (EHR)-based active opt-in research preferences program. The first phase of implementation required creating and disseminating a patient questionnaire through the EHR portal to populate discrete fields within the EHR indicating patients' preferences for future research study contact (contact) and their willingness to allow anonymised use of excess tissue and fluid specimens (biobank). In the second phase, the questionnaire was presented within a clinic nurse intake workflow in an obstetrical clinic. These permissions were tabulated in registries for use by investigators for feasibility studies and recruitment. The registry was also used for research patient contact management using a new EHR encounter type to differentiate research from clinical encounters. The research permissions questionnaire was sent to 59,670 patients via the EHR portal. Within four months, 21,814 responses (75% willing to participate in biobanking, and 72% willing to be contacted for future research) were received. Each response was recorded within a patient portal encounter to enable longitudinal analysis of responses. We obtained a significantly lower positive response rate from the 264 females who completed the questionnaire in the obstetrical clinic (55% volunteering for the biobank and 52% for contact). We demonstrate that it is possible to establish a research permissions registry using the EHR portal and clinic-based workflows. This patient-centric, population-based, opt-in approach documents preferences in the EHR, allowing linkage of these preferences to health record information.
Affiliation(s)
- Elizabeth A. Marshall
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Jim C. Oates
- Department of Medicine, Division of Rheumatology and Immunology, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Medical Service, Rheumatology Section, Ralph H. Johnson VA Medical Center, Charleston, South Carolina, United States of America
- Azza Shoaibi
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Jihad S. Obeid
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Melissa L. Habrat
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Robert W. Warren
- Department of Pediatrics, Division of Pediatric Rheumatology and Immunology, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Kathleen T. Brady
- Department of Psychiatry and Behavioral Sciences, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Leslie A. Lenert
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Department of Medicine, Division of General Internal Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America
226
EHR-based phenotyping: Bulk learning and evaluation. J Biomed Inform 2017; 70:35-51. [PMID: 28410982 DOI: 10.1016/j.jbi.2017.04.009]
Abstract
In data-driven phenotyping, a core computational task is to identify medical concepts and their variations from sources of electronic health records (EHR) to stratify phenotypic cohorts. A conventional analytic framework for phenotyping largely uses a manual knowledge engineering approach or a supervised learning approach where clinical cases are represented by variables encompassing diagnoses, medicinal treatments and laboratory tests, among others. In such a framework, tasks associated with feature engineering and data annotation remain a tedious and expensive exercise, resulting in poor scalability. In addition, certain clinical conditions, such as those that are rare and acute in nature, may never accumulate sufficient data over time, which poses a challenge to establishing accurate and informative statistical models. In this paper, we use infectious diseases as the domain of study to demonstrate a hierarchical learning method based on ensemble learning that attempts to address these issues through feature abstraction. We use a sparse annotation set to train and evaluate many phenotypes at once, which we call bulk learning. In this batch-phenotyping framework, disease cohort definitions can be learned from within the abstract feature space established by using multiple diseases as a substrate and diagnostic codes as surrogates. In particular, using surrogate labels for model training renders possible its subsequent evaluation using only a sparse annotated sample. Moreover, statistical models can be trained and evaluated, using the same sparse annotation, from within the abstract feature space of low dimensionality that encapsulates the shared clinical traits of these target diseases, collectively referred to as the bulk learning set.
227
Shivade C, Hebert C, Regan K, Fosler-Lussier E, Lai AM. Automatic data source identification for clinical trial eligibility criteria resolution. AMIA Annu Symp Proc 2017; 2016:1149-1158. [PMID: 28269912 PMCID: PMC5333255]
Abstract
Clinical trial coordinators refer to both structured and unstructured sources of data when evaluating a subject for eligibility. While some eligibility criteria can be resolved using structured data, some require manual review of clinical notes. An important step in automating the trial screening process is to be able to identify the right data source for resolving each criterion. In this work, we discuss the creation of an eligibility criteria dataset for clinical trials for patients with two disparate diseases, annotated with the preferred data source for each criterion (i.e., structured or unstructured) by annotators with medical training. The dataset includes 50 heart-failure trials with a total of 766 eligibility criteria and 50 trials for chronic lymphocytic leukemia (CLL) with 677 criteria. Further, we developed machine learning models to predict the preferred data source: kernel methods outperform simpler learning models when used with a combination of lexical, syntactic, semantic, and surface features. Evaluation of these models indicates that the performance is consistent across data from both diagnoses, indicating generalizability of our method. Our findings are an important step towards ongoing efforts for automation of clinical trial screening.
Affiliation(s)
- Courtney Hebert
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH
- Kelly Regan
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH
- Albert M Lai
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH; National Institutes of Health, Rehabilitation Medicine Department, Mark O. Hatfield Clinical Research Center, Bethesda, MD
228
Duan R, Cao M, Wu Y, Huang J, Denny JC, Xu H, Chen Y. An Empirical Study for Impacts of Measurement Errors on EHR based Association Studies. AMIA Annu Symp Proc 2017; 2016:1764-1773. [PMID: 28269935 PMCID: PMC5333313]
Abstract
Over the last decade, Electronic Health Records (EHR) systems have been increasingly implemented at US hospitals. Despite their great potential, the complex and uneven nature of clinical documentation and data quality brings additional challenges for analyzing EHR data. A critical challenge is information bias due to measurement errors in the outcome and covariates. We conducted empirical studies to quantify the impacts of information bias on association studies. Specifically, we designed our simulation studies based on the characteristics of the Electronic Medical Records and Genomics (eMERGE) Network. Through simulation studies, we quantified the loss of power due to misclassifications in case ascertainment and measurement errors in covariate status extraction, with respect to different levels of misclassification rates, disease prevalence, and covariate frequencies. These empirical findings can inform investigators for better understanding of the potential power loss due to misclassification and measurement errors under a variety of conditions in EHR based association studies.
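The bias this study quantifies empirically can also be shown analytically for a 2x2 design: with nondifferential outcome misclassification (fixed sensitivity and specificity of case ascertainment), the expected observed odds ratio shrinks toward 1, which in turn reduces power. The sketch below is a generic illustration of that attenuation, not the authors' simulation design.

```python
def attenuated_odds_ratio(p_out_unexp, true_or, sens, spec):
    """Expected observed odds ratio when a binary outcome is recorded
    with the given sensitivity and specificity, nondifferentially with
    respect to a binary exposure.

    p_out_unexp: true outcome probability in the unexposed group.
    true_or: true exposure-outcome odds ratio.
    """
    # True outcome probability among the exposed, from the odds ratio.
    odds_unexp = p_out_unexp / (1 - p_out_unexp)
    odds_exp = true_or * odds_unexp
    p_out_exp = odds_exp / (1 + odds_exp)

    def observed(p):
        # P(recorded outcome = 1) given true prevalence p.
        return sens * p + (1 - spec) * (1 - p)

    q_exp, q_unexp = observed(p_out_exp), observed(p_out_unexp)
    return (q_exp / (1 - q_exp)) / (q_unexp / (1 - q_unexp))
```

With perfect ascertainment the true odds ratio is recovered; with, say, 80% sensitivity and 95% specificity the observed odds ratio falls strictly between 1 and the true value, illustrating the power loss the abstract describes.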
Affiliation(s)
- Rui Duan: Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
- Ming Cao: School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Yonghui Wu: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Jing Huang: School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Joshua C Denny: Department of Medicine and Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- Hua Xu: School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Yong Chen: Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
229
Goodwin TR, Harabagiu SM. Multi-modal Patient Cohort Identification from EEG Report and Signal Data. AMIA Annu Symp Proc 2017; 2016:1794-1803. [PMID: 28269938] [PMCID: PMC5333290]
Abstract
Clinical electroencephalography (EEG) is the most important investigation in the diagnosis and management of epilepsies. An EEG records the electrical activity along the scalp and measures spontaneous electrical activity of the brain. Because the EEG signal is complex, its interpretation is known to produce moderate inter-observer agreement among neurologists. This problem can be addressed by providing clinical experts with the ability to automatically retrieve similar EEG signals and EEG reports through a patient cohort retrieval system operating on a vast archive of EEG data. In this paper, we present a multi-modal EEG patient cohort retrieval system called MERCuRY which leverages the heterogeneous nature of EEG data by processing both the clinical narratives from EEG reports as well as the raw electrode potentials derived from the recorded EEG signal data. At the core of MERCuRY is a novel multimodal clinical indexing scheme which relies on EEG data representations obtained through deep learning. The index is used by two clinical relevance models that we have generated for identifying patient cohorts satisfying the inclusion and exclusion criteria expressed in natural language queries. Evaluations of the MERCuRY system measured the relevance of the patient cohorts, obtaining a MAP score of 69.87% and an NDCG of 83.21%.
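The idea of blending report-text and signal similarity for cohort ranking can be sketched abstractly. The toy ranker below assumes precomputed embedding vectors under invented keys (`text`, `signal`) and a fixed mixing weight `alpha`; MERCuRY's actual relevance models are learned, so this only illustrates the multimodal scoring idea:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def rank_cohort(query, records, alpha=0.6):
    """Rank patient records by a fixed blend of report-text and
    EEG-signal embedding similarity (alpha is an invented weight)."""
    score = lambda r: (alpha * cosine(query["text"], r["text"])
                       + (1 - alpha) * cosine(query["signal"], r["signal"]))
    return sorted(records, key=score, reverse=True)

query = {"text": [1.0, 0.0], "signal": [0.0, 1.0]}
records = [
    {"id": "p1", "text": [0.9, 0.1], "signal": [0.1, 0.9]},  # similar on both modalities
    {"id": "p2", "text": [0.1, 0.9], "signal": [0.9, 0.1]},  # dissimilar on both
]
ranked = rank_cohort(query, records)
```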
230
Kuo TT, Rao P, Maehara C, Doan S, Chaparro JD, Day ME, Farcas C, Ohno-Machado L, Hsu CN. Ensembles of NLP Tools for Data Element Extraction from Clinical Notes. AMIA Annu Symp Proc 2017; 2016:1880-1889. [PMID: 28269947] [PMCID: PMC5333200]
Abstract
Natural Language Processing (NLP) is essential for concept extraction from narrative text in electronic health records (EHR). To extract numerous and diverse concepts, such as data elements (i.e., important concepts related to a certain medical condition), a plausible solution is to combine various NLP tools into an ensemble to improve extraction performance. However, it is unclear to what extent ensembles of popular NLP tools improve the extraction of numerous and diverse concepts. Therefore, we built an NLP ensemble pipeline to synergize the strength of popular NLP tools using seven ensemble methods, and to quantify the improvement in performance achieved by ensembles in the extraction of data elements for three very different cohorts. Evaluation results show that the pipeline can improve the performance of NLP tools, but there is high variability depending on the cohort.
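One common way to combine concept extractors into an ensemble is simple voting. A minimal sketch follows; the tool names and concepts are invented, and the paper evaluates seven ensemble methods, of which voting is only the simplest:

```python
from collections import Counter

def vote_ensemble(tool_outputs, min_votes=2):
    """Keep a concept if at least `min_votes` of the NLP tools
    extracted it from the note."""
    votes = Counter()
    for concepts in tool_outputs.values():
        votes.update(set(concepts))   # one vote per tool per concept
    return {concept for concept, n in votes.items() if n >= min_votes}

consensus = vote_ensemble({
    "tool_a": ["diabetes", "hypertension", "metformin"],
    "tool_b": ["diabetes", "metformin"],
    "tool_c": ["diabetes", "asthma"],
})
```

Here only concepts extracted by at least two of the three tools survive, trading a little recall for precision.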
Affiliation(s)
- Son Doan: University of California San Diego, La Jolla, CA
- Chun-Nan Hsu: University of California San Diego, La Jolla, CA
231
Jackson RG, Patel R, Jayatilleke N, Kolliakou A, Ball M, Gorrell G, Roberts A, Dobson RJ, Stewart R. Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE) project. BMJ Open 2017; 7:e012012. [PMID: 28096249] [PMCID: PMC5253558] [DOI: 10.1136/bmjopen-2016-012012]
Abstract
OBJECTIVES We sought to use natural language processing to develop a suite of language models to capture key symptoms of severe mental illness (SMI) from clinical text, to facilitate the secondary use of mental healthcare data in research. DESIGN Development and validation of information extraction applications for ascertaining symptoms of SMI in routine mental health records using the Clinical Record Interactive Search (CRIS) data resource; description of their distribution in a corpus of discharge summaries. SETTING Electronic records from a large mental healthcare provider serving a geographic catchment of 1.2 million residents in four boroughs of south London, UK. PARTICIPANTS The distribution of derived symptoms was described in 23 128 discharge summaries from 7962 patients who had received an SMI diagnosis, and 13 496 discharge summaries from 7575 patients who had received a non-SMI diagnosis. OUTCOME MEASURES Fifty SMI symptoms were identified by a team of psychiatrists for extraction based on salience and linguistic consistency in records, broadly categorised under positive, negative, disorganisation, manic and catatonic subgroups. Text models for each symptom were generated using the TextHunter tool and the CRIS database. RESULTS We extracted data for 46 symptoms with a median F1 score of 0.88. Four symptom models performed poorly and were excluded. From the corpus of discharge summaries, it was possible to extract symptomatology in 87% of patients with SMI and 60% of patients with non-SMI diagnosis. CONCLUSIONS This work demonstrates the possibility of automatically extracting a broad range of SMI symptoms from English text discharge summaries for patients with an SMI diagnosis. Descriptive data also indicated that most symptoms cut across diagnoses, rather than being restricted to particular groups.
Affiliation(s)
- Richard G Jackson: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Rashmi Patel: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Nishamali Jayatilleke: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Anna Kolliakou: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Michael Ball: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Genevieve Gorrell: Department of Computer Science, University of Sheffield, Sheffield, UK
- Angus Roberts: Department of Computer Science, University of Sheffield, Sheffield, UK
- Richard J Dobson: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Robert Stewart: Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
232
Claveau V, Silva Oliveira LE, Bouzillé G, Cuggia M, Cabral Moro CM, Grabar N. Numerical Eligibility Criteria in Clinical Protocols: Annotation, Automatic Detection and Interpretation. Artif Intell Med 2017. [DOI: 10.1007/978-3-319-59758-4_22]
233
Cox ZL, Lewis CM, Lai P, Lenihan DJ. Validation of an automated electronic algorithm and "dashboard" to identify and characterize decompensated heart failure admissions across a medical center. Am Heart J 2017; 183:40-48. [PMID: 27979040] [DOI: 10.1016/j.ahj.2016.10.001]
Abstract
BACKGROUND We aim to validate the diagnostic performance of the first fully automatic, electronic heart failure (HF) identification algorithm and evaluate the implementation of an HF Dashboard system with 2 components: real-time identification of decompensated HF admissions and accurate characterization of disease characteristics and medical therapy. METHODS We constructed an HF identification algorithm requiring 3 of 4 identifiers: B-type natriuretic peptide >400 pg/mL; admitting HF diagnosis; history of HF International Classification of Disease, Ninth Revision, diagnosis codes; and intravenous diuretic administration. We validated the diagnostic accuracy of the components individually (n = 366) and combined in the HF algorithm (n = 150) compared with a blinded provider panel in 2 separate cohorts. We built an HF Dashboard within the electronic medical record characterizing the disease and medical therapies of HF admissions identified by the HF algorithm. We evaluated the HF Dashboard's performance over 26 months of clinical use. RESULTS Individually, the algorithm components displayed variable sensitivity and specificity, respectively: B-type natriuretic peptide >400 pg/mL (89% and 87%); diuretic (80% and 92%); and International Classification of Disease, Ninth Revision, code (56% and 95%). The HF algorithm achieved a high specificity (95%), positive predictive value (82%), and negative predictive value (85%) but achieved limited sensitivity (56%) secondary to missing provider-generated identification data. The HF Dashboard identified and characterized 3147 HF admissions over 26 months. CONCLUSIONS Automated identification and characterization systems can be developed and used with a substantial degree of specificity for the diagnosis of decompensated HF, although sensitivity is limited by clinical data input.
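The "3 of 4 identifiers" rule described above fits in a few lines. This is a sketch with invented field names; the deployed algorithm ran against live EHR data, not Python dictionaries:

```python
def identifier_flags(admission):
    """The four identifiers from the HF algorithm; dictionary keys
    are illustrative, not the medical center's actual schema."""
    return [
        admission.get("bnp_pg_ml", 0) > 400,            # BNP > 400 pg/mL
        admission.get("admitting_hf_diagnosis", False), # admitting HF diagnosis
        admission.get("hf_icd9_history", False),        # history of HF ICD-9 codes
        admission.get("iv_diuretic_given", False),      # IV diuretic administration
    ]

def is_decompensated_hf(admission, required=3):
    """Flag an admission when at least 3 of the 4 identifiers are present."""
    return sum(identifier_flags(admission)) >= required

case = {"bnp_pg_ml": 850, "admitting_hf_diagnosis": True, "iv_diuretic_given": True}
```

Requiring 3 of 4 signals rather than all 4 is what lets the algorithm tolerate one missing provider-generated data element per admission, the same data gap the authors cite as the main cause of lost sensitivity.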
Affiliation(s)
- Zachary L Cox: Department of Pharmacy Practice, Lipscomb University College of Pharmacy, Nashville, TN; Department of Pharmacy, Vanderbilt University Medical Center, Nashville, TN
- Connie M Lewis: Division of Cardiology, Vanderbilt University Medical Center, Nashville, TN
- Pikki Lai: Division of Cardiology, Vanderbilt University Medical Center, Nashville, TN
- Daniel J Lenihan: Division of Cardiology, Vanderbilt University Medical Center, Nashville, TN
234
Daniel C, Ouagne D, Sadou E, Paris N, Hussain S, Jaulent M, Kalra D. Cross border semantic interoperability for learning health systems: The EHR4CR semantic resources and services. Learn Health Syst 2017; 1:e10014. [PMID: 31245551] [PMCID: PMC6516724] [DOI: 10.1002/lrh2.10014]
Abstract
With the development of platforms enabling the integration and use of phenome, genome, and exposome data in the context of international research, data management challenges are increasing, and scalable solutions for cross border and cross domain semantic interoperability need to be developed. Reusing routinely collected clinical data, especially, requires computable portable phenotype algorithms running across different electronic health record (EHR) products and healthcare systems. We propose a framework for describing and comparing mediation platforms enabling cross border phenotype identification within federated EHRs. This framework was used to describe the experience gained during the EHR4CR project and the evaluation of the platform developed for accessing semantically equivalent data elements across 11 European participating EHR systems from 5 countries. Developers of semantic interoperability platforms are beginning to address a core set of requirements in order to reach the goal of developing cross border semantic integration of data.
Affiliation(s)
- Christel Daniel: Sorbonne Universités, UPMC Univ Paris 06, INSERM UMR_S 1142, LIMICS, F-75006 Paris, France; AP-HP, Paris, France
- David Ouagne: Sorbonne Universités, UPMC Univ Paris 06, INSERM UMR_S 1142, LIMICS, F-75006 Paris, France
- Eric Sadou: Sorbonne Universités, UPMC Univ Paris 06, INSERM UMR_S 1142, LIMICS, F-75006 Paris, France; AP-HP, Paris, France
- Sajjad Hussain: Sorbonne Universités, UPMC Univ Paris 06, INSERM UMR_S 1142, LIMICS, F-75006 Paris, France
235
Blecker S, Katz SD, Horwitz LI, Kuperman G, Park H, Gold A, Sontag D. Comparison of Approaches for Heart Failure Case Identification From Electronic Health Record Data. JAMA Cardiol 2016; 1:1014-1020. [PMID: 27706470] [DOI: 10.1001/jamacardio.2016.3236]
Abstract
Importance Accurate, real-time case identification is needed to target interventions to improve quality and outcomes for hospitalized patients with heart failure. Problem lists may be useful for case identification but are often inaccurate or incomplete. Machine-learning approaches may improve accuracy of identification but can be limited by complexity of implementation. Objective To develop algorithms that use readily available clinical data to identify patients with heart failure while in the hospital. Design, Setting, and Participants We performed a retrospective study of hospitalizations at an academic medical center. Hospitalizations for patients 18 years or older who were admitted after January 1, 2013, and discharged before February 28, 2015, were included. From a random 75% sample of hospitalizations, we developed 5 algorithms for heart failure identification using electronic health record data: (1) heart failure on problem list; (2) presence of at least 1 of 3 characteristics: heart failure on problem list, inpatient loop diuretic, or brain natriuretic peptide level of 500 pg/mL or higher; (3) logistic regression of 30 clinically relevant structured data elements; (4) machine-learning approach using unstructured notes; and (5) machine-learning approach using structured and unstructured data. Main Outcomes and Measures Heart failure diagnosis based on discharge diagnosis and physician review of sampled medical records. Results A total of 47 119 hospitalizations were included in this study (mean [SD] age, 60.9 [18.15] years; 23 952 female [50.8%], 5258 black/African American [11.2%], and 3667 Hispanic/Latino [7.8%] patients). Of these hospitalizations, 6549 (13.9%) had a discharge diagnosis of heart failure. Inclusion of heart failure on the problem list (algorithm 1) had a sensitivity of 0.40 and a positive predictive value (PPV) of 0.96 for heart failure identification. Algorithm 2 improved sensitivity to 0.77 at the expense of a PPV of 0.64. 
Algorithms 3, 4, and 5 had areas under the receiver operating characteristic curves of 0.953, 0.969, and 0.974, respectively. With a PPV of 0.9, these algorithms had associated sensitivities of 0.68, 0.77, and 0.83, respectively. Conclusions and Relevance The problem list is insufficient for real-time identification of hospitalized patients with heart failure. The high predictive accuracy of machine learning using free text demonstrates that support of such analytics in future electronic health record systems can improve cohort identification.
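The sensitivity/PPV trade-off reported for algorithms 1 and 2 follows directly from the confusion-matrix definitions. A small helper with toy counts is shown below; the counts are invented only to echo the reported figures, not the study's actual data:

```python
def sensitivity_ppv(tp, fp, fn):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    return tp / (tp + fn), tp / (tp + fp)

# Toy confusion counts: a strict rule (problem list only) is precise but
# misses cases; a looser any-of-three rule recovers cases at the cost of PPV.
strict = sensitivity_ppv(tp=400, fp=17, fn=600)   # roughly (0.40, 0.96)
loose = sensitivity_ppv(tp=770, fp=433, fn=230)   # roughly (0.77, 0.64)
```

Loosening the rule moves cases from FN to TP but also admits more FP, which is exactly the movement between algorithm 1 and algorithm 2 in the abstract.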
Affiliation(s)
- Saul Blecker: Department of Population Health and Department of Medicine, New York University School of Medicine, New York
- Stuart D Katz: Department of Medicine, New York University School of Medicine, New York
- Leora I Horwitz: Department of Population Health and Department of Medicine, New York University School of Medicine, New York
- Gilad Kuperman: Department of Information Systems, NewYork-Presbyterian Hospital, New York
- Hannah Park: Department of Population Health, New York University School of Medicine, New York
- Alex Gold: Department of Medicine, New York University School of Medicine, New York
- David Sontag: Department of Computer Science, New York University, New York
236
Vande Loo SJ, North F. Patient question set proliferation: scope and informatics challenges of patient question set management in a large multispecialty practice with case examples pertaining to tobacco use, menopause, and Urology and Orthopedics specialties. BMC Med Inform Decis Mak 2016; 16:41. [PMID: 27066892] [PMCID: PMC4828833] [DOI: 10.1186/s12911-016-0279-2]
Abstract
Background Health care institutions have patient question sets that can expand over time. For a multispecialty group, each specialty might have multiple question sets. As a result, question set governance can be challenging. Knowledge of the counts, variability and repetition of questions in a multispecialty practice can help institutions understand the challenges of question set proliferation. Methods We analyzed patient-facing question sets that were subject to institutional governance and those that were not. We examined question variability and the number of repetitious questions for a simulated episode of care. In addition to examining general patient question sets, we used specific examples of tobacco questions, questions from two specialty areas, and questions to menopausal women. Results In our analysis, there were approximately 269 institutionally governed patient question sets with a mean of 74 questions per set, accounting for an estimated 20,000 governed questions. Sampling from selected specialties revealed that 50% of patient question sets were not institutionally governed. We found over 650 tobacco-related questions in use, many with only slight variations. A simulated use case for a menopausal woman revealed potentially over 200 repeated questions. Conclusions A group practice with multiple specialties can have a large volume of patient questions that are not centrally developed, stored or governed. This results in a lack of standardization and coordination. Patients may be given multiple repeated questions throughout the course of their care, and providers lack standardized question sets to help construct valid patient phenotypes. Even with the implementation of a single electronic health record, medical practices may still have a health information management gap in the ability to create, store and share patient-generated health information that is meaningful to both patients and physicians.
237
Dendrou CA, McVean G, Fugger L. Neuroinflammation - using big data to inform clinical practice. Nat Rev Neurol 2016; 12:685-698. [PMID: 27857124] [DOI: 10.1038/nrneurol.2016.171]
Abstract
Neuroinflammation is emerging as a central process in many neurological conditions, either as a causative factor or as a secondary response to nervous system insult. Understanding the causes and consequences of neuroinflammation could, therefore, provide insight that is needed to improve therapeutic interventions across many diseases. However, the complexity of the pathways involved necessitates the use of high-throughput approaches to extensively interrogate the process, and appropriate strategies to translate the data generated into clinical benefit. Use of 'big data' aims to generate, integrate and analyse large, heterogeneous datasets to provide in-depth insights into complex processes, and has the potential to unravel the complexities of neuroinflammation. Limitations in data analysis approaches currently prevent the full potential of big data being reached, but some aspects of big data are already yielding results. The implementation of 'omics' analyses in particular is becoming routine practice in biomedical research, and neuroimaging is producing large sets of complex data. In this Review, we evaluate the impact of the drive to collect and analyse big data on our understanding of neuroinflammation in disease. We describe the breadth of big data that are leading to an evolution in our understanding of this field, exemplify how these data are beginning to be of use in a clinical setting, and consider possible future directions.
Affiliation(s)
- Calliope A Dendrou: Oxford Centre for Neuroinflammation, Nuffield Department of Clinical Neurosciences, and MRC Human Immunology Unit, Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DS, UK
- Gil McVean: Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, and Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
- Lars Fugger: Oxford Centre for Neuroinflammation, Nuffield Department of Clinical Neurosciences, and MRC Human Immunology Unit, Weatherall Institute of Molecular Medicine, John Radcliffe Hospital, University of Oxford, Oxford OX3 9DS, UK
238
Demner-Fushman D, Elhadad N. Aspiring to Unintended Consequences of Natural Language Processing: A Review of Recent Developments in Clinical and Consumer-Generated Text Processing. Yearb Med Inform 2016; 25:224-233. [PMID: 27830255] [PMCID: PMC5171557] [DOI: 10.15265/iy-2016-017]
Abstract
OBJECTIVES This paper reviews work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts. METHODS We included any application or methodological publication that leverages text to facilitate healthcare and address the health-related needs of consumers and populations. RESULTS Many important developments in clinical text processing, both foundational and task-oriented, were addressed in community- wide evaluations and discussed in corresponding special issues that are referenced in this review. These focused issues and in-depth reviews of several other active research areas, such as pharmacovigilance and summarization, allowed us to discuss in greater depth disease modeling and predictive analytics using clinical texts, and text analysis in social media for healthcare quality assessment, trends towards online interventions based on rapid analysis of health-related posts, and consumer health question answering, among other issues. CONCLUSIONS Our analysis shows that although clinical NLP continues to advance towards practical applications and more NLP methods are used in large-scale live health information applications, more needs to be done to make NLP use in clinical applications a routine widespread reality. Progress in clinical NLP is mirrored by developments in social media text analysis: the research is moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools.
Affiliation(s)
- Dina Demner-Fushman: National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
239
Kirby JC, Speltz P, Rasmussen LV, Basford M, Gottesman O, Peissig PL, Pacheco JA, Tromp G, Pathak J, Carrell DS, Ellis SB, Lingren T, Thompson WK, Savova G, Haines J, Roden DM, Harris PA, Denny JC. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc 2016; 23:1046-1052. [PMID: 27026615] [PMCID: PMC5070514] [DOI: 10.1093/jamia/ocv202]
Abstract
OBJECTIVE Health care generated data have become an important source for clinical and genomic research. Often, investigators create and iteratively refine phenotype algorithms to achieve high positive predictive values (PPVs) or sensitivity, thereby identifying valid cases and controls. These algorithms achieve the greatest utility when validated and shared by multiple health care systems. MATERIALS AND METHODS We report the current status and impact of the Phenotype KnowledgeBase (PheKB, http://phekb.org), an online environment supporting the workflow of building, sharing, and validating electronic phenotype algorithms. We analyze the most frequent components used in algorithms and their performance at authoring institutions and secondary implementation sites. RESULTS As of June 2015, PheKB contained 30 finalized phenotype algorithms and 62 algorithms in development spanning a range of traits and diseases. Phenotypes have had over 3500 unique views in a 6-month period and have been reused by other institutions. International Classification of Disease codes were the most frequently used component, followed by medications and natural language processing. Among algorithms with published performance data, the median PPV was nearly identical when evaluated at the authoring institutions (n = 44; case 96.0%, control 100%) compared to implementation sites (n = 40; case 97.5%, control 100%). DISCUSSION These results demonstrate that a broad range of algorithms to mine electronic health record data from different health systems can be developed with high PPV, and algorithms developed at one site are generally transportable to others. CONCLUSION By providing a central repository, PheKB enables improved development, transportability, and validity of algorithms for research-grade phenotypes using health care generated data.
Affiliation(s)
- Peter Speltz: Vanderbilt University Medical Center, Nashville, TN, USA
- Luke V Rasmussen: Northwestern University, Feinberg School of Medicine, Chicago, IL, USA
- Omri Gottesman: Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Todd Lingren: Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Will K Thompson: Northwestern University, Feinberg School of Medicine, Chicago, IL, USA
- Guergana Savova: Boston Children's Hospital and Harvard Medical School, Boston, MA, USA
- Dan M Roden: Vanderbilt University Medical Center, Nashville, TN, USA
- Paul A Harris: Vanderbilt University Medical Center, Nashville, TN, USA
- Joshua C Denny: Vanderbilt University Medical Center, Nashville, TN, USA
240
Chiaramello E, Pinciroli F, Bonalumi A, Caroli A, Tognola G. Use of "off-the-shelf" information extraction algorithms in clinical informatics: A feasibility study of MetaMap annotation of Italian medical notes. J Biomed Inform 2016; 63:22-32. [DOI: 10.1016/j.jbi.2016.07.017]
241
Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform 2016; 97:120-127. [PMID: 27919371] [DOI: 10.1016/j.ijmedinf.2016.09.014]
Abstract
OBJECTIVE To discover diverse genotype-phenotype associations affiliated with Type 2 Diabetes Mellitus (T2DM) via genome-wide association study (GWAS) and phenome-wide association study (PheWAS), more cases (T2DM subjects) and controls (subjects without T2DM) need to be identified, for example via Electronic Health Records (EHR). However, existing expert-based identification algorithms often suffer from low recall and can miss a large number of valuable samples under conservative filtering standards. The goal of this work is to develop a semi-automated framework based on machine learning, as a pilot study, to liberalize filtering criteria and improve recall while keeping the false positive rate low. MATERIALS AND METHODS We propose a data-informed framework for identifying subjects with and without T2DM from EHR via feature engineering and machine learning. We evaluate and contrast the identification performance of widely used machine learning models within our framework, including k-Nearest-Neighbors, Naïve Bayes, Decision Tree, Random Forest, Support Vector Machine, and Logistic Regression. Our framework was evaluated on 300 patient samples (161 cases, 60 controls, and 79 unconfirmed subjects), randomly selected from a diabetes-related cohort of 23,281 patients retrieved from a regional distributed EHR repository covering 2012 to 2014. RESULTS We apply top-performing machine learning algorithms on the engineered features. We benchmark the accuracy, precision, AUC, sensitivity, and specificity of classification models against the state-of-the-art expert algorithm for identification of T2DM subjects. Our results indicate that the framework achieved high identification performance (∼0.98 average AUC), much higher than the state-of-the-art expert algorithm (0.71 AUC). DISCUSSION Expert-algorithm-based identification of T2DM subjects from EHR is often hampered by high miss rates due to conservative selection criteria. Our framework leverages machine learning and feature engineering to loosen such selection criteria and achieve a high identification rate of cases and controls. CONCLUSIONS Our proposed framework demonstrates a more accurate and efficient approach for identifying subjects with and without T2DM from EHR.
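The contrast between a conservative expert filter and a looser, score-based selection can be sketched as follows. The field names, weights, and HbA1c cutoff below are placeholders for illustration, not the paper's learned models or actual criteria:

```python
def expert_rule(p):
    """Conservative expert-style filter: every criterion required.
    Field names and the 6.5% HbA1c cutoff are illustrative."""
    return p["t2dm_code"] and p["hba1c"] >= 6.5 and p["on_antidiabetic_med"]

def score_based(p, weights=(2.0, 1.5, 1.0), threshold=2.0):
    """Looser, data-driven-style selection: a weighted evidence score
    with a cutoff. The weights are invented, standing in for learned ones."""
    evidence = (p["t2dm_code"], p["hba1c"] >= 6.5, p["on_antidiabetic_med"])
    return sum(w for w, present in zip(weights, evidence) if present) >= threshold

# A patient with diabetic-range HbA1c and medication but no diagnosis code:
# the strict conjunctive rule misses them; the score-based screen recovers them.
patient = {"t2dm_code": False, "hba1c": 7.2, "on_antidiabetic_med": True}
```

This is the recall argument in miniature: requiring every criterion discards partially documented cases, while a score over multiple evidence sources keeps them without admitting patients who show no evidence at all.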
Affiliation(s)
- Tao Zheng: Institute of Image Communication and Networking, Shanghai Jiao Tong University, Shanghai, China; Tongren Hospital, Shanghai Jiao Tong University, Shanghai, China
- Wei Xie: Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, USA
- Liling Xu: Tongren Hospital, Shanghai Jiao Tong University, Shanghai, China
- Xiaoying He: Department of Endocrinology, the First Affiliated Hospital of Sun Yat-Sen University, Guangzhou, China
- Ya Zhang: Institute of Image Communication and Networking, Shanghai Jiao Tong University, Shanghai, China
- Mingrong You: Division of Epidemiology, Vanderbilt University, Nashville, TN, USA
- Gong Yang: Division of Epidemiology, Vanderbilt University, Nashville, TN, USA
- You Chen: Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
Collapse
|
242
|
Lingren T, Thaker V, Brady C, Namjou B, Kennebeck S, Bickel J, Patibandla N, Ni Y, Van Driest SL, Chen L, Roach A, Cobb B, Kirby J, Denny J, Bailey-Davis L, Williams MS, Marsolo K, Solti I, Holm IA, Harley J, Kohane IS, Savova G, Crimmins N. Developing an Algorithm to Detect Early Childhood Obesity in Two Tertiary Pediatric Medical Centers. Appl Clin Inform 2016; 7:693-706. [PMID: 27452794 DOI: 10.4338/aci-2016-01-ra-0015] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2016] [Accepted: 06/15/2016] [Indexed: 01/12/2023] Open
Abstract
OBJECTIVE The objective of this study is to develop an algorithm to accurately identify children with severe early onset childhood obesity (ages 1-5.99 years) using structured and unstructured data from the electronic health record (EHR). INTRODUCTION Childhood obesity increases risk factors for cardiovascular morbidity and vascular disease. Accurate definition of a high-precision phenotype through a standardized tool is critical to the success of large-scale genomic studies and to validating rare monogenic variants causing severe early onset obesity. DATA AND METHODS Rule-based and machine learning based algorithms were developed using structured and unstructured data from two EHR databases, from Boston Children's Hospital (BCH) and Cincinnati Children's Hospital and Medical Center (CCHMC). Exclusion criteria, including medications and comorbid diagnoses, were defined. Machine learning algorithms were developed using cross-site training and testing, in addition to experimenting with natural language processing features. RESULTS Precision was emphasized to obtain a high-fidelity cohort. The rule-based algorithm performed best overall, with precision of 0.895 (CCHMC) and 0.770 (BCH). The best feature set for machine learning employed Unified Medical Language System (UMLS) concept unique identifiers (CUIs), ICD-9 codes, and RxNorm codes. CONCLUSIONS Detecting severe early childhood obesity is essential given the intervention potential in children at the highest long-term risk of developing obesity-related comorbidities, and excluding patients with underlying pathological and non-syndromic causes of obesity assists in developing a high-precision cohort for genetic study. Further, such phenotyping efforts inform future practical applications in health care environments utilizing clinical decision support.
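A minimal sketch of the rule-based arm described above, with an invented percentile threshold and invented exclusion code lists (the paper's actual inclusion/exclusion rules are not reproduced here):

```python
# Sketch of a rule-based screen in the spirit of the abstract: flag children
# aged 1-5.99 years whose BMI percentile exceeds a severity cutoff, then apply
# exclusion criteria (comorbid diagnoses, medications). The cutoff and code
# lists below are illustrative placeholders, not the study's rules.
EXCLUDED_DIAGNOSES = {"253.8", "759.81"}   # placeholder syndromic-obesity codes
EXCLUDED_MEDS = {"prednisone"}             # placeholder weight-promoting drugs

def severe_early_obesity(patient, bmi_percentile_cutoff=99.0):
    """Return True if the record passes all inclusion/exclusion rules."""
    if not (1.0 <= patient["age_years"] < 6.0):
        return False                        # outside the age window
    if patient["bmi_percentile"] < bmi_percentile_cutoff:
        return False                        # not severe
    if EXCLUDED_DIAGNOSES & set(patient["icd9"]):
        return False                        # comorbid diagnosis exclusion
    if EXCLUDED_MEDS & set(patient["meds"]):
        return False                        # medication exclusion
    return True

cohort = [
    {"age_years": 3.2, "bmi_percentile": 99.5, "icd9": ["278.01"], "meds": []},
    {"age_years": 3.2, "bmi_percentile": 99.5, "icd9": ["759.81"], "meds": []},
    {"age_years": 8.0, "bmi_percentile": 99.9, "icd9": [], "meds": []},
]
flags = [severe_early_obesity(p) for p in cohort]  # → [True, False, False]
```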
Collapse
Affiliation(s)
- Todd Lingren
- Todd Lingren, Cincinnati Children's Hospital Medical Center, Biomedical Informatics, 3333 Burnet Avenue, MLC 7024 Cincinnati, OH 45229-3039, Phone: 513-803-9032, Fax: 513-636-2056,
| |
Collapse
|
243
|
Daniel C, Ouagne D, Sadou E, Forsberg K, Gilchrist MM, Zapletal E, Paris N, Hussain S, Jaulent MC, Kalra D. Cross border semantic interoperability for clinical research: the EHR4CR semantic resources and services. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2016; 2016:51-9. [PMID: 27570649 PMCID: PMC5001763] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 10/31/2022]
Abstract
With the development of platforms enabling the use of routinely collected clinical data in the context of international clinical research, scalable solutions for cross border semantic interoperability need to be developed. Within the context of the IMI EHR4CR project, we first defined the requirements and evaluation criteria of the EHR4CR semantic interoperability platform and then developed the semantic resources and supportive services and tooling to assist hospital sites in standardizing their data to allow the execution of the project use cases. The experience gained from the evaluation of the EHR4CR platform, accessing semantically equivalent data elements across 11 participating European EHR systems from 5 countries, demonstrated how far the mediation model and mapping efforts met the expected requirements of the project. Developers of semantic interoperability platforms are beginning to address a core set of requirements in order to reach the goal of developing cross border semantic integration of data.
Collapse
Affiliation(s)
- Christel Daniel
- INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France;; AP-HP, Paris, France
| | - David Ouagne
- INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France
| | - Eric Sadou
- INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France;; AP-HP, Paris, France
| | | | | | | | | | - Sajjad Hussain
- INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France
| | - Marie-Christine Jaulent
- INSERM, U1142, LIMICS, F-75006, Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006, Paris, France
| | | |
Collapse
|
244
|
Agarwal V, Podchiyska T, Banda JM, Goel V, Leung TI, Minty EP, Sweeney TE, Gyang E, Shah NH. Learning statistical models of phenotypes using noisy labeled training data. J Am Med Inform Assoc 2016; 23:1166-1173. [PMID: 27174893 DOI: 10.1093/jamia/ocw028] [Citation(s) in RCA: 83] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2015] [Revised: 11/08/2015] [Accepted: 12/12/2015] [Indexed: 01/29/2023] Open
Abstract
OBJECTIVE Traditionally, patient groups with a phenotype are selected through rule-based definitions whose creation and validation are time-consuming. Machine learning approaches to electronic phenotyping are limited by the paucity of labeled training datasets. We demonstrate the feasibility of utilizing semi-automatically labeled training sets to create phenotype models via machine learning, using a comprehensive representation of the patient medical record. METHODS We use a list of keywords specific to the phenotype of interest to generate noisy labeled training data. We train L1-penalized logistic regression models for a chronic and an acute disease and evaluate the performance of the models against a gold standard. RESULTS Our models for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.90, 0.89, and 0.86, 0.89, respectively. Local implementations of the previously validated rule-based definitions for Type 2 diabetes mellitus and myocardial infarction achieve precision and accuracy of 0.96, 0.92 and 0.84, 0.87, respectively. We have demonstrated the feasibility of learning phenotype models using imperfectly labeled data for a chronic and an acute phenotype. Further research in feature engineering and in specification of the keyword list can improve the performance of the models and the scalability of the approach. CONCLUSIONS Our method provides an alternative to manual labeling for creating training sets for statistical models of phenotypes. Such an approach can accelerate research with large observational healthcare datasets and may also be used to create local phenotype models.
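The keyword-based noisy labeling plus L1-penalized logistic regression pipeline can be sketched as follows; the notes and keyword list are invented for illustration and do not reflect the authors' actual lexicon:

```python
# Sketch: generate noisy labels from a keyword list, then fit an L1-penalized
# logistic regression on a bag-of-words representation, as the abstract
# describes. Notes and keywords here are toy stand-ins.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

notes = [
    "metformin started for type 2 diabetes, hba1c 8.1",
    "diabetes mellitus follow-up, insulin adjusted",
    "annual physical, no complaints",
    "knee pain after running, advised rest",
]
KEYWORDS = ("diabetes", "metformin", "insulin", "hba1c")

# Noisy labeling: a note is treated as positive if any keyword appears.
y_noisy = np.array([int(any(k in n for k in KEYWORDS)) for n in notes])

# L1 penalty drives uninformative word weights to zero.
X = CountVectorizer().fit_transform(notes)
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y_noisy)
preds = clf.predict(X)
```

In the paper the trained model is then evaluated against a manually reviewed gold standard rather than the noisy labels themselves.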
Collapse
Affiliation(s)
- Vibhu Agarwal
- Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA
| | - Tanya Podchiyska
- Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA
| | - Juan M Banda
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA
| | - Veena Goel
- Department of Pediatrics, Stanford University School of Medicine, Stanford CA 94305-5208, USA.,Department of Clinical Informatics, Stanford Children's Health, Stanford CA 94305-5474, USA
| | - Tiffany I Leung
- Division of General Medical Disciplines, Stanford University, Stanford CA 94305, USA
| | - Evan P Minty
- Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA.,Faculty of Medicine, University of Calgary, Calgary Alberta, T2N 4N1, Canada
| | - Timothy E Sweeney
- Biomedical Informatics Training Program, Stanford University, Stanford CA 94305-5479, USA.,Department of Surgery, Stanford Hospital & Clinics, Stanford CA 94305-2200, USA
| | - Elsie Gyang
- Division of Vascular Surgery, Stanford Hospital & Clinics, Stanford CA 94305-5642, USA
| | - Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford CA 94305-5479, USA
| |
Collapse
|
245
|
Mowery DL, Chapman BE, Conway M, South BR, Madden E, Keyhani S, Chapman WW. Extracting a stroke phenotype risk factor from Veteran Health Administration clinical reports: an information content analysis. J Biomed Semantics 2016; 7:26. [PMID: 27175226 PMCID: PMC4863379 DOI: 10.1186/s13326-016-0065-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2015] [Accepted: 04/19/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the United States, 795,000 people suffer strokes each year; 10-15 % of these strokes can be attributed to stenosis caused by plaque in the carotid artery, a major stroke phenotype risk factor. Studies comparing treatments for the management of asymptomatic carotid stenosis are challenging for at least two reasons: 1) administrative billing codes (i.e., Current Procedural Terminology (CPT) codes) that identify carotid images do not denote which neurovascular arteries are affected and 2) the majority of the image reports are negative for carotid stenosis. Studies that rely on manual chart abstraction can be labor-intensive, expensive, and time-consuming. Natural Language Processing (NLP) can expedite the process of manual chart abstraction by automatically filtering reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings, thus potentially reducing effort, costs, and time. METHODS In this pilot study, we conducted an information content analysis of carotid stenosis mentions in terms of their report location (Sections), report formats (structures) and linguistic descriptions (expressions) in Veteran Health Administration free-text reports. We assessed the ability of an NLP algorithm, pyConText, to discern reports with significant carotid stenosis findings from reports with no/insignificant carotid stenosis findings given these three document composition factors for two report types: radiology (RAD) and text integration utility (TIU) notes. RESULTS We observed that most carotid mentions are recorded in prose using categorical expressions, within the Findings and Impression sections for RAD reports and within neither of these designated sections for TIU notes. For RAD reports, pyConText performed with high sensitivity (88 %), specificity (84 %), and negative predictive value (95 %) and reasonable positive predictive value (70 %).
For TIU notes, pyConText performed with high specificity (87 %) and negative predictive value (92 %), reasonable sensitivity (73 %), and moderate positive predictive value (58 %). pyConText performed with the highest sensitivity when processing the full report rather than the Findings or Impression sections independently. CONCLUSION We conclude that pyConText can reduce chart review effort by filtering reports with no/insignificant carotid stenosis findings and flagging reports with significant carotid stenosis findings in the Veteran Health Administration electronic health record, and hence has utility for expediting a comparative effectiveness study of treatment strategies for stroke prevention.
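A toy illustration of the underlying idea (this is not pyConText's actual API): treat a stenosis mention as significant only if it carries a severity qualifier and is not negated. The regular expressions below are placeholders, far simpler than a real context algorithm:

```python
# Sketch: keyword + negation + severity filtering of radiology report text.
# Both pattern lists are illustrative, not a validated lexicon.
import re

SEVERITY = re.compile(r"\b(moderate|severe|significant|high[- ]grade)\b")
NEGATION = re.compile(r"\b(no|without|negative for)\b[^.]*\bstenosis\b")

def significant_stenosis(report):
    """True if the report mentions stenosis with a non-negated severity cue."""
    report = report.lower()
    if "stenosis" not in report:
        return False
    if NEGATION.search(report):     # e.g. "no significant stenosis"
        return False
    return bool(SEVERITY.search(report))

reports = [
    "IMPRESSION: severe stenosis of the right internal carotid artery.",
    "IMPRESSION: no significant stenosis in either carotid.",
    "FINDINGS: mild calcified plaque, 20% stenosis.",
]
flags = [significant_stenosis(r) for r in reports]  # → [True, False, False]
```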
Collapse
Affiliation(s)
- Danielle L. Mowery
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
- IDEAS Center, Veteran Affair Health Care System, Salt Lake City, UT USA
| | - Brian E. Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
- IDEAS Center, Veteran Affair Health Care System, Salt Lake City, UT USA
| | - Mike Conway
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
| | - Brett R. South
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
- IDEAS Center, Veteran Affair Health Care System, Salt Lake City, UT USA
| | - Erin Madden
- San Francisco Veteran Affair Health Care System, San Francisco, CA USA
| | - Salomeh Keyhani
- San Francisco Veteran Affair Health Care System, San Francisco, CA USA
| | - Wendy W. Chapman
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT USA
- IDEAS Center, Veteran Affair Health Care System, Salt Lake City, UT USA
| |
Collapse
|
246
|
Zhou SM, Fernandez-Gutierrez F, Kennedy J, Cooksey R, Atkinson M, Denaxas S, Siebert S, Dixon WG, O’Neill TW, Choy E, Sudlow C, Brophy S. Defining Disease Phenotypes in Primary Care Electronic Health Records by a Machine Learning Approach: A Case Study in Identifying Rheumatoid Arthritis. PLoS One 2016; 11:e0154515. [PMID: 27135409 PMCID: PMC4852928 DOI: 10.1371/journal.pone.0154515] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2016] [Accepted: 04/14/2016] [Indexed: 12/20/2022] Open
Abstract
OBJECTIVES 1) To use a data-driven method to examine clinical codes (risk factors) of a medical condition in primary care electronic health records (EHRs) that can accurately predict a diagnosis of the condition in secondary care EHRs. 2) To develop and validate a disease phenotyping algorithm for rheumatoid arthritis (RA) using primary care EHRs. METHODS This study linked routine primary and secondary care EHRs in Wales, UK. A machine learning based scheme was used to identify patients with rheumatoid arthritis from primary care EHRs via the following steps: i) selection of variables by comparing the relative frequencies of Read codes in the primary care dataset associated with disease cases versus non-disease controls (disease/non-disease based on the secondary care diagnosis); ii) reduction of predictors/associated variables using a Random Forest method; iii) induction of decision rules from a decision tree model. The proposed method was then extensively validated on an independent dataset and compared for performance with two existing deterministic algorithms for RA that had been developed using expert clinical knowledge. RESULTS Primary care EHRs were available for 2,238,360 patients over the age of 16, and of these 20,667 were also linked in the secondary care rheumatology clinical system. In the linked dataset, 900 predictors (out of a total of 43,100 variables) in the primary care record occurred more frequently in those with than in those without RA. These variables were reduced to 37 groups of related clinical codes, which were used to develop a decision tree model. The final algorithm identified 8 predictors related to diagnostic codes for RA, medication codes, such as those for disease modifying anti-rheumatic drugs, and absence of alternative diagnoses such as psoriatic arthritis. The proposed data-driven method performed as well as the expert clinical knowledge based methods.
CONCLUSION Data-driven schemes, such as ensemble machine learning methods, have the potential to identify the most informative predictors in a cost-effective and rapid way, and to accurately and reliably classify rheumatoid arthritis or other complex medical conditions in primary care EHRs.
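Steps ii) and iii) of the scheme above can be sketched with scikit-learn on synthetic stand-ins for Read-code frequency variables (an illustration, not the study's code; the feature names are invented):

```python
# Sketch: rank candidate code variables with a Random Forest, keep the top
# ones, then induce readable decision rules with a shallow decision tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for clinical-code frequency variables per patient.
X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                           random_state=0)

# Step ii): reduce predictors by Random Forest feature importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:8]

# Step iii): induce decision rules over the kept predictors.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, top], y)
rules = export_text(tree, feature_names=[f"code_group_{i}" for i in top])
```

The printable `rules` string is the analogue of the study's final set of decision rules over grouped clinical codes.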
Collapse
Affiliation(s)
- Shang-Ming Zhou
- Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
| | | | - Jonathan Kennedy
- Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
| | - Roxanne Cooksey
- Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
| | - Mark Atkinson
- Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
| | - Spiros Denaxas
- UCL Institute of Health Informatics and Farr Institute of Health Informatics Research, London, United Kingdom
| | - Stefan Siebert
- Institute of Infection, Immunity and Inflammation, University of Glasgow, Glasgow, United Kingdom
| | - William G. Dixon
- Arthritis Research UK Centre for Epidemiology, Institute of Inflammation and Repair, Faculty of Medical and Human Sciences, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
| | - Terence W. O’Neill
- Arthritis Research UK Centre for Epidemiology, Institute of Inflammation and Repair, Faculty of Medical and Human Sciences, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
| | - Ernest Choy
- Arthritis Research UK CREATE Centre and Welsh Arthritis Research Network, School of Medicine, Cardiff University, Cardiff, United Kingdom
| | - Cathie Sudlow
- Centre for Clinical Brain Sciences, University of Edinburgh, Edinburgh, United Kingdom
| | | | - Sinead Brophy
- Institute of Life Science, College of Medicine, Swansea University, Swansea, United Kingdom
| |
Collapse
|
247
|
Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc 2016; 23:731-40. [PMID: 27107443 PMCID: PMC4926745 DOI: 10.1093/jamia/ocw011] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Accepted: 01/16/2016] [Indexed: 12/18/2022] Open
Abstract
Background Electronic medical records (EMRs) hold a tremendous amount of information about patients that is relevant to determining the optimal approach to patient care. As medicine becomes increasingly precise, a patient’s electronic medical record phenotype will play an important role in triggering clinical decision support systems that can deliver personalized recommendations in real time. Learning with anchors presents a method of efficiently learning statistically driven phenotypes with minimal manual intervention. Materials and Methods We developed a phenotype library that uses both structured and unstructured data from the EMR to represent patients for real-time clinical decision support. Eight of the phenotypes were evaluated using retrospective EMR data on emergency department patients using a set of prospectively gathered gold standard labels. Results We built a phenotype library with 42 publicly available phenotype definitions. Using information from triage time, the phenotype classifiers have an area under the ROC curve (AUC) of infection 0.89, cancer 0.88, immunosuppressed 0.85, septic shock 0.93, nursing home 0.87, anticoagulated 0.83, cardiac etiology 0.89, and pneumonia 0.90. Using information available at the time of disposition from the emergency department, the AUC values are infection 0.91, cancer 0.95, immunosuppressed 0.90, septic shock 0.97, nursing home 0.91, anticoagulated 0.94, cardiac etiology 0.92, and pneumonia 0.97. Discussion The resulting phenotypes are interpretable and fast to build, and perform comparably to statistically learned phenotypes developed with 5000 manually labeled patients. Conclusion Learning with anchors is an attractive option for building a large public repository of phenotype definitions that can be used for a range of health IT applications, including real-time decision support.
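A simplified sketch of the anchor idea on simulated data (not the authors' implementation): an "anchor" observation marks some positives with high confidence, a classifier is trained against those anchor-derived labels with the anchor itself censored from the features, and patients the anchor missed still receive elevated scores:

```python
# Sketch: learning with anchors on simulated data. The latent phenotype is
# unobserved; the anchor fires only for a subset of true positives.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
latent = rng.random(n) < 0.3               # true (unobserved) phenotype
anchor = latent & (rng.random(n) < 0.6)    # anchor fires only for positives
features = np.column_stack([
    latent + rng.normal(0, 0.5, n),        # noisy correlate of the phenotype
    rng.normal(0, 1, n),                   # irrelevant feature
])

# Train against anchor-derived labels; the anchor is not in the features.
clf = LogisticRegression().fit(features, anchor.astype(int))
scores = clf.predict_proba(features)[:, 1]

# Positives the anchor missed should still score above true negatives.
missed = scores[latent & ~anchor].mean()
negatives = scores[~latent].mean()
```

This captures why the approach needs minimal manual intervention: only the anchor definition is hand-specified, not patient-level labels.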
Collapse
Affiliation(s)
- Yoni Halpern
- Department of Computer Science, New York University, New York, NY, USA
| | - Steven Horng
- Department of Emergency Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
| | - Youngduck Choi
- Department of Computer Science, New York University, New York, NY, USA
| | - David Sontag
- Department of Computer Science, New York University, New York, NY, USA
| |
Collapse
|
248
|
Murphy SN, Herrick C, Wang Y, Wang TD, Sack D, Andriole KP, Wei J, Reynolds N, Plesniak W, Rosen BR, Pieper S, Gollub RL. High throughput tools to access images from clinical archives for research. J Digit Imaging 2016; 28:194-204. [PMID: 25316195 PMCID: PMC4359193 DOI: 10.1007/s10278-014-9733-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
Historically, medical images collected in the course of clinical care have been difficult to access for secondary research studies. While there is tremendous potential value in the large volume of studies contained in clinical image archives, Picture Archiving and Communication Systems (PACS) are designed to optimize clinical operations and workflow. Search capabilities in PACS are basic, limiting their use for population studies, and duplication of archives for research is costly. To address this need, we augment the Informatics for Integrating Biology and the Bedside (i2b2) open source software, providing investigators with the tools necessary to query and integrate medical record and clinical research data. Over 100 healthcare institutions have installed this suite of software tools, which allows investigators to search medical record metadata, including images, for specific types of patients. In this report, we describe a new Medical Imaging Informatics Bench to Bedside (mi2b2) module (www.mi2b2.org), available now as an open source addition to the i2b2 software platform, that allows medical imaging examinations collected during routine clinical care to be made available to translational investigators directly from their institution's clinical PACS for research and educational use in compliance with the Health Insurance Portability and Accountability Act (HIPAA) Omnibus Rule. Access governance within the mi2b2 module is customizable per institution and PACS, minimizing impact on clinical systems. Currently in active use at our institutions, this new technology has already been used to facilitate access to thousands of clinical MRI brain studies representing specific patient phenotypes for use in research.
Collapse
Affiliation(s)
- Shawn N Murphy
- Research IS and Computing, Partners HealthCare, Charlestown, MA, 02129, USA,
| |
Collapse
|
249
|
Bates J, Fodeh SJ, Brandt CA, Womack JA. Classification of radiology reports for falls in an HIV study cohort. J Am Med Inform Assoc 2016; 23:e113-7. [PMID: 26567329 PMCID: PMC4954638 DOI: 10.1093/jamia/ocv155] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2015] [Revised: 08/14/2015] [Accepted: 09/08/2015] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE To identify patients in a human immunodeficiency virus (HIV) study cohort who have fallen by applying supervised machine learning methods to radiology reports of the cohort. METHODS We used the Veterans Aging Cohort Study Virtual Cohort (VACS-VC), an electronic health record-based cohort of 146 530 veterans for whom radiology reports were available (N=2 977 739). We created a reference standard of radiology reports, represented each report by a feature set of words and Unified Medical Language System concepts, and then developed several support vector machine (SVM) classifiers for falls. We compared mutual information (MI) ranking and embedded feature selection approaches. The SVM classifier with MI feature selection was chosen to classify all radiology reports in VACS-VC. RESULTS Our SVM classifier with MI feature selection achieved an area under the curve score of 97.04 on the test set. When applied to all the radiology reports in VACS-VC, 80 416 of these reports were classified as positive for a fall. Of these, 11 484 were associated with a fall-related external cause of injury code (E-code) and 68 932 were not, corresponding to 29 280 patients with potential fall-related injuries who could not have been found using E-codes. DISCUSSION Feature selection was crucial to improving the classifier's performance. Feature selection with MI allowed us to select the number of discriminative features to use for classification, in contrast to the embedded feature selection method, in which the number of features is chosen automatically. CONCLUSION Machine learning is an effective method of identifying patients who have suffered a fall. The development of this classifier supplements the clinical researcher's toolkit and reduces dependence on under-coded structured electronic health record data.
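The MI-ranked feature selection followed by an SVM can be sketched as below; the reports, labels, and the value of k are toy stand-ins for the study's data:

```python
# Sketch: rank word features by mutual information with the fall label, keep
# the top k, and train a linear SVM, mirroring the abstract's pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "patient fell from standing height, hip radiograph ordered",
    "fall down stairs, wrist fracture suspected",
    "routine chest x-ray, no acute findings",
    "ct abdomen for staging, unremarkable",
    "slipped and fell on ice, ankle films",
    "screening mammogram, bi-rads 1",
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = fall-related report

pipe = make_pipeline(
    CountVectorizer(),
    SelectKBest(mutual_info_classif, k=5),   # MI-ranked feature selection
    LinearSVC(),
)
pipe.fit(reports, labels)
preds = pipe.predict(reports).tolist()
```

Choosing k explicitly is the property the discussion highlights: MI ranking lets the analyst set how many discriminative features the SVM sees, unlike embedded selection.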
Collapse
Affiliation(s)
- Jonathan Bates
- Yale School of Medicine, New Haven, CT VA Connecticut Healthcare System, West Haven, CT
| | | | - Cynthia A Brandt
- Yale School of Medicine, New Haven, CT VA Connecticut Healthcare System, West Haven, CT
| | - Julie A Womack
- Yale School of Nursing, West Haven, CT VA Connecticut Healthcare System, West Haven, CT
| |
Collapse
|
250
|
Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nat Rev Cardiol 2016; 13:350-9. [PMID: 27009423 DOI: 10.1038/nrcardio.2016.42] [Citation(s) in RCA: 177] [Impact Index Per Article: 22.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
The potential for big data analytics to improve cardiovascular quality of care and patient outcomes is tremendous. However, the application of big data in health care is at a nascent stage, and the evidence to date demonstrating that big data analytics will improve care and outcomes is scant. This Review provides an overview of the data sources and methods that comprise big data analytics, and describes eight areas of application of big data analytics to improve cardiovascular care, including predictive modelling for risk and resource use, population management, drug and medical device safety surveillance, disease and treatment heterogeneity, precision medicine and clinical decision support, quality of care and performance measurement, and public health and research applications. We also delineate the important challenges for big data applications in cardiovascular care, including the need for evidence of effectiveness and safety, methodological issues such as data quality and validation, and the critical importance of clinical integration and proof of clinical utility. If big data analytics are shown to improve quality of care and patient outcomes, and can be successfully implemented in cardiovascular practice, big data will fulfil its potential as an important component of a learning health-care system.
Collapse
Affiliation(s)
- John S Rumsfeld
- University of Colorado School of Medicine, 13001 East 17th Place, Aurora, Colorado 80045, USA.,VA Eastern Colorado Health System, Cardiology (111B), 1055 Clermont Street, Denver, Colorado 80220, USA
| | - Karen E Joynt
- Brigham and Women's Hospital, 75 Francis Street, Boston, Massachusetts 02115, USA.,Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, USA
| | - Thomas M Maddox
- University of Colorado School of Medicine, 13001 East 17th Place, Aurora, Colorado 80045, USA.,VA Eastern Colorado Health System, Cardiology (111B), 1055 Clermont Street, Denver, Colorado 80220, USA
| |
Collapse
|