1
Bornet A, Proios D, Yazdani A, Jaume-Santero F, Haller G, Choi E, Teodoro D. Comparing neural language models for medical concept representation and patient trajectory prediction. Artif Intell Med 2025; 163:103108. [PMID: 40086407 DOI: 10.1016/j.artmed.2025.103108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 01/22/2024] [Accepted: 03/09/2025] [Indexed: 03/16/2025]
Abstract
Effective representation of medical concepts is crucial for secondary analyses of electronic health records. Neural language models have shown promise in automatically deriving medical concept representations from clinical data. However, the comparative performance of different language models for creating these empirical representations, and the extent to which they encode medical semantics, has not been extensively studied. This study aims to address this gap by evaluating the effectiveness of three popular language models - word2vec, fastText, and GloVe - in creating medical concept embeddings that capture their semantic meaning. By using a large dataset of digital health records, we created patient trajectories and used them to train the language models. We then assessed the ability of the learned embeddings to encode semantics through an explicit comparison with biomedical terminologies, and implicitly by predicting patient outcomes and trajectories with different levels of available information. Our qualitative analysis shows that empirical clusters of embeddings learned by fastText exhibit the highest similarity with theoretical clustering patterns obtained from biomedical terminologies, with a similarity score between empirical and theoretical clusters of 0.88, 0.80, and 0.92 for diagnosis, procedure, and medication codes, respectively. Conversely, for outcome prediction, word2vec and GloVe tend to outperform fastText, with the former achieving AUROC as high as 0.78, 0.62, and 0.85 for length-of-stay, readmission, and mortality prediction, respectively. In predicting medical codes in patient trajectories, GloVe achieves the highest performance for diagnosis and medication codes (AUPRC of 0.45 and of 0.81, respectively) at the highest level of the semantic hierarchy, while fastText outperforms the other models for procedure codes (AUPRC of 0.66). 
Our study demonstrates that subword information is crucial for learning medical concept representations, but global embedding vectors are better suited for more high-level downstream tasks, such as trajectory prediction. Thus, these models can be harnessed to learn representations that convey clinical meaning, and our insights highlight the potential of using machine learning techniques to semantically encode medical data.
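The role of subword information mentioned in this abstract can be illustrated with a minimal sketch of fastText-style character n-grams. It assumes the usual fastText convention of wrapping a token in `<` and `>` and extracting n-grams of length 3 to 6. The codes used here are illustrative ICD-10-style strings, not data from the study, and `char_ngrams` and `jaccard` are hypothetical helper names.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Return the set of character n-grams of a token, fastText-style."""
    wrapped = f"<{token}>"  # boundary markers separate prefixes from suffixes
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard overlap between two n-gram sets."""
    return len(a & b) / len(a | b)


# Illustrative codes only (assumption): two sibling codes and one unrelated code.
heart_failure_a = char_ngrams("I50.1")
heart_failure_b = char_ngrams("I50.9")
fracture = char_ngrams("S72.0")

sim_siblings = jaccard(heart_failure_a, heart_failure_b)
sim_unrelated = jaccard(heart_failure_a, fracture)
print(sim_siblings, sim_unrelated)
```

Codes that share a hierarchical prefix share n-grams before any training takes place, which is one plausible reason a subword model would recover terminology structure more readily than a model that treats each code as an opaque token.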
Affiliation(s)
- Alban Bornet
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Dimitrios Proios
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland; Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Anthony Yazdani
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland; Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, Geneva, Switzerland
- Fernando Jaume-Santero
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Guy Haller
  - Department of Acute Care Medicine, Division of Anaesthesiology, Geneva University Hospitals, Switzerland; Department of Epidemiology and Preventive Medicine, Health Services Management and Research Unit, Monash University, Melbourne, Victoria, Australia
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
2
O'Neil ST, Madlock-Brown C, Wilkins KJ, McGrath BM, Davis HE, Assaf GS, Wei H, Zareie P, French ET, Loomba J, McMurry JA, Zhou A, Chute CG, Moffitt RA, Pfaff ER, Yoo YJ, Leese P, Chew RF, Lieberman M, Haendel MA. Finding Long-COVID: temporal topic modeling of electronic health records from the N3C and RECOVER programs. NPJ Digit Med 2024; 7:296. [PMID: 39433942 PMCID: PMC11494196 DOI: 10.1038/s41746-024-01286-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Accepted: 10/07/2024] [Indexed: 10/23/2024]
Abstract
Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex and varied outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes strongly increased after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.
Affiliation(s)
- Shawn T O'Neil
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
- Charisse Madlock-Brown
  - Health Informatics and Information Management Program, University of Tennessee Health Science Center, Memphis, TN, USA
- Kenneth J Wilkins
  - Biostatistics Program, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Hannah E Davis
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Gina S Assaf
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Hannah Wei
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Parya Zareie
  - University of California Davis Health, Davis, CA, USA
- Evan T French
  - Wright Center for Clinical and Translational Research, Virginia Commonwealth University, Richmond, VA, USA
- Johanna Loomba
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Julie A McMurry
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
- Andrea Zhou
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Christopher G Chute
  - Schools of Medicine, Public Health and Nursing, Johns Hopkins University, Baltimore, MD, USA
- Richard A Moffitt
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Emily R Pfaff
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Yun Jae Yoo
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Peter Leese
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Robert F Chew
  - Center for Data Science and AI, RTI International, Research Triangle Park, Durham, NC, USA
- Michael Lieberman
  - OCHIN, Inc, Portland, OR, USA
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA
- Melissa A Haendel
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
3
Zhang Y, Zhang Y, Wang H. Patient Subtyping via Learning Hidden Markov Models from Pairwise Co-occurrences in EHR Data. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024; 2024:1-4. [PMID: 40031524 DOI: 10.1109/embc53108.2024.10781987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Patient subtyping is an effective tool to investigate clinically relevant characteristics of medical services, from which useful insights can be drawn for categorizing patients' conditions and informing the understanding of disease progression. Here we propose a hidden Markov model (HMM)-based approach to extracting patient subtypes from electronic health records (EHR). Using real-world clinical data, we show that the HMM-based model can effectively identify the latent Markovian structure underlying the EHR data and derive clinically plausible patient subtypes, which can be used to categorize patients with various conditions.
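To make the notion of a latent Markovian structure concrete, here is a minimal sketch of the standard forward algorithm, which scores a sequence of coded visits under a discrete HMM. It is not the learning procedure from pairwise co-occurrences described in the abstract, and every parameter value below is invented for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """Probability of an observation sequence under a discrete HMM."""
    n_states = len(start)
    # Initialisation: joint probability of each hidden state and the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Recursion: propagate through the transition matrix, then weight by the emission.
    for o in obs[1:]:
        alpha = [
            emit[s][o] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model: state 0 = "stable", state 1 = "deteriorating".
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
# Each row is a distribution over three hypothetical observed code groups.
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

visits = [0, 1, 2]  # one patient's sequence of observed code groups
likelihood = forward_likelihood(visits, start, trans, emit)
print(likelihood)
```

In a subtyping setting, one could fit several such models and assign each patient to the model under which their sequence is most likely. That assignment rule is an assumption of this sketch rather than a detail taken from the paper.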
4
O'Neil ST, Madlock-Brown C, Wilkins KJ, McGrath BM, Davis HE, Assaf GS, Wei H, Zareie P, French ET, Loomba J, McMurry JA, Zhou A, Chute CG, Moffitt RA, Pfaff ER, Yoo YJ, Leese P, Chew RF, Lieberman M, Haendel MA. Finding Long-COVID: Temporal Topic Modeling of Electronic Health Records from the N3C and RECOVER Programs. medRxiv 2024:2023.09.11.23295259. [PMID: 38947087 PMCID: PMC11213052 DOI: 10.1101/2023.09.11.23295259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex and varied outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes strongly increased after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.
Affiliation(s)
- Shawn T O'Neil
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Charisse Madlock-Brown
  - Health Informatics and Information Management Program, University of Tennessee Health Science Center, Memphis, TN, USA
- Kenneth J Wilkins
  - Biostatistics Program, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Parya Zareie
  - University of California Davis Health, Sacramento, CA, USA
- Evan T French
  - Wright Center for Clinical and Translational Research, Virginia Commonwealth University, Richmond, VA, USA
- Johanna Loomba
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Julie A McMurry
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Andrea Zhou
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Christopher G Chute
  - Schools of Medicine, Public Health, and Nursing, Johns Hopkins University, Baltimore, MD, USA
- Richard A Moffitt
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Emily R Pfaff
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Yun Jae Yoo
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Peter Leese
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Robert F Chew
  - Center for Data Science and AI, RTI International, Research Triangle Park, NC, USA
- Michael Lieberman
  - OCHIN, Inc., Portland, OR, USA
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA
- Melissa A Haendel
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
5
Li Y, Yang AY, Marelli A, Li Y. MixEHR-SurG: A joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records. J Biomed Inform 2024; 153:104638. [PMID: 38631461 DOI: 10.1016/j.jbi.2024.104638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 03/07/2024] [Accepted: 04/03/2024] [Indexed: 04/19/2024]
Abstract
Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as mortality or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high-dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are threefold: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data, consisting of 8211 subjects with 75,187 outpatient claim records of 1767 unique ICD codes; and the MIMIC-III dataset, consisting of 1458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.
Together, the integration of the Cox proportional hazards model and EHR topic inference in MixEHR-SurG not only leads to competitive mortality prediction but also meaningful phenotype topics for in-depth survival analysis. The software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-SurG.
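The Cox proportional hazards likelihood mentioned in this abstract can be sketched as follows. This is the textbook Cox partial log-likelihood with per-patient topic proportions used as covariates. It is not the joint MixEHR-SurG objective, it assumes no tied event times, and the three-patient dataset is invented.

```python
import math


def cox_partial_log_likelihood(times, events, covariates, beta):
    """Cox partial log-likelihood, assuming no tied event times."""
    # Linear predictor (log relative hazard) for each patient.
    risk = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        at_risk = [math.exp(risk[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        total += risk[i] - math.log(sum(at_risk))
    return total


# Hypothetical data: follow-up time, death indicator, and two topic proportions.
times = [2.0, 5.0, 7.0]
events = [1, 0, 1]
topics = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]

ll_null = cox_partial_log_likelihood(times, events, topics, [0.0, 0.0])
ll_harmful = cox_partial_log_likelihood(times, events, topics, [1.0, 0.0])
ll_protective = cox_partial_log_likelihood(times, events, topics, [-1.0, 0.0])
print(ll_null, ll_harmful, ll_protective)
```

In this toy dataset the patient with the largest share of the first topic dies earliest, so a positive coefficient on that topic fits better than a zero or negative one. A coefficient read this way is what makes a topic interpretable as mortality-associated.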
Affiliation(s)
- Yixuan Li
  - Department of Mathematics and Statistics, McGill University, Montreal, Canada; Mila - Quebec AI institute, Montreal, Canada
- Archer Y Yang
  - Department of Mathematics and Statistics, McGill University, Montreal, Canada; Mila - Quebec AI institute, Montreal, Canada; School of Computer Science, McGill University, Montreal, Canada
- Ariane Marelli
  - McGill Adult Unit for Congenital Heart Disease (MAUDE Unit), McGill University of Health Centre, Montreal, Canada
- Yue Li
  - Mila - Quebec AI institute, Montreal, Canada; School of Computer Science, McGill University, Montreal, Canada
6
Chiu CC, Wu CM, Chien TN, Kao LJ, Li C, Chu CM. Integrating Structured and Unstructured EHR Data for Predicting Mortality by Machine Learning and Latent Dirichlet Allocation Method. International Journal of Environmental Research and Public Health 2023; 20:4340. [PMID: 36901354 PMCID: PMC10001457 DOI: 10.3390/ijerph20054340] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/16/2023] [Revised: 02/22/2023] [Accepted: 02/24/2023] [Indexed: 06/18/2023]
Abstract
An ICU is a critical care unit that provides advanced medical support and continuous monitoring for patients with severe illnesses or injuries. Predicting the mortality rate of ICU patients can not only improve patient outcomes, but also optimize resource allocation. Many studies have attempted to create scoring systems and models that predict the mortality of ICU patients using large amounts of structured clinical data. However, unstructured clinical data recorded during patient admission, such as notes made by physicians, is often overlooked. This study used the MIMIC-III database to predict mortality in ICU patients. In the first part of the study, only eight structured variables were used, including the six basic vital signs, the GCS, and the patient's age at admission. In the second part, unstructured predictor variables were extracted from the initial diagnosis made by physicians when the patients were admitted to the hospital and analyzed using Latent Dirichlet Allocation techniques. The structured and unstructured data were combined using machine learning methods to create a mortality risk prediction model for ICU patients. The results showed that combining structured and unstructured data improved the accuracy of the prediction of clinical outcomes in ICU patients over time. The model achieved an AUROC of 0.88, indicating accurate prediction of patient vital status. Additionally, the model was able to predict patient clinical outcomes over time, successfully identifying important variables. This study demonstrated that a small number of easily collectible structured variables, combined with unstructured data and analyzed using LDA topic modeling, can significantly improve the predictive performance of a mortality risk prediction model for ICU patients. These results suggest that initial clinical observations and diagnoses of ICU patients contain valuable information that can aid ICU medical and nursing staff in making important clinical decisions.
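For readers unfamiliar with the AUROC figure quoted above, the sketch below computes it directly from its probabilistic definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made up and do not reproduce any result from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # Each (positive, negative) pair scores 1 for a win, 0.5 for a tie, 0 for a loss.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical mortality labels (1 = died) and predicted risks.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.4]
value = auroc(labels, scores)
print(value)
```

The pairwise form is quadratic in the number of patients, which is acceptable for a small illustration. A rank-based computation would be the usual choice for large cohorts.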
Affiliation(s)
- Chih-Chou Chiu
  - Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chung-Min Wu
  - Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Te-Nien Chien
  - College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Ling-Jing Kao
  - Department of Business Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chengcheng Li
  - College of Management, National Taipei University of Technology, Taipei 106, Taiwan
- Chuan-Mei Chu
  - College of Management, National Taipei University of Technology, Taipei 106, Taiwan
7
Woodward AA, Urbanowicz RJ, Naj AC, Moore JH. Genetic heterogeneity: Challenges, impacts, and methods through an associative lens. Genet Epidemiol 2022; 46:555-571. [PMID: 35924480 PMCID: PMC9669229 DOI: 10.1002/gepi.22497] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/18/2022] [Revised: 07/06/2022] [Accepted: 07/19/2022] [Indexed: 01/07/2023]
Abstract
Genetic heterogeneity describes the occurrence of the same or similar phenotypes through different genetic mechanisms in different individuals. Robustly characterizing and accounting for genetic heterogeneity is crucial to pursuing the goals of precision medicine, for discovering novel disease biomarkers, and for identifying targets for treatments. Failure to account for genetic heterogeneity may lead to missed associations and incorrect inferences. Thus, it is critical to review the impact of genetic heterogeneity on the design and analysis of population level genetic studies, aspects that are often overlooked in the literature. In this review, we first contextualize our approach to genetic heterogeneity by proposing a high-level categorization of heterogeneity into "feature," "outcome," and "associative" heterogeneity, drawing on perspectives from epidemiology and machine learning to illustrate distinctions between them. We highlight the unique nature of genetic heterogeneity as a heterogeneous pattern of association that warrants specific methodological considerations. We then focus on the challenges that preclude effective detection and characterization of genetic heterogeneity across a variety of epidemiological contexts. Finally, we discuss systems heterogeneity as an integrated approach to using genetic and other high-dimensional multi-omic data in complex disease research.
Affiliation(s)
- Alexa A. Woodward
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Ryan J. Urbanowicz
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
- Adam C. Naj
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jason H. Moore
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
8
Kline A, Wang H, Li Y, Dennis S, Hutch M, Xu Z, Wang F, Cheng F, Luo Y. Multimodal machine learning in precision health: A scoping review. NPJ Digit Med 2022; 5:171. [PMID: 36344814 PMCID: PMC9640667 DOI: 10.1038/s41746-022-00712-8] [Citation(s) in RCA: 111] [Impact Index Per Article: 37.0] [Received: 06/20/2022] [Accepted: 10/14/2022] [Indexed: 11/09/2022]
Abstract
Machine learning is frequently leveraged to tackle problems in the health sector, including clinical decision support. Its use has historically been focused on single-modal data. Attempts to improve prediction and mimic the multimodal nature of clinical expert decision-making have been met in the biomedical field of machine learning by fusing disparate data. This review was conducted to summarize the current studies in this field and identify topics ripe for future research. We conducted this review in accordance with the PRISMA extension for Scoping Reviews to characterize multi-modal data fusion in health. Search strings were established and used in the databases PubMed, Google Scholar, and IEEEXplore from 2011 to 2021. A final set of 128 articles was included in the analysis. The most common health areas utilizing multi-modal methods were neurology and oncology. Early fusion was the most common data merging strategy. Notably, there was an improvement in predictive performance when using data fusion. Lacking from the papers were clear clinical deployment strategies, FDA approval, and analysis of how using multimodal approaches from diverse sub-populations may reduce biases and healthcare disparities. These findings provide a summary of multimodal data fusion as applied to health diagnosis/prognosis problems. Few papers compared the outputs of a multimodal approach with a unimodal prediction. However, those that did achieved an average increase of 6.4% in predictive accuracy. Multi-modal machine learning, while more robust in its estimations than unimodal methods, has drawbacks in its scalability and the time-consuming nature of information concatenation.
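The fusion strategies named in this abstract can be contrasted with a small sketch. Early fusion is commonly understood as concatenating per-modality feature vectors before a single model is fitted, and late fusion as combining the outputs of per-modality models. The function names, feature values, and equal weighting below are assumptions made for illustration.

```python
def early_fusion(structured, text_topics):
    """Concatenate per-patient feature vectors from two modalities."""
    return [s + t for s, t in zip(structured, text_topics)]


def late_fusion(prob_a, prob_b, weight_a=0.5):
    """Weighted average of per-modality predicted probabilities."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(prob_a, prob_b)]


# Hypothetical inputs for two patients.
structured = [[72.0, 36.8], [110.0, 39.1]]  # heart rate, temperature
text_topics = [[0.9, 0.1], [0.2, 0.8]]      # topic proportions from clinical notes

fused = early_fusion(structured, text_topics)   # one wider feature matrix
blended = late_fusion([0.2, 0.9], [0.4, 0.7])   # one probability per patient
print(fused)
print(blended)
```

Early fusion lets a single model learn interactions across modalities, at the cost of a wider feature space. Late fusion keeps the modality-specific models independent, which can be simpler to maintain when one modality is missing for some patients.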
Affiliation(s)
- Adrienne Kline
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Hanyin Wang
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Yikuan Li
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Saya Dennis
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Meghan Hutch
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
- Zhenxing Xu
  - Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Fei Wang
  - Department of Population Health Sciences, Cornell University, New York, 10065, NY, USA
- Feixiong Cheng
  - Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, 44195, OH, USA
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University, Chicago, 60201, IL, USA
9
Liu F, Demosthenes P. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Med Res Methodol 2022; 22:287. [PMID: 36335315 PMCID: PMC9636688 DOI: 10.1186/s12874-022-01768-6] [Citation(s) in RCA: 141] [Impact Index Per Article: 47.0] [Received: 04/08/2022] [Accepted: 10/22/2022] [Indexed: 11/07/2022]
Abstract
Background
The increased adoption of the internet, social media, wearable devices, e-health services, and other technology-driven services in medicine and healthcare has led to the rapid generation of various types of digital data, providing a valuable data source beyond the confines of traditional clinical trials, epidemiological studies, and lab-based experiments.
Methods
We provide a brief overview of the types and sources of real-world data (RWD) and the common models and approaches used to analyze them. We discuss the challenges and opportunities of using real-world data for evidence-based decision making. This review does not aim to be comprehensive or to cover all aspects of RWD (from both the research and practical perspectives) but serves as a primer and provides useful sources for readers who are interested in this topic.
Results and Conclusions
Real-world data hold great potential for generating real-world evidence for designing and conducting confirmatory trials and answering questions that may not be addressed otherwise. The volume and complexity of real-world data also call for the development of more appropriate, sophisticated, and innovative data processing and analysis techniques while maintaining scientific rigor in research findings, and for attention to data ethics, to harness the power of real-world data.
10
Ahuja Y, Wen J, Hong C, Xia Z, Huang S, Cai T. A semi-supervised adaptive Markov Gaussian embedding process (SAMGEP) for prediction of phenotype event times using the electronic health record. Sci Rep 2022; 12:17737. [PMID: 36273240 PMCID: PMC9588081 DOI: 10.1038/s41598-022-22585-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/27/2021] [Accepted: 10/17/2022] [Indexed: 01/18/2023]
Abstract
While there exist numerous methods to identify binary phenotypes (e.g., COPD) using electronic health record (EHR) data, few exist to ascertain the timings of phenotype events (e.g., COPD onset or exacerbations). Estimating event times could enable more powerful use of EHR data for longitudinal risk modeling, including survival analysis. Here we introduce Semi-supervised Adaptive Markov Gaussian Embedding Process (SAMGEP), a semi-supervised machine learning algorithm to estimate phenotype event times using EHR data with limited observed labels, which require resource-intensive chart review to obtain. SAMGEP models latent phenotype states as a binary Markov process, and it employs an adaptive weighting strategy to map timestamped EHR features to an embedding function that it models as a state-dependent Gaussian process. SAMGEP's feature weighting achieves meaningful feature selection, and its predictions significantly improve AUCs and F1 scores over existing approaches in diverse simulations and real-world settings. It is particularly adept at predicting cumulative risk and event counting process functions, and is robust to diverse generative model parameters. Moreover, it achieves high accuracy with few (50-100) labels, efficiently leveraging unlabeled EHR data to maximize information gain from costly-to-obtain event time labels. SAMGEP can be used to estimate accurate phenotype state functions for risk modeling research.
Affiliation(s)
- Yuri Ahuja
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA; Harvard Medical School, Boston, MA, USA; Department of Medicine, NYU Langone Health, New York, NY, USA
- Jun Wen
  - Harvard Medical School, Boston, MA, USA
- Chuan Hong
  - Harvard Medical School, Boston, MA, USA
- Zongqi Xia
  - Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
- Sicong Huang
  - Harvard Medical School, Boston, MA, USA; Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA; Harvard Medical School, Boston, MA, USA; VA Boston Healthcare System, Boston, MA, USA
11
Kaplan AD, Greene JD, Liu VX, Ray P. Unsupervised probabilistic models for sequential Electronic Health Records. J Biomed Inform 2022; 134:104163. [PMID: 36038064 PMCID: PMC10588733 DOI: 10.1016/j.jbi.2022.104163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Revised: 06/23/2022] [Accepted: 08/11/2022] [Indexed: 11/18/2022]
Abstract
We develop an unsupervised probabilistic model for heterogeneous Electronic Health Record (EHR) data. Utilizing a mixture model formulation, our approach directly models sequences of arbitrary length, such as medications and laboratory results. This allows for subgrouping and incorporation of the dynamics underlying heterogeneous data types. The model consists of a layered set of latent variables that encode underlying structure in the data. These variables represent subject subgroups at the top layer, and unobserved states for sequences in the second layer. We train this model on episodic data from subjects receiving medical care in the Kaiser Permanente Northern California integrated healthcare delivery system. The resulting properties of the trained model generate novel insight from these complex and multifaceted data. In addition, we show how the model can be used to analyze sequences that contribute to assessment of mortality likelihood.
Affiliation(s)
- Alan D Kaplan
  - Computational Engineering Division, Lawrence Livermore National Laboratory, 7000 East Ave., Livermore, CA 94550, United States of America
- John D Greene
  - Kaiser Permanente Division of Research, 2000 Broadway, Oakland, CA 94612, United States of America
- Vincent X Liu
  - Kaiser Permanente Division of Research, 2000 Broadway, Oakland, CA 94612, United States of America
- Priyadip Ray
  - Computational Engineering Division, Lawrence Livermore National Laboratory, 7000 East Ave., Livermore, CA 94550, United States of America
| |
|
12
|
Ahuja Y, Zou Y, Verma A, Buckeridge D, Li Y. MixEHR-Guided: A guided multi-modal topic modeling approach for large-scale automatic phenotyping using the electronic health record. J Biomed Inform 2022; 134:104190. [PMID: 36058522 DOI: 10.1016/j.jbi.2022.104190] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/06/2022] [Revised: 08/27/2022] [Accepted: 08/28/2022] [Indexed: 01/18/2023]
Abstract
Electronic Health Records (EHRs) contain rich clinical data collected at the point of care, and their increasing adoption offers exciting opportunities for clinical informatics, disease risk prediction, and personalized treatment recommendation. However, effective use of EHR data for research and clinical decision support is often hampered by a lack of reliable disease labels. To compile gold-standard labels, researchers often rely on clinical experts to develop rule-based phenotyping algorithms from billing codes and other surrogate features. This process is tedious and error-prone due to recall and observer biases in how codes and measures are selected, and some phenotypes are incompletely captured by a handful of surrogate features. To address this challenge, we present a novel automatic phenotyping model called MixEHR-Guided (MixEHR-G), a multimodal hierarchical Bayesian topic model that efficiently models the EHR generative process by identifying latent phenotype structure in the data. Unlike existing topic modeling algorithms wherein the inferred topics are not identifiable, MixEHR-G uses prior information from informative surrogate features to align topics with known phenotypes. We applied MixEHR-G to an openly available EHR dataset of 38,597 intensive care patients (MIMIC-III) in Boston, USA and to administrative claims data for a population-based cohort (PopHR) of 1.3 million people in Quebec, Canada. Qualitatively, we demonstrate that MixEHR-G learns interpretable phenotypes and yields meaningful insights about phenotype similarities, comorbidities, and epidemiological associations. Quantitatively, MixEHR-G outperforms existing unsupervised phenotyping methods on a phenotype label annotation task, and it can accurately estimate relative phenotype prevalence functions without gold-standard phenotype information. Altogether, MixEHR-G is an important step towards building an interpretable and automated phenotyping system using EHR data.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA; Harvard Medical School, 25 Shattuck St, Boston, MA 02115, USA.
| | - Yuesong Zou
- School of Computer Science, McGill University, 3480 Rue University, Montreal, QC H3A 2A7, Canada
| | - Aman Verma
- School of Population and Global Health, McGill University, 2001 McGill College Avenue, Montreal, Québec H3A 1G1, Canada
| | - David Buckeridge
- School of Population and Global Health, McGill University, 2001 McGill College Avenue, Montreal, Québec H3A 1G1, Canada.
| | - Yue Li
- School of Computer Science, McGill University, 3480 Rue University, Montreal, QC H3A 2A7, Canada.
| |
|
13
|
Havrilla JM, Singaravelu A, Driscoll DM, Minkovsky L, Helbig I, Medne L, Wang K, Krantz I, Desai BR. PheNominal: an EHR-integrated web application for structured deep phenotyping at the point of care. BMC Med Inform Decis Mak 2022; 22:198. [PMID: 35902925 PMCID: PMC9335954 DOI: 10.1186/s12911-022-01927-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/13/2022] [Accepted: 07/06/2022] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND Clinical phenotype information greatly facilitates genetic diagnostic interpretation pipelines in disease. While post-hoc extraction using natural language processing on unstructured clinical notes continues to improve, there is a need to improve point-of-care collection of patient phenotypes. Therefore, we developed "PheNominal", a point-of-care web application, embedded within Epic electronic health record (EHR) workflows, to permit capture of standardized phenotype data. METHODS Using bi-directional web services available within commercial EHRs, we developed a lightweight web application that allows users to rapidly browse and identify relevant terms from the Human Phenotype Ontology (HPO). Selected terms are saved discretely within the patient's EHR, permitting reuse both in clinical notes and in downstream diagnostic and research pipelines. RESULTS In the 16 months since implementation, PheNominal was used to capture discrete phenotype data for over 1500 individuals and 11,000 HPO terms during clinic and inpatient encounters for a genetic diagnostic consultation service within a quaternary-care pediatric academic medical center. An average of 7 HPO terms were captured per patient. Compared to a manual workflow, the average time to enter terms for a patient was reduced from 15 to 5 min per patient, and there were fewer annotation errors. CONCLUSIONS Modern EHRs support integration of external applications using application programming interfaces. We describe a practical application of these interfaces to facilitate deep phenotype capture in a discrete, structured format within a busy clinical workflow. Future versions will include a vendor-agnostic implementation using FHIR. We describe pilot efforts to integrate structured phenotyping through controlled dictionaries into diagnostic and research pipelines, reducing manual effort for phenotype documentation and reducing errors in data entry.
Affiliation(s)
- James M. Havrilla
- Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Anbumalar Singaravelu
- Emerging Technology and Transformation Team, Information Services, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Dennis M. Driscoll
- Emerging Technology and Transformation Team, Information Services, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Leonard Minkovsky
- Emerging Technology and Transformation Team, Information Services, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Ingo Helbig
- Division of Neurology, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA; The Epilepsy NeuroGenetics Initiative (ENGIN), Children’s Hospital of Philadelphia, Philadelphia, USA; Department of Biomedical and Health Informatics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA; Department of Neurology, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA 19104 USA
| | - Livija Medne
- Roberts Individualized Medical Genetics Center, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Kai Wang
- Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA; Department of Biomedical and Health Informatics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104 USA
| | - Ian Krantz
- Roberts Individualized Medical Genetics Center, Children’s Hospital of Philadelphia, Philadelphia, PA 19104 USA
| | - Bimal R. Desai
- Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104 USA
| |
|
14
|
Kaplan AD, Tipnis U, Beckham JC, Kimbrel NA, Oslin DW, McMahon BH. Continuous-Time Probabilistic Models for Longitudinal Electronic Health Records. J Biomed Inform 2022; 130:104084. [DOI: 10.1016/j.jbi.2022.104084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2021] [Revised: 03/18/2022] [Accepted: 04/25/2022] [Indexed: 10/18/2022]
|
15
|
Burgermaster M, Rodriguez VA. Psychosocial-Behavioral Phenotyping: A Novel Precision Health Approach to Modeling Behavioral, Psychological, and Social Determinants of Health Using Machine Learning. Ann Behav Med 2022; 56:1258-1271. [PMID: 35445699 DOI: 10.1093/abm/kaac012] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND The context in which a behavioral intervention is delivered is an important source of variability and systematic approaches are needed to identify and quantify contextual factors that may influence intervention efficacy. Machine learning-based phenotyping methods can contribute to a new precision health paradigm by informing personalized behavior interventions. Two primary goals of precision health, identifying population subgroups and highlighting behavioral intervention targets, can be addressed with psychosocial-behavioral phenotypes. We propose a method for psychosocial-behavioral phenotyping that models social determinants of health in addition to individual-level psychological and behavioral factors. PURPOSE To demonstrate a novel application of machine learning for psychosocial-behavioral phenotyping, the identification of subgroups with similar combinations of psychosocial characteristics. METHODS In this secondary analysis of psychosocial and behavioral data from a community cohort (n = 5,883), we optimized a multichannel mixed membership model (MC3M) using Bayesian inference to identify psychosocial-behavioral phenotypes and used logistic regression to determine which phenotypes were associated with elevated weight status (BMI ≥ 25 kg/m²). RESULTS We identified 20 psychosocial-behavioral phenotypes. Phenotypes were conceptually consistent as well as discriminative; most participants had only one active phenotype. Two phenotypes were significantly positively associated with elevated weight status; four phenotypes were significantly negatively associated. Each phenotype suggested different contextual considerations for intervention design. CONCLUSIONS By depicting the complexity of psychological and social determinants of health while also providing actionable insight about similarities and differences among members of the same community, psychosocial-behavioral phenotypes can identify potential intervention targets in context.
Affiliation(s)
- Marissa Burgermaster
- Department of Nutritional Sciences, College of Natural Sciences, University of Texas at Austin, Austin, TX, USA; Department of Population Health, Dell Medical School, University of Texas at Austin, Austin, TX, USA
| | - Victor A Rodriguez
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, NY, USA
| |
|
16
|
Uchida T, Fujiwara K, Nishioji K, Kobayashi M, Kano M, Seko Y, Yamaguchi K, Itoh Y, Kadotani H. Medical checkup data analysis method based on LiNGAM and its application to nonalcoholic fatty liver disease. Artif Intell Med 2022; 128:102310. [PMID: 35534147 DOI: 10.1016/j.artmed.2022.102310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2021] [Revised: 03/24/2022] [Accepted: 04/17/2022] [Indexed: 11/02/2022]
|
17
|
Kaplan AD, Cheng Q, Mohan KA, Nelson LD, Jain S, Levin H, Torres-Espin A, Chou A, Huie JR, Ferguson AR, McCrea M, Giacino J, Sundaram S, Markowitz AJ, Manley GT. Mixture Model Framework for Traumatic Brain Injury Prognosis Using Heterogeneous Clinical and Outcome Data. IEEE J Biomed Health Inform 2022; 26:1285-1296. [PMID: 34310331 PMCID: PMC8789941 DOI: 10.1109/jbhi.2021.3099745] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Prognoses of Traumatic Brain Injury (TBI) outcomes are neither easily nor accurately determined from clinical indicators. This is due in part to the heterogeneity of damage inflicted to the brain, ultimately resulting in diverse and complex outcomes. Using a data-driven approach on many distinct data elements may be necessary to describe this large set of outcomes and thereby robustly depict the nuanced differences among TBI patients' recovery. In this work, we develop a method for modeling large heterogeneous data types relevant to TBI. Our approach is geared toward the probabilistic representation of mixed continuous and discrete variables with missing values. The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings. In addition, it includes a set of clinical outcome assessments at 3, 6, and 12 months post-injury. The model is used to stratify patients into distinct groups in an unsupervised learning setting. We use the model to infer outcomes using input data, and show that the collection of input data reduces uncertainty of outcomes over a baseline approach. In addition, we quantify the performance of a likelihood scoring technique that can be used to self-evaluate the extrapolation risk of prognosis on unseen patients.
|
18
|
Kim M, Noh Y, Yamada A, Hong SH. Comparison of the Erectile Dysfunction Drugs Sildenafil and Tadalafil Using Patient Medication Reviews: Topic Modeling Study. JMIR Med Inform 2022; 10:e32689. [PMID: 35225813 PMCID: PMC8922152 DOI: 10.2196/32689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Revised: 10/22/2021] [Accepted: 11/17/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Topic modeling of patient medication reviews of erectile dysfunction (ED) drugs can help identify patient preferences regarding ED treatment options. The identification of a set of topics important to the patient from social network service drug reviews would inform the design of patient-centered medication counseling. OBJECTIVE This study aimed to (1) identify the distinctive topics from patient medication reviews unique to tadalafil versus sildenafil; (2) determine if the primary topics are distributed differently for each drug and for each patient characteristic (age and time on ED drug therapy); and (3) test if the primary topics affect satisfaction with ED drug therapy controlling for patient characteristics. METHODS Data were collected from the patient medication reviews of sildenafil and tadalafil posted on WebMD and Ask a Patient. The latent Dirichlet allocation method of natural language processing was used to identify 5 distinctive topics from the patient medication reviews on each drug. Analysis of variance and a 2-sample t test were conducted to compare the topic distribution and assess whether patient satisfaction varies with the primary topics, age, and time on medication for each ED drug. Statistical significance was tested at an alpha of .05. RESULTS The patient medication reviews of sildenafil (N=463) had 2 topics on treatment benefit and 1 each on medication safety, marketing claim, and treatment comparison, while the patient medication reviews of tadalafil (N=919) had 2 topics on medication safety and 1 each on the remaining subjects. Sildenafil's reviewers quite frequently (94/463, 20.4%) mentioned erection sustainability as their primary topic, whereas tadalafil's reviewers were more concerned about severe medication safety. Those who mentioned erection sustainability as their primary topic were quite satisfied with their treatment as opposed to those who mentioned severe medication safety as their primary topic (score 3.85 vs 2.44). 
The discovered topics reflected the marketing claims of blue magic and amber romance for sildenafil and tadalafil, respectively. The topic of blue magic was preferred among younger patients, while the topic of amber romance was preferred among older patients. The topic alternative choices, which appeared for both the ED drugs, reflected patient interest in the comparative effectiveness and price outside the drug labeling information. CONCLUSIONS The patient medication reviews of ED drugs reflect patient preferences regarding drug labeling information, marketing claims, and alternative treatment choices. The patient preferences concerning ED treatment attributes inform the design of patient-centered communication for improved ED drug therapy.
Affiliation(s)
- Maryanne Kim
- College of Pharmacy, Seoul National University, Seoul, Republic of Korea; Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, Republic of Korea
| | - Youran Noh
- College of Pharmacy, Seoul National University, Seoul, Republic of Korea; Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, Republic of Korea
| | - Akihiko Yamada
- College of Pharmacy, Seoul National University, Seoul, Republic of Korea
| | - Song Hee Hong
- College of Pharmacy, Seoul National University, Seoul, Republic of Korea; Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, Republic of Korea
| |
|
19
|
Hong C, Rush E, Liu M, Zhou D, Sun J, Sonabend A, Castro VM, Schubert P, Panickan VA, Cai T, Costa L, He Z, Link N, Hauser R, Gaziano JM, Murphy SN, Ostrouchov G, Ho YL, Begoli E, Lu J, Cho K, Liao KP, Cai T. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digit Med 2021; 4:151. [PMID: 34707226 PMCID: PMC8551205 DOI: 10.1038/s41746-021-00519-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 09/30/2020] [Accepted: 09/13/2021] [Indexed: 11/11/2022] Open
Abstract
The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from EHRs from two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. Besides, we developed an integrated clinical knowledge map combining embedding data from both institutions. The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Features identified via KESER resulted in comparable performance to those built upon features selected manually or with patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately compared to those identified using single institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses providing a significant advance in enabling multi-center studies using EHR data.
Affiliation(s)
- Chuan Hong
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
| | - Everett Rush
- Department of Energy, Oak Ridge National Lab, Oak Ridge, TN, USA
| | - Molei Liu
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | | | - Jiehuan Sun
- University of Illinois at Chicago, Chicago, IL, USA
| | - Aaron Sonabend
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | | | | | | | - Tianrun Cai
- VA Boston Healthcare System, Boston, MA, USA
- Mass General Brigham, Boston, MA, USA
| | | | - Zeling He
- Mass General Brigham, Boston, MA, USA
| | | | | | - J Michael Gaziano
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
| | | | | | - Yuk-Lam Ho
- VA Boston Healthcare System, Boston, MA, USA
| | - Edmon Begoli
- Department of Energy, Oak Ridge National Lab, Oak Ridge, TN, USA
| | - Junwei Lu
- VA Boston Healthcare System, Boston, MA, USA
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Kelly Cho
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
| | - Katherine P Liao
- Harvard Medical School, Boston, MA, USA
- VA Boston Healthcare System, Boston, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
| | - Tianxi Cai
- Harvard Medical School, Boston, MA, USA.
- VA Boston Healthcare System, Boston, MA, USA.
- Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| |
|
20
|
Ziletti A, Berns C, Treichel O, Weber T, Liang J, Kammerath S, Schwaerzler M, Virayah J, Ruau D, Ma X, Mattern A. Discovering Key Topics From Short, Real-World Medical Inquiries via Natural Language Processing. Frontiers in Computer Science 2021. [DOI: 10.3389/fcomp.2021.672867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022] Open
Abstract
Millions of unsolicited medical inquiries are received by pharmaceutical companies every year. It has been hypothesized that these inquiries represent a treasure trove of information, potentially giving insight into matters regarding medicinal products and the associated medical treatments. However, due to the large volume and specialized nature of the inquiries, it is difficult to perform timely, recurrent, and comprehensive analyses. Here, we combine biomedical word embeddings, non-linear dimensionality reduction, and hierarchical clustering to automatically discover key topics in real-world medical inquiries from customers. This approach does not require ontologies nor annotations. The discovered topics are meaningful and medically relevant, as judged by medical information specialists, thus demonstrating that unsolicited medical inquiries are a source of valuable customer insights. Our work paves the way for the machine-learning-driven analysis of medical inquiries in the pharmaceutical industry, which ultimately aims at improving patient care.
|
21
|
De Freitas JK, Johnson KW, Golden E, Nadkarni GN, Dudley JT, Bottinger EP, Glicksberg BS, Miotto R. Phe2vec: Automated disease phenotyping based on unsupervised embeddings from electronic health records. Patterns (New York, N.Y.) 2021; 2:100337. [PMID: 34553174 PMCID: PMC8441576 DOI: 10.1016/j.patter.2021.100337] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/29/2021] [Revised: 06/30/2021] [Accepted: 08/05/2021] [Indexed: 11/23/2022]
Abstract
Robust phenotyping of patients from electronic health records (EHRs) at scale is a challenge in clinical informatics. Here, we introduce Phe2vec, an automated framework for disease phenotyping from EHRs based on unsupervised learning and assess its effectiveness against standard rule-based algorithms from Phenotype KnowledgeBase (PheKB). Phe2vec is based on pre-computing embeddings of medical concepts and patients' clinical history. Disease phenotypes are then derived from a seed concept and its neighbors in the embedding space. Patients are linked to a disease if their embedded representation is close to the disease phenotype. Comparing Phe2vec and PheKB cohorts head-to-head using chart review, Phe2vec performed on par or better in nine out of ten diseases. Differently from other approaches, it can scale to any condition and was validated against widely adopted expert-based standards. Phe2vec aims to optimize clinical informatics research by augmenting current frameworks to characterize patients by condition and derive reliable disease cohorts.
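The seed-and-neighbors idea described in this abstract can be illustrated with a small sketch. It is not the Phe2vec implementation: the cosine similarity measure, the mean-vector phenotype, the similarity threshold, the function names, and the toy 2-D embeddings are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def phenotype_from_seed(seed, concept_vecs, k=2):
    """Return the seed concept plus its k nearest neighbors (hypothetical helper)."""
    others = [c for c in concept_vecs if c != seed]
    ranked = sorted(others,
                    key=lambda c: cosine(concept_vecs[seed], concept_vecs[c]),
                    reverse=True)
    return [seed] + ranked[:k]

def mean_vector(vectors):
    """Element-wise mean, used here as an assumed phenotype representation."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def link_patients(patient_vecs, phenotype_vec, threshold=0.8):
    """Link patients whose embedding is close to the phenotype (assumed threshold)."""
    return sorted(p for p, v in patient_vecs.items()
                  if cosine(v, phenotype_vec) >= threshold)

# Toy embeddings, invented for this example only.
concepts = {
    "T2DM": [1.0, 0.1],
    "metformin": [0.9, 0.2],
    "HbA1c": [0.8, 0.3],
    "fracture": [0.0, 1.0],
}
patients = {"p1": [0.95, 0.15], "p2": [0.1, 0.9]}

phenotype = phenotype_from_seed("T2DM", concepts, k=2)
phenotype_vec = mean_vector([concepts[c] for c in phenotype])
cohort = link_patients(patients, phenotype_vec)
print(phenotype, cohort)
```

In this toy setup the phenotype gathers the concepts nearest to the seed, and only the patient whose vector points in the same direction is linked to it.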
Affiliation(s)
- Jessica K. De Freitas
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Kipp W. Johnson
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Eddye Golden
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Girish N. Nadkarni
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Joel T. Dudley
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Erwin P. Bottinger
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Digital Health Center at Hasso Plattner Institute, University of Potsdam, Professor-Dr.-Helmert-Str 2–3, 14482 Potsdam, Germany
| | - Benjamin S. Glicksberg
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| | - Riccardo Miotto
- Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029, USA
| |
|
22
|
Abstract
Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial, including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.
Affiliation(s)
- Irene Y Chen
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | | | - Marzyeh Ghassemi
- Vector Institute, Toronto, Ontario M5G 1M1, Canada; Institute for Medical and Evaluative Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Rajesh Ranganath
- Department of Computer Science, Courant Institute, New York University, New York, NY 10012, USA; Center for Data Science, New York University, New York, NY 10012, USA; Department of Population Health, New York University Grossman School of Medicine, New York, NY 10016, USA
| |
|
23
|
Evaluation of clustering and topic modeling methods over health-related tweets and emails. Artif Intell Med 2021; 117:102096. [PMID: 34127235 DOI: 10.1016/j.artmed.2021.102096] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/31/2020] [Revised: 03/30/2021] [Accepted: 05/05/2021] [Indexed: 01/31/2023]
Abstract
BACKGROUND The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources to shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches have suboptimal performance on these short texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partially due to the absence of a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts. METHODS We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels). RESULTS Overall, for number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices.
Also, of all tweets (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), for k = 2, most of the methods could respect this initial clustering distribution. However, we found that model performance varies with the source of data and hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets. CONCLUSIONS Researchers hoping to group or classify health-related short-text data can expect to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggested researchers can improve their analysis of model performance by using a variety of metrics, since there is not a single best metric.
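As an illustration of what an external cluster validity index measures, the following sketch computes the Rand index, which scores a clustering by the fraction of item pairs on which it agrees with known class labels. The abstract does not say which five external indices were used, so the choice of the Rand index, the function name, and the toy labels are assumptions made for this example.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of item pairs treated consistently by both labelings.

    A pair counts as an agreement when both labelings put the two items
    together, or both put them apart.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two items to form a pair")
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Toy example: four short texts with known topics and two candidate clusterings.
known_topics = ["hpv", "hpv", "hpv", "lynch"]
imperfect_clusters = [0, 0, 1, 1]
perfect_clusters = [5, 5, 5, 9]

print(rand_index(known_topics, imperfect_clusters))
print(rand_index(known_topics, perfect_clusters))
```

Cluster identifiers need not match the class names, because only pairwise co-membership is compared. Internal indices, by contrast, use no reference labels at all.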
|
24
|
Malec SA, Wei P, Bernstam EV, Boyce RD, Cohen T. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform 2021; 117:103719. [PMID: 33716168 PMCID: PMC8559730 DOI: 10.1016/j.jbi.2021.103719] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/15/2020] [Revised: 12/31/2020] [Accepted: 01/04/2021] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data. METHODS We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying semantic constraint search for indications treated by the drug (exposure) and that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2 M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ² and reporting odds ratio) and with each other.
RESULTS AND CONCLUSIONS We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.
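The two naive measures of association named in the methods can be computed directly from a 2x2 table of dichotomous exposure and outcome. The sketch below is an illustration only: the function names and the counts are hypothetical, the chi-square is the plain Pearson statistic without continuity correction, and all four cells are assumed to be nonzero.

```python
def reporting_odds_ratio(a, b, c, d):
    """ROR for a 2x2 table.

    a: exposed with event      b: exposed without event
    c: unexposed with event    d: unexposed without event
    """
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts: 100 notes mentioning the drug, 100 that do not.
a, b, c, d = 20, 80, 10, 90

print(reporting_odds_ratio(a, b, c, d))
print(round(chi_square_2x2(a, b, c, d), 3))
```

Neither measure adjusts for confounders, which is why they serve as the unadjusted baseline against which the adjusted models are compared.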
Affiliation(s)
- Scott A Malec
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States.
- Peng Wei
  - The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States
- Elmer V Bernstam
  - University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX, United States
- Richard D Boyce
  - University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
- Trevor Cohen
  - University of Washington, Department of Biomedical Informatics and Medical Education, Seattle, WA, United States
|
25
|
Mining heterogeneous clinical notes by multi-modal latent topic model. PLoS One 2021; 16:e0249622. [PMID: 33831055 PMCID: PMC8031429 DOI: 10.1371/journal.pone.0249622] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/04/2020] [Accepted: 03/22/2021] [Indexed: 11/19/2022] Open
Abstract
Latent knowledge can be extracted from the electronic notes that are recorded during patient encounters with the health system. Using these clinical notes to decipher a patient’s underlying comorbidities, symptom burdens, and treatment courses is an ongoing challenge. Latent topic models, an efficient Bayesian method, can be used to model each patient’s clinical notes as “documents” and the words in the notes as “tokens”. However, standard latent topic models assume that all of the notes follow the same topic distribution, regardless of the type of note or the domain expertise of the author (such as doctors or nurses). We propose a novel application of latent topic modeling, using a multi-note topic model (MNTM) to jointly infer distinct topic distributions of notes of different types. We applied our model to clinical notes from the MIMIC-III dataset to infer distinct topic distributions over the physician and nursing note types. Based on manual assessments made by clinicians, we observed a significant improvement in topic interpretability using MNTM modeling over the baseline single-note topic models that ignore the note types. Moreover, our MNTM model led to a significantly higher prediction accuracy for prolonged mechanical ventilation and mortality using only the first 48 hours of patient data. By correlating the patients’ topic mixture with hospital mortality and prolonged mechanical ventilation, we identified several diagnostic topics that are associated with poor outcomes. Because of its elegant and intuitive formation, we envision a broad application of our approach in mining multi-modality text-based healthcare information that goes beyond clinical notes. Code available at https://github.com/li-lab-mcgill/heterogeneous_ehr.
|
26
|
Ahuja Y, Zhou D, He Z, Sun J, Castro VM, Gainer V, Murphy SN, Hong C, Cai T. sureLDA: A multidisease automated phenotyping method for the electronic health record. J Am Med Inform Assoc 2021; 27:1235-1243. [PMID: 32548637 DOI: 10.1093/jamia/ocaa079] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 08/08/2019] [Revised: 03/12/2020] [Accepted: 04/28/2020] [Indexed: 01/20/2023] Open
Abstract
OBJECTIVE A major bottleneck hindering utilization of electronic health record data for translational research is the lack of precise phenotype labels. Chart review as well as rule-based and supervised phenotyping approaches require laborious expert input, hampering applicability to studies that require many phenotypes to be defined and labeled de novo. Though International Classification of Diseases codes are often used as surrogates for true labels in this setting, these sometimes suffer from poor specificity. We propose a fully automated topic modeling algorithm to simultaneously annotate multiple phenotypes. MATERIALS AND METHODS Surrogate-guided ensemble latent Dirichlet allocation (sureLDA) is a label-free multidimensional phenotyping method. It first uses the PheNorm algorithm to initialize probabilities based on 2 surrogate features for each target phenotype, and then leverages these probabilities to constrain the LDA topic model to generate phenotype-specific topics. Finally, it combines phenotype-feature counts with surrogates via clustering ensemble to yield final phenotype probabilities. RESULTS sureLDA achieves reliably high accuracy and precision across a range of simulated and real-world phenotypes. Its performance is robust to phenotype prevalence and relative informativeness of surrogate vs nonsurrogate features. It also exhibits powerful feature selection properties. DISCUSSION sureLDA combines attractive properties of PheNorm and LDA to achieve high accuracy and precision robust to diverse phenotype characteristics. It offers particular improvement for phenotypes insufficiently captured by a few surrogate features. Moreover, sureLDA's feature selection ability enables it to handle high feature dimensions and produce interpretable computational phenotypes. CONCLUSIONS sureLDA is well suited toward large-scale electronic health record phenotyping for highly multiphenotype applications such as phenome-wide association studies.
Affiliation(s)
- Yuri Ahuja
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
- Doudou Zhou
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Statistics, University of California, Davis, Davis, California, USA
- Zeling He
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Jiehuan Sun
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Massachusetts Veterans Epidemiology Research and Information Center, VA Boston Healthcare System, Boston, Massachusetts, USA
- Vivian Gainer
  - Partners HealthCare, Charlestown, Massachusetts, USA
- Shawn N Murphy
  - Harvard Medical School, Boston, Massachusetts, USA
  - Partners HealthCare, Charlestown, Massachusetts, USA
- Chuan Hong
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Harvard Medical School, Boston, Massachusetts, USA
  - Massachusetts Veterans Epidemiology Research and Information Center, VA Boston Healthcare System, Boston, Massachusetts, USA
|
27
|
Sheng JQ, Hu PJH, Liu X, Huang TS, Chen YH. Predictive Analytics for Care and Management of Patients With Acute Diseases: Deep Learning-Based Method to Predict Crucial Complication Phenotypes. J Med Internet Res 2021; 23:e18372. [PMID: 33576744 PMCID: PMC7910123 DOI: 10.2196/18372] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/22/2020] [Revised: 09/13/2020] [Accepted: 12/21/2020] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Acute diseases present severe complications that develop rapidly, exhibit distinct phenotypes, and have profound effects on patient outcomes. Predictive analytics can enhance physicians' care and management of patients with acute diseases by predicting crucial complication phenotypes for a timely diagnosis and treatment. However, effective phenotype predictions require several challenges to be overcome. First, patient data collected in the early stages of an acute disease (eg, clinical data and laboratory results) are less informative for predicting phenotypic outcomes. Second, patient data are temporal and heterogeneous; for example, patients receive laboratory tests at different time intervals and frequencies. Third, imbalanced distributions of patient outcomes create additional complexity for predicting complication phenotypes. OBJECTIVE To predict crucial complication phenotypes among patients with acute diseases, we propose a novel, deep learning-based method that uses recurrent neural network-based sequence embedding to represent disease progression while considering temporal heterogeneities in patient data. Our method incorporates a latent regulator to alleviate data insufficiency constraints by accounting for the underlying mechanisms that are not observed in patient data. The proposed method also includes cost-sensitive learning to address imbalanced outcome distributions in patient data for improved predictions. METHODS From a major health care organization in Taiwan, we obtained a sample of 10,354 electronic health records that pertained to 6545 patients with peritonitis. The proposed method projects these temporal, heterogeneous, and clinical data into a substantially reduced feature space and then incorporates a latent regulator (latent parameter matrix) to obviate data insufficiencies and account for variations in phenotypic expressions. Moreover, our method employs cost-sensitive learning to further increase the predictive performance. 
RESULTS We evaluated the efficacy of the proposed method for predicting two hepatic complication phenotypes in patients with peritonitis: acute hepatic encephalopathy and hepatorenal syndrome. The following three benchmark techniques were evaluated: temporal multiple measurement case-based reasoning (MMCBR), temporal short long-term memory (T-SLTM) networks, and time fusion convolutional neural network (CNN). For acute hepatic encephalopathy predictions, our method attained an area under the curve (AUC) value of 0.82, which outperforms temporal MMCBR by 64%, T-SLTM by 26%, and time fusion CNN by 26%. For hepatorenal syndrome predictions, our method achieved an AUC value of 0.64, which is 29% better than that of temporal MMCBR (0.54). Overall, the evaluation results show that the proposed method significantly outperforms all the benchmarks, as measured by recall, F-measure, and AUC while maintaining comparable precision values. CONCLUSIONS The proposed method learns a short-term temporal representation from patient data to predict complication phenotypes and offers greater predictive utilities than prevalent data-driven techniques. This method is generalizable and can be applied to different acute disease (illness) scenarios that are characterized by insufficient patient clinical data availability, temporal heterogeneities, and imbalanced distributions of important patient outcomes.
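Cost-sensitive learning for imbalanced outcomes is often implemented by weighting errors on the rare class more heavily in the loss. The sketch below shows one generic form, a class-weighted binary cross-entropy. It is not the authors' formulation, which the abstract does not specify; the function name, the `pos_weight` parameter, and the toy values are all assumptions made for illustration.

```python
import math


def weighted_log_loss(y_true, y_prob, pos_weight=1.0):
    """Mean binary cross-entropy with errors on positives scaled by pos_weight."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical imbalanced batch: one complication case among four patients,
# and a model that under-predicts the rare positive.
labels = [1, 0, 0, 0]
predicted = [0.2, 0.1, 0.1, 0.1]

plain = weighted_log_loss(labels, predicted, pos_weight=1.0)
costly = weighted_log_loss(labels, predicted, pos_weight=3.0)
print(round(plain, 4), round(costly, 4))
```

With `pos_weight` above 1, the missed positive contributes more to the loss, so training is pushed toward higher recall on the rare outcome, usually at some cost in precision.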
Affiliation(s)
- Jessica Qiuhua Sheng
  - Department of Operations and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, UT, United States
- Paul Jen-Hwa Hu
  - Department of Operations and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, UT, United States
- Xiao Liu
  - Department of Information Systems, WP Carey School of Business, Arizona State University, Phoenix, AZ, United States
- Ting-Shuo Huang
  - Department of General Surgery and Community Medicine Research Center, Keelung Chang Gung Memorial Hospital, Keelung, Taiwan
- Yu Hsien Chen
  - Department of Chinese Medicine, College of Medicine, Chang Gung University, Taoyuan, Chang Gung, Taiwan
|
28
|
Saeedi A, Yadollahpour P, Singla S, Pollack B, Wells W, Sciurba F, Batmanghelich K. Incorporating External Information in Tissue Subtyping: A Topic Modeling Approach. PROCEEDINGS OF MACHINE LEARNING RESEARCH 2021; 149:478-505. [PMID: 35098143 PMCID: PMC8797254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Probabilistic topic models have been widely deployed for various applications such as learning disease or tissue subtypes. Yet, learning the parameters of such models is usually an ill-posed problem and may result in losing valuable information about disease severity. A common approach is to add a discriminative loss term to the generative model's loss in order to learn a representation that is also predictive of disease severity. However, finding a balance between these two losses is not straightforward. We propose an alternative way in this paper. We develop a framework which allows for incorporating external covariates into the generative model's approximate posterior. These covariates can have more discriminative power for disease severity compared to the representation that we extract from the posterior distribution. For instance, they can be features extracted from a neural network which predicts disease severity from CT images. Effectively, we enforce the generative model's approximate posterior to reside in the subspace of these discriminative covariates. We illustrate our method's application on a large-scale lung CT study of Chronic Obstructive Pulmonary Disease (COPD), a highly heterogeneous disease. We aim at identifying tissue subtypes by using a variant of topic model as a generative model. We quantitatively evaluate the predictive performance of the inferred subtypes and demonstrate that our method outperforms or performs on par with some reasonable baselines. We also show that some of the discovered subtypes are correlated with genetic measurements, suggesting that the identified subtypes may characterize the disease's underlying etiology.
Affiliation(s)
- William Wells
  - Harvard Medical School / Brigham and Women's Hospital
|
29
|
Zhuo L, Cheng Y, Liu S, Yang Y, Tang S, Zhen J, Zhao J, Zhan S. A Multiview Model for Detecting the Inappropriate Use of Prescription Medication: Machine Learning Approach. JMIR Med Inform 2020; 8:e16312. [PMID: 32209527 PMCID: PMC7381037 DOI: 10.2196/16312] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/18/2019] [Revised: 01/18/2020] [Accepted: 03/24/2020] [Indexed: 01/22/2023] Open
Abstract
Background The inappropriate use of prescription medication has recently garnered worldwide attention, but most national policies do not effectively provide for early detection or timely intervention. Objective This study aimed to develop and assess the validity of a model that can detect the inappropriate use of prescription medication. This effort combines a multiview and topic matching method. The study also assessed the validity of this approach. Methods A multiview extension of the latent Dirichlet allocation algorithm for topic modeling was chosen to generate diagnosis-medication topics, with data obtained from the Chinese Monitoring Network for Rational Use of Drugs (CMNRUD) database. Topic mapping allowed for calculating the degree to which diagnoses and medications were similarly distributed and, by setting a threshold, for identifying prescription misuse. The Beijing Regional Prescription Review Database (BRPRD) was used as the gold standard to assess the model’s validity. We also conducted a sensitivity analysis using random samples of validated prescriptions and evaluated the model’s performance. Results A total of 44 million prescriptions were used to generate topics using the diagnoses and medications from the CMNRUD database. A random sample (15,000 prescriptions) from the BRPRD was used for validation, and it was found that the model had a sensitivity of 81.8%, specificity of 47.4%, positive-predictive value of 14.5%, and negative-predictive value of 96.0%. The model showed superior stability under different sampling proportions. Conclusions A method that combines multiview topic modeling and topic matching can detect the inappropriate use of prescription medication. This model, which has mediocre specificity and moderate sensitivity, can be used as a primary screening tool and will likely complement and improve the process of manually reviewing prescriptions.
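The four validation figures reported above all derive from a single confusion matrix of model flags against the gold standard. The following minimal sketch shows the arithmetic; the counts are hypothetical and are not the study's data, and the function name is an assumption.

```python
def screening_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # flagged share of truly inappropriate prescriptions
        "specificity": tn / (tn + fp),  # unflagged share of appropriate prescriptions
        "ppv": tp / (tp + fp),          # share of flags that are truly inappropriate
        "npv": tn / (tn + fn),          # share of non-flags that are truly appropriate
    }


# Hypothetical counts: 10 inappropriate and 90 appropriate prescriptions.
metrics = screening_metrics(tp=9, fp=40, fn=1, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When inappropriate prescriptions are rare, a low positive-predictive value can coexist with a high negative-predictive value even at good sensitivity, which is the pattern that makes a model of this kind more useful for first-pass screening than for final adjudication.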
Affiliation(s)
- Lin Zhuo
  - Research Center of Clinical Epidemiology, Peking University Third Hospital, Beijing, China
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
- Yinchu Cheng
  - Department of Pharmacy, Peking University Third Hospital, Beijing, China
- Shaoqin Liu
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, China
- Yu Yang
  - Center for Data Science in Medicine and Health, Peking University, Beijing, China
- Shuang Tang
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, China
- Jiancun Zhen
  - Department of Pharmacy, Ji Shui Tan Hospital and Fourth Medical College of Peking University, Beijing, China
- Junfeng Zhao
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, China
- Siyan Zhan
  - Research Center of Clinical Epidemiology, Peking University Third Hospital, Beijing, China
  - Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
|
30
|
Burgermaster M, Son JH, Davidson PG, Smaldone AM, Kuperman G, Feller DJ, Burt KG, Levine ME, Albers DJ, Weng C, Mamykina L. A new approach to integrating patient-generated data with expert knowledge for personalized goal setting: A pilot study. Int J Med Inform 2020; 139:104158. [PMID: 32388157 PMCID: PMC7332366 DOI: 10.1016/j.ijmedinf.2020.104158] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/25/2019] [Revised: 02/19/2020] [Accepted: 04/23/2020] [Indexed: 12/17/2022]
Abstract
INTRODUCTION Self-monitoring technologies produce patient-generated data that could be leveraged to personalize nutritional goal setting to improve population health; however, most computational approaches are limited when applied to individual-level personalization with sparse and irregular self-monitoring data. We applied informatics methods from expert suggestion systems to a challenging clinical problem: generating personalized nutrition goals from patient-generated diet and blood glucose data. MATERIALS AND METHODS We applied qualitative process coding and decision tree modeling to understand how registered dietitians translate patient-generated data into recommendations for dietary self-management of diabetes (i.e., knowledge model). We encoded this process in a set of functions that take diet and blood glucose data as an input and output diet recommendations (i.e., inference engine). Dietitians assessed face validity. Using four patient datasets, we compared our inference engine's output to clinical narratives and gold standards developed by expert clinicians. RESULTS To dietitians, the knowledge model represented how recommendations from patient data are made. Inference engine recommendations were 63 % consistent with the gold standard (range = 42 %-75 %) and 74 % consistent with narrative clinical observations (range = 63 %-83 %). DISCUSSION Qualitative modeling and automating how dietitians reason over patient data resulted in a knowledge model representing clinical knowledge. However, our knowledge model was less consistent with gold standard than narrative clinical recommendations, raising questions about how best to evaluate approaches that integrate patient-generated data with expert knowledge. 
CONCLUSION New informatics approaches that integrate data-driven methods with expert decision making for personalized goal setting, such as the knowledge base and inference engine presented here, demonstrate the potential to extend the reach of patient-generated data by synthesizing it with clinical knowledge. However, important questions remain about the strengths and weaknesses of computer algorithms developed to discern signal from patient-generated data compared to human experts.
Affiliation(s)
- Marissa Burgermaster
  - Nutritional Sciences & Population Health, University of Texas at Austin, Austin, TX, USA; Biomedical Informatics, Columbia University, New York, NY, USA.
- Jung H Son
  - Biomedical Informatics, Columbia University, New York, NY, USA
- Arlene M Smaldone
  - School of Nursing & College of Dental Medicine, Columbia University, New York, NY, USA
- Gilad Kuperman
  - Biomedical Informatics, Columbia University, New York, NY, USA; Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Daniel J Feller
  - Biomedical Informatics, Columbia University, New York, NY, USA
- David J Albers
  - Biomedical Informatics, Columbia University, New York, NY, USA; Pediatrics & Informatics, University of Colorado, Aurora, CO, USA
- Chunhua Weng
  - Biomedical Informatics, Columbia University, New York, NY, USA
- Lena Mamykina
  - Biomedical Informatics, Columbia University, New York, NY, USA
|
31
|
Reducing Bias Due to Outcome Misclassification for Epidemiologic Studies Using EHR-derived Probabilistic Phenotypes. Epidemiology 2020; 31:542-550. [DOI: 10.1097/ede.0000000000001193] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
|
32
|
Urteaga I, McKillop M, Elhadad N. Learning endometriosis phenotypes from patient-generated data. NPJ Digit Med 2020; 3:88. [PMID: 32596513 PMCID: PMC7314826 DOI: 10.1038/s41746-020-0292-9] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/26/2019] [Accepted: 05/26/2020] [Indexed: 12/19/2022] Open
Abstract
Endometriosis is a systemic and chronic condition in women of childbearing age, yet a highly enigmatic disease with unresolved questions: there are no known biomarkers, nor established clinical stages. We here investigate the use of patient-generated health data and data-driven phenotyping to characterize endometriosis patient subtypes, based on their reported signs and symptoms. We aim at unsupervised learning of endometriosis phenotypes using self-tracking data from personal smartphones. We leverage data from an observational research study of over 4000 women with endometriosis that track their condition over more than 2 years. We extend a classical mixed-membership model to accommodate the idiosyncrasies of the data at hand, i.e., the multimodality and uncertainty of the self-tracked variables. The proposed method, by jointly modeling a wide range of observations (i.e., participant symptoms, quality of life, treatments), identifies clinically relevant endometriosis subtypes. Experiments show that our method is robust to different hyperparameter choices and the biases of self-tracking data (e.g., the wide variations in tracking frequency among participants). With this work, we show the promise of unsupervised learning of endometriosis subtypes from self-tracked data, as learned phenotypes align well with what is already known about the disease, but also suggest new clinically actionable findings. More generally, we argue that a continued research effort on unsupervised phenotyping methods with patient-generated health data via new mobile and digital technologies will have significant impact on the study of enigmatic diseases in particular, and health in general.
Affiliation(s)
- Iñigo Urteaga
  - Department of Applied Physics and Applied Mathematics, Columbia University, New York, NY 10027 USA
  - Data Science Institute, Columbia University, New York, NY 10027 USA
- Mollie McKillop
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032 USA
- Noémie Elhadad
  - Data Science Institute, Columbia University, New York, NY 10027 USA
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032 USA
|
33
|
Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A Review of Challenges and Opportunities in Machine Learning for Health. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:191-200. [PMID: 32477638 PMCID: PMC7233077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Modern electronic health records (EHRs) provide data to answer clinically meaningful questions. The growing data in EHRs makes healthcare ripe for the use of machine learning. However, learning in a clinical setting presents unique challenges that complicate the use of common machine learning methodologies. For example, diseases in EHRs are poorly labeled, conditions can encompass multiple underlying endotypes, and healthy individuals are underrepresented. This article serves as a primer to illuminate these challenges and highlights opportunities for members of the machine learning community to contribute to healthcare.
Affiliation(s)
- Irene Y Chen
  - Massachusetts Institute of Technology, Cambridge, MA, USA
|
34
|
Li Y, Nair P, Lu XH, Wen Z, Wang Y, Dehaghi AAK, Miao Y, Liu W, Ordog T, Biernacka JM, Ryu E, Olson JE, Frye MA, Liu A, Guo L, Marelli A, Ahuja Y, Davila-Velderrain J, Kellis M. Inferring multimodal latent topics from electronic health records. Nat Commun 2020; 11:2536. [PMID: 32439869 PMCID: PMC7242436 DOI: 10.1038/s41467-020-16378-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 07/23/2019] [Accepted: 04/23/2020] [Indexed: 11/10/2022] Open
Abstract
Electronic health records (EHRs) are rich heterogeneous collections of patient health information, whose broad adoption provides clinicians and researchers unprecedented opportunities for health informatics, disease-risk prediction, actionable clinical recommendations, and precision medicine. However, EHRs present several modeling challenges, including highly sparse data matrices, noisy irregular clinical notes, arbitrary biases in billing code assignment, diagnosis-driven lab tests, and heterogeneous data types. To address these challenges, we present MixEHR, a multi-view Bayesian topic model. We demonstrate MixEHR on MIMIC-III, Mayo Clinic Bipolar Disorder, and Quebec Congenital Heart Disease EHR datasets. Qualitatively, MixEHR disease topics reveal meaningful combinations of clinical features across heterogeneous data types. Quantitatively, we observe superior prediction accuracy of diagnostic codes and lab test imputations compared to the state-of-the-art methods. We leverage the inferred patient topic mixtures to classify target diseases and predict mortality of patients in critical conditions. In all comparisons, MixEHR confers competitive performance and reveals meaningful disease-related topics.
Collapse
Affiliation(s)
- Yue Li
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada.
| | - Pratheeksha Nair
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | - Xing Han Lu
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | - Zhi Wen
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | - Yuening Wang
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | | | - Yan Miao
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | - Weiqi Liu
- School of Computer Science and McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, H3A0E9, Canada
| | - Tamas Ordog
- Department of Physiology and Biomedical Engineering and Division of Gastroenterology and Hepatology, Department of Medicine, and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
| | - Joanna M Biernacka
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Department of Psychiatry and Psychology, Mayo Clinic, Rochester, MN, USA
| | - Euijung Ryu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Janet E Olson
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Mark A Frye
- Department of Psychiatry and Psychology, Mayo Clinic, Rochester, MN, USA
- Aihua Liu
- McGill Adult Unit for Congenital Heart Disease Excellence (MAUDE Unit), Montreal, QC H4A 3J1, Quebec, Canada
- Liming Guo
- McGill Adult Unit for Congenital Heart Disease Excellence (MAUDE Unit), Montreal, QC H4A 3J1, Quebec, Canada
- Ariane Marelli
- McGill Adult Unit for Congenital Heart Disease Excellence (MAUDE Unit), Montreal, QC H4A 3J1, Quebec, Canada
- Yuri Ahuja
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA, 02139, USA
- Jose Davila-Velderrain
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA, 02139, USA
- Manolis Kellis
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA, 02139, USA.
- The Broad Institute of Harvard and MIT, 415 Main Street, Cambridge, MA, 02142, USA.
|
35
|
Lustgarten JL, Zehnder A, Shipman W, Gancher E, Webb TL. Veterinary informatics: forging the future between veterinary medicine, human medicine, and One Health initiatives - a joint paper by the Association for Veterinary Informatics (AVI) and the CTSA One Health Alliance (COHA). JAMIA Open 2020; 3:306-317. [PMID: 32734172 PMCID: PMC7382640 DOI: 10.1093/jamiaopen/ooaa005] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 08/14/2019] [Revised: 12/26/2019] [Accepted: 02/26/2020] [Indexed: 12/25/2022]
Abstract
Objectives This manuscript reviews the current state of veterinary medical electronic health records and the ability to aggregate and analyze large datasets from multiple organizations and clinics. We also review analytical techniques as well as research efforts into veterinary informatics with a focus on applications relevant to human and animal medicine. Our goal is to provide references and context for these resources so that researchers can identify resources of interest and translational opportunities to advance the field. Methods and Results This review covers various methods of veterinary informatics including natural language processing and machine learning techniques in brief and various ongoing and future projects. After detailing techniques and sources of data, we describe some of the challenges and opportunities within veterinary informatics as well as providing reviews of common One Health techniques and specific applications that affect both humans and animals. Discussion Current limitations in the field of veterinary informatics include limited sources of training data for developing machine learning and artificial intelligence algorithms, siloed data between academic institutions, corporate institutions, and many small private practices, and inconsistent data formats that make many integration problems difficult. Despite those limitations, there have been significant advancements in the field in the last few years and continued development of a few, key, large data resources that are available for interested clinicians and researchers. These real-world use cases and applications show current and significant future potential as veterinary informatics grows in importance. Veterinary informatics can forge new possibilities within veterinary medicine and between veterinary medicine, human medicine, and One Health initiatives.
Affiliation(s)
- Jonathan L Lustgarten
- Association for Veterinary Informatics, Dixon, California, USA; VCA Inc., Health Technology & Informatics, Los Angeles, California, USA
- Wayde Shipman
- Veterinary Medical Databases, Columbia, Missouri, USA
- Elizabeth Gancher
- Department of Infectious diseases and HIV medicine, Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Tracy L Webb
- Department of Clinical Sciences, Colorado State University, Fort Collins, Colorado, USA
|
36
|
Bollig N, Clarke L, Elsmo E, Craven M. Machine learning for syndromic surveillance using veterinary necropsy reports. PLoS One 2020; 15:e0228105. [PMID: 32023271 PMCID: PMC7001958 DOI: 10.1371/journal.pone.0228105] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/03/2019] [Accepted: 01/07/2020] [Indexed: 12/02/2022]
Abstract
The use of natural language data for animal population surveillance represents a valuable opportunity to gather information about potential disease outbreaks, emerging zoonotic diseases, or bioterrorism threats. In this study, we evaluate machine learning methods for conducting syndromic surveillance using free-text veterinary necropsy reports. We train a system to detect if a necropsy report from the Wisconsin Veterinary Diagnostic Laboratory contains evidence of gastrointestinal, respiratory, or urinary pathology. We evaluate the performance of several machine learning algorithms including deep learning with a long short-term memory network. Although no single algorithm was superior, random forest using feature vectors of TF-IDF statistics ranked among the top-performing models with F1 scores of 0.923 (gastrointestinal), 0.960 (respiratory), and 0.888 (urinary). This model was applied to over 33,000 necropsy reports and was used to describe temporal and spatial features of diseases within a 14-year period, exposing epidemiological trends and detecting a potential focus of gastrointestinal disease from a single submitting producer in the fall of 2016.
Affiliation(s)
- Nathan Bollig
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, United States of America
- Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, United States of America
- Lorelei Clarke
- Wisconsin Veterinary Diagnostic Laboratory, University of Wisconsin-Madison, Madison, WI, United States of America
- Elizabeth Elsmo
- Wisconsin Veterinary Diagnostic Laboratory, University of Wisconsin-Madison, Madison, WI, United States of America
- Mark Craven
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, United States of America
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, United States of America
|
37
|
Wang Y, Zhao Y, Therneau TM, Atkinson EJ, Tafti AP, Zhang N, Amin S, Limper AH, Khosla S, Liu H. Unsupervised machine learning for the discovery of latent disease clusters and patient subgroups using electronic health records. J Biomed Inform 2020; 102:103364. [PMID: 31891765 PMCID: PMC7028517 DOI: 10.1016/j.jbi.2019.103364] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 05/14/2019] [Revised: 12/16/2019] [Accepted: 12/23/2019] [Indexed: 01/12/2023]
Abstract
Machine learning has become ubiquitous and a key technology for mining electronic health records (EHRs) to facilitate clinical research and practice. Unsupervised machine learning, as opposed to supervised learning, has shown promise in identifying novel patterns and relations from EHRs without using human-created labels. In this paper, we investigate the application of unsupervised machine learning models in discovering latent disease clusters and patient subgroups based on EHRs. We utilized Latent Dirichlet Allocation (LDA), a generative probabilistic model, and proposed a novel model named Poisson Dirichlet Model (PDM), which extends the LDA approach using a Poisson distribution to model patients' disease diagnoses and to alleviate age and sex factors by considering both observed and expected observations. In the empirical experiments, we evaluated LDA and PDM on three patient cohorts, namely Osteoporosis, Delirium/Dementia, and Chronic Obstructive Pulmonary Disease (COPD)/Bronchiectasis Cohorts, with their EHR data retrieved from the Rochester Epidemiology Project (REP) medical records linkage system, for the discovery of latent disease clusters and patient subgroups. We compared the effectiveness of LDA and PDM in identifying disease clusters through the visualization of disease representations. We tested the performance of LDA and PDM in differentiating patient subgroups through survival analysis, as well as statistical analysis of demographics and Elixhauser Comorbidity Index (ECI) scores in those subgroups. The experimental results show that the proposed PDM could effectively identify distinct disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into differentiable subgroups with larger p-values than PDM. However, those subgroups identified by LDA are highly associated with patients' age and sex. The subgroups discovered by PDM might imply the underlying patterns of diseases of greater interest in epidemiology research due to the alleviation of age and sex. Both unsupervised machine learning approaches could be leveraged to discover patient subgroups using EHRs but with different foci.
Affiliation(s)
- Yanshan Wang
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
- Yiqing Zhao
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Terry M Therneau
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Elizabeth J Atkinson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Ahmad P Tafti
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Nan Zhang
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Shreyasee Amin
- Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, USA
- Andrew H Limper
- Division of Pulmonary and Critical Care Medicine, Department of Internal Medicine, Mayo Clinic, Rochester, MN, USA
- Sundeep Khosla
- Division of Endocrinology and Kogod Center on Aging, Department of Internal Medicine, Mayo Clinic, Rochester, MN, USA
- Hongfang Liu
- Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
|
38
|
Xu Z, Chou J, Zhang XS, Luo Y, Isakova T, Adekkanattu P, Ancker JS, Jiang G, Kiefer RC, Pacheco JA, Rasmussen LV, Pathak J, Wang F. Identifying sub-phenotypes of acute kidney injury using structured and unstructured electronic health record data with memory networks. J Biomed Inform 2020; 102:103361. [PMID: 31911172 DOI: 10.1016/j.jbi.2019.103361] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 04/02/2019] [Revised: 11/18/2019] [Accepted: 12/16/2019] [Indexed: 01/08/2023]
Abstract
Acute Kidney Injury (AKI) is a common clinical syndrome characterized by the rapid loss of kidney excretory function, which aggravates the clinical severity of other diseases in a large number of hospitalized patients. Accurate early prediction of AKI can enable timely interventions and treatments. However, AKI is highly heterogeneous; identifying AKI sub-phenotypes can therefore lead to an improved understanding of the disease pathophysiology and the development of more targeted clinical interventions. This study used a memory network-based deep learning approach to discover AKI sub-phenotypes using structured and unstructured electronic health record (EHR) data of patients before AKI diagnosis. We leveraged a real-world critical care EHR corpus including 37,486 ICU stays. Our approach identified three distinct sub-phenotypes. Sub-phenotype I has an average age of 63.03±17.25 years and is characterized by mild loss of kidney excretory function (serum creatinine (SCr) 1.55±0.34 mg/dL, estimated glomerular filtration rate (eGFR) 107.65±54.98 mL/min/1.73 m²). These patients are more likely to develop stage I AKI. Sub-phenotype II has an average age of 66.81±10.43 years and is characterized by severe loss of kidney excretory function (SCr 1.96±0.49 mg/dL, eGFR 82.19±55.92 mL/min/1.73 m²). These patients are more likely to develop stage III AKI. Sub-phenotype III has an average age of 65.07±11.32 years and is characterized by moderate loss of kidney excretory function (SCr 1.69±0.32 mg/dL, eGFR 93.97±56.53 mL/min/1.73 m²); these patients are thus more likely to develop stage II AKI. Both SCr and eGFR differ significantly across the three sub-phenotypes under statistical testing with post hoc analysis, and the conclusion still holds after age adjustment.
Affiliation(s)
- Yuan Luo
- Northwestern University, Chicago, IL, USA
- Fei Wang
- Weill Cornell Medicine, New York, NY, USA.
|
39
|
Zhang L, Zhang Y, Cai T, Ahuja Y, He Z, Ho YL, Beam A, Cho K, Carroll R, Denny J, Kohane I, Liao K, Cai T. Automated grouping of medical codes via multiview banded spectral clustering. J Biomed Inform 2019; 100:103322. [PMID: 31672532 PMCID: PMC7261410 DOI: 10.1016/j.jbi.2019.103322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/01/2019] [Revised: 10/25/2019] [Accepted: 10/27/2019] [Indexed: 01/28/2023]
Abstract
OBJECTIVE With their increasingly widespread adoption, electronic health records (EHRs) have enabled phenotypic information extraction at an unprecedented granularity and scale. However, a medical concept (e.g. diagnosis, prescription, symptom) is often described by various synonyms across different EHR systems, hindering data integration for signal enhancement and complicating dimensionality reduction for knowledge discovery. Despite existing ontologies and hierarchies, tremendous human effort is needed for curation and maintenance - a process that is both unscalable and susceptible to subjective biases. This paper aims to develop a data-driven approach to automate grouping medical terms into clinically relevant concepts by combining multiple up-to-date data sources in an unbiased manner. METHODS We present a novel data-driven grouping approach - multi-view banded spectral clustering (mvBSC) combining summary data from multiple healthcare systems. The proposed method consists of a banding step that leverages the prior knowledge from the existing coding hierarchy, and a combining step that performs spectral clustering on an optimally weighted matrix. RESULTS We apply the proposed method to group ICD-9 and ICD-10-CM codes together by integrating data from two healthcare systems. We show grouping results and hierarchies for 13 representative disease categories. Individual grouping qualities were evaluated using normalized mutual information, adjusted Rand index, and F1-measure, and were found to consistently exhibit great similarity to the existing manual grouping counterpart. The resulting ICD groupings also enjoy comparable interpretability and are well aligned with the current ICD hierarchy. CONCLUSION The proposed approach, by systematically leveraging multiple data sources, is able to overcome bias while maximizing consensus to achieve generalizability. It has the advantage of being efficient, scalable, and adaptive to the evolving human knowledge reflected in the data, showing a significant step toward automating medical knowledge integration.
Affiliation(s)
- Luwan Zhang
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Yichi Zhang
- Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI, USA
- Tianrun Cai
- Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Yuri Ahuja
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Zeling He
- Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Yuk-Lam Ho
- Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Andrew Beam
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Kelly Cho
- Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Division of Aging, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA
- Robert Carroll
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Joshua Denny
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Isaac Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Katherine Liao
- Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
40
|
Zhao J, Zhang Y, Schlueter DJ, Wu P, Eric Kerchberger V, Trent Rosenbloom S, Wells QS, Feng Q, Denny JC, Wei WQ. Detecting time-evolving phenotypic topics via tensor factorization on electronic health records: Cardiovascular disease case study. J Biomed Inform 2019; 98:103270. [PMID: 31445983 PMCID: PMC6783385 DOI: 10.1016/j.jbi.2019.103270] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/01/2019] [Revised: 07/10/2019] [Accepted: 08/16/2019] [Indexed: 12/12/2022]
Abstract
OBJECTIVE Discovering subphenotypes of complex diseases can help characterize disease cohorts for investigative studies aimed at developing better diagnoses and treatments. Recent advances in unsupervised machine learning on electronic health record (EHR) data have enabled researchers to discover phenotypes without input from domain experts. However, most existing studies have ignored time and modeled diseases as discrete events. Uncovering the evolution of phenotypes - how they emerge, evolve and contribute to health outcomes - is essential to define more precise phenotypes and refine the understanding of disease progression. Our objective was to assess the benefits of an unsupervised approach that incorporates time to model diseases as dynamic processes in phenotype discovery. METHODS In this study, we applied a constrained non-negative tensor-factorization approach to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data. Through tensor-factorization, we identified a set of phenotypic topics (i.e., subphenotypes) that these patients established over the 10 years prior to the diagnosis of CVD, and showed their progression pattern. For each identified subphenotype, we examined its association with the risk for adverse cardiovascular outcomes estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations, a conventional CVD-risk assessment tool frequently used in clinical practice. Furthermore, we compared the subsequent myocardial infarction (MI) rates among the six most prevalent subphenotypes using survival analysis. RESULTS From a cohort of 12,380 adult individuals with CVD and 1068 unique PheCodes, we successfully identified 14 subphenotypes. Through the association analysis with estimated CVD risk for each subtype, we found that some phenotypic topics, such as vitamin D deficiency and depression, and urinary infections, cannot be explained by the conventional risk factors. Through a survival analysis, we found markedly different risks of subsequent MI following the diagnosis of CVD among the six most prevalent topics (p < 0.0001), indicating these topics may capture clinically meaningful subphenotypes of CVD. CONCLUSION This study demonstrates the potential benefits of using tensor-decomposition to model diseases as dynamic processes from longitudinal EHR data. Our results suggest that this data-driven approach may help researchers identify complex and chronic disease subphenotypes in precision medicine research.
Affiliation(s)
- Juan Zhao
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Yun Zhang
- Fixed Income Division, Morgan Stanley & Co LLC, New York, NY, USA
- David J Schlueter
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Patrick Wu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Medical Scientist Training Program, Vanderbilt University School of Medicine, Nashville, TN, USA
- Vern Eric Kerchberger
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Division of Allergy, Pulmonary, and Critical Care Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- S Trent Rosenbloom
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Quinn S Wells
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- QiPing Feng
- Division of Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, TN, USA
- Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Wei-Qi Wei
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
|
41
|
Albers DJ, Levine ME, Mamykina L, Hripcsak G. The parameter Houlihan: A solution to high-throughput identifiability indeterminacy for brutally ill-posed problems. Math Biosci 2019; 316:108242. [PMID: 31454628 PMCID: PMC6759390 DOI: 10.1016/j.mbs.2019.108242] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/05/2019] [Revised: 08/20/2019] [Accepted: 08/22/2019] [Indexed: 12/21/2022]
Abstract
One way to interject knowledge into clinically impactful forecasting is to use data assimilation, a nonlinear regression that projects data onto a mechanistic physiologic model, instead of a set of functions, such as neural networks. Such regressions have an advantage of being useful with particularly sparse, non-stationary clinical data. However, physiological models are often nonlinear and can have many parameters, leading to potential problems with parameter identifiability, or the ability to find a unique set of parameters that minimize forecasting error. The identifiability problems can be minimized or eliminated by reducing the number of parameters estimated, but reducing the number of estimated parameters also reduces the flexibility of the model and hence increases forecasting error. We propose a method, the parameter Houlihan, that combines traditional machine learning techniques with data assimilation, to select the right set of model parameters to minimize forecasting error while reducing identifiability problems. The method worked well: the data assimilation-based glucose forecasts and estimates for our cohort using the Houlihan-selected parameter sets generally also minimize forecasting errors compared to other parameter selection methods such as by-hand parameter selection. Nevertheless, the forecast with the lowest forecast error does not always accurately represent physiology, but further advancements of the algorithm provide a path for improving physiologic fidelity as well. Our hope is that this methodology represents a first step toward combining machine learning with data assimilation and provides a lower-threshold entry point for using data assimilation with clinical data by helping select the right parameters to estimate.
Affiliation(s)
- David J Albers
- Department of Biomedical Informatics, Columbia University, 622 West 168th Street, PH-20, New York, NY, USA; Department of Pediatrics, Division of Informatics, University of Colorado Medicine, Mail: F443, 13199 E. Montview Blvd. Ste: 210-12 | Aurora, CO 80045 USA.
- Matthew E Levine
- Department of Computational and Mathematical sciences, California Institute of Technology, 1200 E California Blvd M/C 305-16 Pasadena, CA 91125 USA
- Lena Mamykina
- Department of Biomedical Informatics, Columbia University, 622 West 168th Street, PH-20, New York, NY, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University, 622 West 168th Street, PH-20, New York, NY, USA
|
42
|
Rodriguez VA, Perotte A. Phenotype Inference with Semi-Supervised Mixed Membership Models. Proceedings of Machine Learning Research 2019; 106:304-324. [PMID: 32490377 PMCID: PMC7266114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Disease phenotyping algorithms are designed to sift through clinical data stores to identify patients with specific diseases. Supervised phenotyping methods require significant quantities of expert-labeled data, while unsupervised methods may learn spurious or non-disease phenotypes. To address these limitations, we propose the Semi-Supervised Mixed Membership Model (SS3M) - a probabilistic graphical model for learning disease phenotypes from partially labeled clinical data. We show SS3M can generate interpretable, disease-specific phenotypes which capture the clinical features of the disease concepts specified by the labels provided to the model. Furthermore, SS3M phenotypes demonstrate competitive predictive performance relative to commonly used baselines.
Affiliation(s)
- Victor A Rodriguez
- Columbia University, Department of Biomedical Informatics, New York City, NY, USA
- Adler Perotte
- Columbia University, Department of Biomedical Informatics, New York City, NY, USA
|
43
|
Zhang Y, Nie A, Zehnder A, Page RL, Zou J. VetTag: improving automated veterinary diagnosis coding via large-scale language modeling. NPJ Digit Med 2019; 2:35. [PMID: 31304381 PMCID: PMC6550141 DOI: 10.1038/s41746-019-0113-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 01/25/2019] [Accepted: 04/17/2019] [Indexed: 02/08/2023]
Abstract
Unlike human medical records, most of the veterinary records are free text without standard diagnosis coding. The lack of systematic coding is a major barrier to the growing interest in leveraging veterinary records for public health and translational research. Recent machine learning effort is limited to predicting 42 top-level diagnosis categories from veterinary notes. Here we develop a large-scale algorithm to automatically predict all 4577 standard veterinary diagnosis codes from free text. We train our algorithm on a curated dataset of over 100 K expert labeled veterinary notes and over one million unlabeled notes. Our algorithm is based on the adapted Transformer architecture and we demonstrate that large-scale language modeling on the unlabeled notes via pretraining and as an auxiliary objective during supervised learning greatly improves performance. We systematically evaluate the performance of the model and several baselines in challenging settings where algorithms trained on one hospital are evaluated in a different hospital with substantial domain shift. In addition, we show that hierarchical training can address severe data imbalances for fine-grained diagnosis with a few training cases, and we provide interpretation for what is learned by the deep network. Our algorithm addresses an important challenge in veterinary medicine, and our model and experiments add insights into the power of unsupervised learning for clinical natural language processing.
Affiliation(s)
- Yuhui Zhang
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Allen Nie
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Ashley Zehnder
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Rodney L. Page
- Department of Clinical Sciences, Colorado State University, Fort Collins, CO 80523 USA
- James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Chan-Zuckerberg Biohub, San Francisco, CA 94158 USA
|
44
|
Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med (Lausanne) 2019; 6:66. [PMID: 31058150 PMCID: PMC6478793 DOI: 10.3389/fmed.2019.00066] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 11/27/2018] [Accepted: 03/18/2019] [Indexed: 01/01/2023]
Abstract
Problem: Clinical practice requires the production of a great amount of notes, which is time- and resource-consuming. These notes contain relevant information, but their secondary use is almost impossible due to their unstructured nature. Researchers are trying to address this problem with both traditional and promising novel techniques. Application in real hospital settings does not yet seem possible, though, both because of relatively small and dirty datasets and because of the lack of language-specific pre-trained models. Aim: Our aim is to demonstrate the potential of the above techniques, but also to raise awareness of the still open challenges that the scientific communities of IT and medical practitioners must jointly address to realize the full potential of the unstructured content that is produced and digitized daily in hospital settings, both to improve its data quality and to leverage the insights from data-driven predictive models. Methods: To this end, we present a narrative literature review of the most recent and relevant contributions on the application of Natural Language Processing techniques to the free-text content of electronic patient records. In particular, we focused on four selected application domains, namely: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. Then, we present a few empirical studies that we undertook at a major teaching hospital specializing in musculoskeletal diseases. Results: We provide the reader with some simple and affordable pipelines, which demonstrate the feasibility of reaching literature performance levels with a single-institution non-English dataset. In this way, we bridge the literature and real-world needs, taking a further step toward the revival of notes fields.
Affiliation(s)
- Michela Assale
  - K-tree SRL, Pont-Saint-Martin, Italy
  - University of Milano-Bicocca, Milan, Italy
- Linda Greta Dui
  - Politecnico di Milano, Milan, Italy
  - Link-Up Datareg, Cinisello Balsamo, Italy
- Andrea Cina
  - K-tree SRL, Pont-Saint-Martin, Italy
  - University of Milano-Bicocca, Milan, Italy
- Andrea Seveso
  - University of Milano-Bicocca, Milan, Italy
  - Link-Up Datareg, Cinisello Balsamo, Italy
- Federico Cabitza
  - University of Milano-Bicocca, Milan, Italy
  - IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
|
45
|
Perros I, Papalexakis EE, Vuduc R, Searles E, Sun J. Temporal phenotyping of medically complex children via PARAFAC2 tensor factorization. J Biomed Inform 2019; 93:103125. [PMID: 30743070 DOI: 10.1016/j.jbi.2019.103125] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 09/21/2018] [Revised: 01/17/2019] [Accepted: 01/29/2019] [Indexed: 10/27/2022]
Abstract
OBJECTIVE Our aim is to extract clinically meaningful phenotypes from longitudinal electronic health records (EHRs) of medically complex children. This is a fragile set of patients who consume a disproportionate amount of pediatric care resources but often end up with sub-optimal clinical outcomes. The rise in available EHRs provides a rich data source that can be used to disentangle their complex clinical conditions into concise, clinically meaningful groups of characteristics. We aim to identify those phenotypes and their temporal evolution in a scalable, computational manner that avoids time-consuming manual chart review. MATERIALS AND METHODS We analyze longitudinal EHRs from Children's Healthcare of Atlanta, including 1045 medically complex patients with a total of 59,948 encounters over 2 years. We apply a tensor factorization method called PARAFAC2 to extract: (a) clinically meaningful groups of features, (b) concise patient representations indicating the presence of a phenotype for each patient, and (c) temporal signatures indicating the evolution of those phenotypes over time for each patient. RESULTS We identified four medically complex phenotypes, namely gastrointestinal disorders, oncological conditions, blood-related disorders, and neurological system disorders, which have distinct clinical characterizations among patients. We demonstrate the utility of the patient representations produced by PARAFAC2 for identifying groups of patients with significant survival variations. Finally, we showcase representative examples of the temporal phenotypic trends extracted for different patients. DISCUSSION Unsupervised temporal phenotyping is an important task because it minimizes the burden on clinical experts by relegating their involvement to the validation of the output phenotypes.
PARAFAC2 enjoys several compelling properties for temporal computational phenotyping: (a) it is able to handle high-dimensional data and variable numbers of encounters across patients, (b) it has an intuitive interpretation, and (c) it is free from ad-hoc parameter choices. Computational phenotypes, such as the ones computed by our approach, have multiple applications; we highlight three that are particularly useful for medically complex children: (1) integration into clinical decision support systems, (2) interpretable mortality prediction, and (3) clinical trial recruitment. CONCLUSION PARAFAC2 can be applied to unsupervised temporal phenotyping tasks where precise definitions of different phenotypes are absent and the lengths of patient records vary.
Affiliation(s)
- Jimeng Sun
  - Georgia Institute of Technology, United States.
|
46
|
Bai T, Chanda AK, Egleston BL, Vucetic S. EHR phenotyping via jointly embedding medical concepts and words into a unified vector space. BMC Med Inform Decis Mak 2018; 18:123. [PMID: 30537974 PMCID: PMC6290514 DOI: 10.1186/s12911-018-0672-0] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Indexed: 11/17/2022]
Abstract
Background: There has been increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about patterns of care and health outcomes. EHRs contain structured data, such as diagnostic codes and laboratory tests, as well as unstructured free-text data in the form of clinical notes, which provide more detail about the condition and treatment of patients. Methods: In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns the relationships between words and medical codes by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code. Results: In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and by a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit. Conclusions: The jointly learned representations of medical concepts and words capture not only similarity between codes or between words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for constructing patient features.
Affiliation(s)
- Tian Bai
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
- Ashis Kumar Chanda
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
- Brian L Egleston
  - Fox Chase Cancer Center, Temple University, Philadelphia, PA, USA
- Slobodan Vucetic
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
|
47
|
Schmider J, Kumar K, LaForest C, Swankoski B, Naim K, Caubel PM. Innovation in Pharmacovigilance: Use of Artificial Intelligence in Adverse Event Case Processing. Clin Pharmacol Ther 2018; 105:954-961. [PMID: 30303528 PMCID: PMC6590385 DOI: 10.1002/cpt.1255] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 07/27/2018] [Accepted: 09/25/2018] [Indexed: 12/03/2022]
Abstract
Automation of pharmaceutical safety case processing represents a significant opportunity to affect the strongest cost driver for a company's overall pharmacovigilance budget. A pilot was undertaken to test the feasibility of using artificial intelligence and robotic process automation to automate processing of adverse event reports. The pilot paradigm was used to simultaneously test proposed solutions of three commercial vendors. The result confirmed the feasibility of using artificial intelligence–based technology to support extraction from adverse event source documents and evaluation of case validity. In addition, the pilot demonstrated viability of the use of safety database data fields as a surrogate for otherwise time‐consuming and costly direct annotation of source documents. Finally, the evaluation and scoring method used in the pilot was able to differentiate vendor capabilities and identify the best candidate to move into the discovery phase.
Affiliation(s)
- Krishan Kumar
  - Pfizer Business Technology, Artificial Intelligence Center of Excellence, La Jolla, California, USA
- Chantal LaForest
  - Pfizer Global Product Development, Safety Solutions, Kirkland, Quebec, Ontario, Canada
- Brian Swankoski
  - Pfizer Finance and Business Operations, Peapack, New Jersey, USA
- Karen Naim
  - Pfizer R&D, Collegeville, Pennsylvania, USA
|
48
|
Nie A, Zehnder A, Page RL, Zhang Y, Pineda AL, Rivas MA, Bustamante CD, Zou J. DeepTag: inferring diagnoses from veterinary clinical notes. NPJ Digit Med 2018; 1:60. [PMID: 31304339 PMCID: PMC6550285 DOI: 10.1038/s41746-018-0067-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 06/27/2018] [Revised: 10/08/2018] [Accepted: 10/10/2018] [Indexed: 12/13/2022]
Abstract
Large-scale veterinary clinical records can become a powerful resource for patient care and research. However, clinicians lack the time and resources to annotate patient records with standard medical diagnostic codes, and most veterinary visits are captured in free-text notes. The lack of standard coding makes it challenging to use the clinical data to improve patient care. It is also a major impediment to cross-species translational research, which relies on the ability to accurately identify patient cohorts with specific diagnostic criteria in humans and animals. To reduce the coding burden for veterinary clinical practice and aid translational research, we have developed a deep learning algorithm, DeepTag, which automatically infers diagnostic codes from veterinary free-text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends a multitask LSTM with an improved hierarchical objective that captures the semantic structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain on examples when it is uncertain and defers them to human experts, resulting in improved performance. DeepTag accurately infers disease codes from free text even in challenging cross-hospital settings where the text comes from different clinical settings than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal preprocessing. The technical framework in this work can be applied in other medical domains that currently lack medical coding resources.
Affiliation(s)
- Allen Nie
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Ashley Zehnder
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Rodney L. Page
  - Department of Clinical Sciences, Colorado State University, Fort Collins, CO 80523 USA
- Yuhui Zhang
  - Department of Computer Science and Technology, Tsinghua University, Beijing, China
- Arturo Lopez Pineda
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Manuel A. Rivas
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
- Carlos D. Bustamante
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
  - Chan-Zuckerberg Biohub, San Francisco, CA 94158 USA
- James Zou
  - Department of Biomedical Data Science, Stanford University, Stanford, CA 94305 USA
  - Chan-Zuckerberg Biohub, San Francisco, CA 94158 USA
|
49
|
Huang Z, Ge Z, Dong W, He K, Duan H, Bath P. Relational regularized risk prediction of acute coronary syndrome using electronic health records. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022]
|
50
|
Levy-Fix G, Gorman SL, Sepulveda JL, Elhadad N. When to re-order laboratory tests? Learning laboratory test shelf-life. J Biomed Inform 2018; 85:21-29. [PMID: 30036675 PMCID: PMC11073806 DOI: 10.1016/j.jbi.2018.07.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/23/2018] [Revised: 06/15/2018] [Accepted: 07/19/2018] [Indexed: 10/28/2022]
Abstract
Most laboratory results are valid for only a certain time period (the laboratory test's shelf-life), after which they are outdated and the test needs to be re-administered. Currently, laboratory test shelf-lives are not centrally available anywhere; they exist only as the implicit knowledge of doctors. In this work we propose an automated method to learn laboratory test-specific shelf-lives by identifying prevalent laboratory test order patterns in electronic health records. The resulting shelf-lives performed well in evaluations of internal validity, clinical interpretability, and external validity.
Affiliation(s)
- Gal Levy-Fix
  - Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA
- Sharon Lipsky Gorman
  - Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA
- Jorge L Sepulveda
  - Department of Pathology and Cell Biology, Columbia University, 630 W. 168th Street, New York, NY, USA
- Noémie Elhadad
  - Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA
|