1
O'Neil ST, Madlock-Brown C, Wilkins KJ, McGrath BM, Davis HE, Assaf GS, Wei H, Zareie P, French ET, Loomba J, McMurry JA, Zhou A, Chute CG, Moffitt RA, Pfaff ER, Yoo YJ, Leese P, Chew RF, Lieberman M, Haendel MA. Finding Long-COVID: temporal topic modeling of electronic health records from the N3C and RECOVER programs. NPJ Digit Med 2024; 7:296. [PMID: 39433942 PMCID: PMC11494196 DOI: 10.1038/s41746-024-01286-3] [Received: 10/04/2023] [Accepted: 10/07/2024] [Indexed: 10/23/2024] Open
Abstract
Post-Acute Sequelae of SARS-CoV-2 infection (PASC), also known as Long-COVID, encompasses a variety of complex and varied outcomes following COVID-19 infection that are still poorly understood. We clustered over 600 million condition diagnoses from 14 million patients available through the National COVID Cohort Collaborative (N3C), generating hundreds of highly detailed clinical phenotypes. Assessing patient clinical trajectories using these clusters allowed us to identify individual conditions and phenotypes strongly increased after acute infection. We found many conditions increased in COVID-19 patients compared to controls, and using a novel method to associate patients with clusters over time, we additionally found phenotypes specific to patient sex, age, wave of infection, and PASC diagnosis status. While many of these results reflect known PASC symptoms, the resolution provided by this unprecedented data scale suggests avenues for improved diagnostics and mechanistic understanding of this multifaceted disease.
Affiliation(s)
- Shawn T O'Neil
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
- Charisse Madlock-Brown
  - Health Informatics and Information Management Program, University of Tennessee Health Science Center, Memphis, TN, USA
- Kenneth J Wilkins
  - Biostatistics Program, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Hannah E Davis
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Gina S Assaf
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Hannah Wei
  - Patient-Led Research Collaborative (PLRC), Washington, DC, USA
- Parya Zareie
  - University of California Davis Health, Davis, CA, USA
- Evan T French
  - Wright Center for Clinical and Translational Research, Virginia Commonwealth University, Richmond, VA, USA
- Johanna Loomba
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Julie A McMurry
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
- Andrea Zhou
  - The Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA, USA
- Christopher G Chute
  - Schools of Medicine, Public Health and Nursing, Johns Hopkins University, Baltimore, MD, USA
- Richard A Moffitt
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Emily R Pfaff
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Yun Jae Yoo
  - Department of Hematology and Medical Oncology, Emory University, Atlanta, GA, USA
- Peter Leese
  - NC TraCS Institute, UNC School of Medicine, Chapel Hill, NC, USA
- Robert F Chew
  - Center for Data Science and AI, RTI International, Research Triangle Park, Durham, NC, USA
- Michael Lieberman
  - OCHIN, Inc, Portland, OR, USA
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA
- Melissa A Haendel
  - Department of Genetics, UNC School of Medicine, Chapel Hill, NC, USA
2
Rajamaki B, Braithwaite B, Hartikainen S, Tolppanen AM. Identifying Comorbidity Patterns in People with and without Alzheimer's Disease Using Latent Dirichlet Allocation. J Alzheimers Dis 2024; 101:1393-1403. [PMID: 39302369 PMCID: PMC11492117 DOI: 10.3233/jad-240490] [Accepted: 07/31/2024] [Indexed: 09/22/2024]
Abstract
Background: Multimorbidity is common in older adults and complicates diagnosing and care for this population. Objective: We investigated co-occurrence patterns (clustering) of medical conditions in persons with Alzheimer's disease (AD) and their matched controls. Methods: The register-based Medication use and Alzheimer's disease study (MEDALZ) includes 70,718 community-dwelling persons with incident AD diagnosed during 2005-2011 in Finland and a matched comparison cohort. Latent Dirichlet Allocation was used to cluster the comorbidities (ICD-10 diagnosis codes). Modeling was performed separately for AD and control cohorts. We experimented with different numbers of clusters (also known as topics in the field of Natural Language Processing) ranging from five to 20. Results: In both cohorts, 17 of the 20 most frequent diagnoses were the same. Based on a qualitative assessment by medical experts, the cluster patterns were not affected by the number of clusters, but the best interpretability was observed in the 10-cluster model. Quantitative assessment of the optimal number of clusters by log-likelihood estimate did not imply a specific optimal number of clusters. Multidimensional scaling visualized the variability in cluster size and (dis)similarity between the clusters with more overlapping of clusters and variation in group size seen in the AD cohort. Conclusions: Early signs and symptoms of AD were more commonly clustered together in the AD cohort than in the comparison cohort. This study experimented with using natural language processing techniques for clustering patterns from an epidemiological study. From the computed clusters, it was possible to qualitatively identify multimorbidity that differentiates AD cases and controls.
Affiliation(s)
- Blair Rajamaki
  - School of Pharmacy, University of Eastern Finland, Kuopio, Finland
  - Kuopio Research Centre of Geriatric Care, University of Eastern Finland, Kuopio, Finland
- Sirpa Hartikainen
  - School of Pharmacy, University of Eastern Finland, Kuopio, Finland
  - Kuopio Research Centre of Geriatric Care, University of Eastern Finland, Kuopio, Finland
- Anna-Maija Tolppanen
  - School of Pharmacy, University of Eastern Finland, Kuopio, Finland
  - Kuopio Research Centre of Geriatric Care, University of Eastern Finland, Kuopio, Finland
3
Ramon-Gonen R, Dori A, Shelly S. Towards a practical use of text mining approaches in electrodiagnostic data. Sci Rep 2023; 13:19483. [PMID: 37945618 PMCID: PMC10636146 DOI: 10.1038/s41598-023-45758-0] [Received: 04/20/2023] [Accepted: 10/23/2023] [Indexed: 11/12/2023] Open
Abstract
Healthcare professionals produce abounding textual data in their daily clinical practice. Text mining can yield valuable insights from unstructured data. Extracting insights from multiple information sources is a major challenge in computational medicine. In this study, our objective was to illustrate how combining text mining techniques with statistical methodologies can yield new insights and contribute to the development of neurological and neuromuscular-related health information. We demonstrate how to utilize and derive knowledge from medical text, identify patient groups with similar diagnostic attributes, and examine differences between groups using demographical data and past medical history (PMH). We conducted a retrospective study for all patients who underwent electrodiagnostic (EDX) evaluation in Israel's Sheba Medical Center between May 2016 and February 2022. The data extracted for each patient included demographic data, test results, and unstructured summary reports. We conducted several analyses, including topic modeling that targeted clinical impressions and topic analysis to reveal age- and sex-related differences. The use of suspected clinical condition text enriched the data and generated additional attributes used to find associations between patients' PMH and the emerging diagnosis topics. We identified 6096 abnormal EMG results, of which 58% (n = 3512) were males. Based on the latent Dirichlet allocation algorithm we identified 25 topics that represent different diagnoses. Sex-related differences emerged in 7 topics, 3 male-associated and 4 female-associated. Brachial plexopathy, myasthenia gravis, and NMJ Disorders showed statistically significant age and sex differences. We extracted keywords related to past medical history (n = 37) and tested them for association with the different topics. 
Several topics revealed a close association with past medical history, for example, length-dependent symmetric axonal polyneuropathy with diabetes mellitus (DM), length-dependent sensory polyneuropathy with chemotherapy treatments and DM, brachial plexopathy with motor vehicle accidents, myasthenia gravis and NMJ disorders with botulin treatments, and amyotrophic lateral sclerosis with swallowing difficulty. Summarizing visualizations were created to easily grasp the results and facilitate focusing on the main insights. In this study, we demonstrate the efficacy of utilizing advanced computational methods in a corpus of textual data to accelerate clinical research. Additionally, using these methods allows for generating clinical insights, which may aid in the development of a decision-making process in real-life clinical practice.
Affiliation(s)
- Roni Ramon-Gonen
  - The Graduate School of Business Administration, Bar-Ilan University, Ramat Gan, Israel
- Amir Dori
  - Department of Neurology, Sheba Medical Center, Tel HaShomer, Israel
  - Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Shahar Shelly
  - Department of Neurology, Rambam Health Care Campus, Haifa, Israel
  - Neuroimmunology Laboratory, The Ruth & Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel
  - Department of Neurology, Mayo Clinic, Rochester, MN, USA
4
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022] Open
Abstract
OBJECTIVE: Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS: We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS: Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION: Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION: Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
  - Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
  - Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
5
Weber C, Röschke L, Modersohn L, Lohr C, Kolditz T, Hahn U, Ammon D, Betz B, Kiehntopf M. Optimized Identification of Advanced Chronic Kidney Disease and Absence of Kidney Disease by Combining Different Electronic Health Data Resources and by Applying Machine Learning Strategies. J Clin Med 2020; 9:2955. [PMID: 32932685 PMCID: PMC7563476 DOI: 10.3390/jcm9092955] [Received: 06/25/2020] [Revised: 08/26/2020] [Accepted: 08/28/2020] [Indexed: 12/31/2022] Open
Abstract
Automated identification of advanced chronic kidney disease (CKD ≥ III) and of no known kidney disease (NKD) can support both clinicians and researchers. We hypothesized that identification of CKD and NKD can be improved, by combining information from different electronic health record (EHR) resources, comprising laboratory values, discharge summaries and ICD-10 billing codes, compared to using each component alone. We included EHRs from 785 elderly multimorbid patients, hospitalized between 2010 and 2015, that were divided into a training and a test (n = 156) dataset. We used both the area under the receiver operating characteristic (AUROC) and under the precision-recall curve (AUCPR) with a 95% confidence interval for evaluation of different classification models. In the test dataset, the combination of EHR components as a simple classifier identified CKD ≥ III (AUROC 0.96[0.93-0.98]) and NKD (AUROC 0.94[0.91-0.97]) better than laboratory values (AUROC CKD 0.85[0.79-0.90], NKD 0.91[0.87-0.94]), discharge summaries (AUROC CKD 0.87[0.82-0.92], NKD 0.84[0.79-0.89]) or ICD-10 billing codes (AUROC CKD 0.85[0.80-0.91], NKD 0.77[0.72-0.83]) alone. Logistic regression and machine learning models improved recognition of CKD ≥ III compared to the simple classifier if only laboratory values were used (AUROC 0.96[0.92-0.99] vs. 0.86[0.81-0.91], p < 0.05) and improved recognition of NKD if information from previous hospital stays was used (AUROC 0.99[0.98-1.00] vs. 0.95[0.92-0.97], p < 0.05). Depending on the availability of data, correct automated identification of CKD ≥ III and NKD from EHRs can be improved by generating classification models based on the combination of different EHR components.
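The AUROC reported in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way and adds a percentile-bootstrap 95% interval. The abstract does not say how its intervals were obtained, so the bootstrap, the function names, and the toy labels and scores are all assumptions made for illustration.

```python
import random


def auroc(labels, scores):
    """Probability that a positive outranks a negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed CI method)."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (invented): 1 = CKD >= III, 0 = otherwise; scores from some classifier.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.5, 0.2, 0.1, 0.6, 0.65]

point = auroc(labels, scores)
low, high = bootstrap_ci(labels, scores)
print(f"AUROC {point:.2f} [{low:.2f}-{high:.2f}]")
```

With a test set as small as this toy one the interval is wide, which is one reason interval estimates are reported alongside the point value.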
Affiliation(s)
- Christoph Weber
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
- Lena Röschke
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
- Luise Modersohn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Christina Lohr
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Tobias Kolditz
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Udo Hahn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Danny Ammon
  - Data Integration Center, Jena University Hospital, 07743 Jena, Germany
- Boris Betz
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
  - Correspondence: Tel.: +49-3641-9-325074 (B.B.); +49-3641-9-325001 (M.K.)
- Michael Kiehntopf
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
  - Correspondence: Tel.: +49-3641-9-325074 (B.B.); +49-3641-9-325001 (M.K.)
6
Roque C, Lourenço Cardoso J, Connell T, Schermers G, Weber R. Topic analysis of Road safety inspections using latent dirichlet allocation: A case study of roadside safety in Irish main roads. Accid Anal Prev 2019; 131:336-349. [PMID: 31377497 DOI: 10.1016/j.aap.2019.07.021] [Received: 04/24/2019] [Revised: 07/17/2019] [Accepted: 07/20/2019] [Indexed: 06/10/2023]
Abstract
Under the Safe System framework, Road Authorities have a responsibility to deliver inherently safe roads and streets. Addressing this problem depends on knowledge of the road network safety conditions and the amount of funds available for new road safety interventions. It also requires the prioritisation of the various interventions that may generate benefits, increasing safety, while ensuring that reasonable steps are taken to remedy the deficiencies detected within a reasonable timeframe. In this context, Road Safety Inspections (RSI) are a proactive tool for identifying safety issues, consisting of a regular, systematic, on-site inspection of existing roads, covering the whole road network, carried out by trained safety expert teams. This paper aims to describe how topic modelling can be effectively used to identify co-occurrence patterns of attributes related to the run-off-road crashes, as well as the corresponding patterns of road safety interventions, as described in the RSI reports. We apply latent Dirichlet allocation (LDA), a widespread method for fitting a topic model, to analyse the topics mentioned in RSI reports, divided into two groups: problems found; and proposed solutions. For this study, 54 RSI gathered over six years (2012-2017) were analysed, covering 4011 km of Irish roads. The results indicate that important keywords relating to the "forgiving roadside" and "clear zone" concepts, as well as the relevant European technical standards (CEN-EN1317 and EN 12767), are absent from the extracted latent topics. We also found that the frequency of topics related to roadside safety is higher in the problems record set than in the solutions record set, meaning that problems are more easily identified and related to the roadside area than interventions may be. 
This paper presents methodological empirical evidence that the LDA is appropriate for identifying the co-occurrence patterns of attributes related to the ROR crashes in road safety inspections' reports, as well as the interventions' patterns associated with these crashes. Also, it provides valuable information aimed to determine the extent to which national road authorities in Europe and their contractors are currently capable of implementing and maintaining compliance with roadside standards and guidelines throughout the life cycle of roads.
Affiliation(s)
- Carlos Roque
  - Laboratório Nacional de Engenharia Civil, Departamento de Transportes, Núcleo de Planeamento, Tráfego e Segurança, Av do Brasil 101, 1700-066 Lisboa, Portugal
- João Lourenço Cardoso
  - Laboratório Nacional de Engenharia Civil, Departamento de Transportes, Núcleo de Planeamento, Tráfego e Segurança, Av do Brasil 101, 1700-066 Lisboa, Portugal
- Govert Schermers
  - SWOV Institute for Road Safety Research, Bezuidenhoutseweg 62, 2509 AC The Hague, the Netherlands
7
Ta CN, Dumontier M, Hripcsak G, Tatonetti NP, Weng C. Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 2018; 5:180273. [PMID: 30480666 PMCID: PMC6257042 DOI: 10.1038/sdata.2018.273] [Received: 07/09/2018] [Accepted: 10/16/2018] [Indexed: 12/11/2022] Open
Abstract
Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center's Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013-2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analysis of co-occurrence frequencies via relative frequency analysis and observed-expected frequency ratio are informative of associations between clinical concepts, useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application-programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.
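The two co-occurrence measures named in this abstract can be illustrated with a small sketch. It assumes commonly used definitions: relative frequency as the pair count divided by the count of the base concept, and the observed-expected ratio as the natural log of the observed pair count over the count expected if the two concepts were independent. The function names, the toy counts, and the `MIN_COUNT` constant are hypothetical, and this is not COHD's published code.

```python
import math

MIN_COUNT = 10  # the abstract states that concepts with count <= 10 are excluded


def relative_frequency(pair_count, base_count):
    """Share of patients with the base concept who also have the other concept."""
    return pair_count / base_count


def log_obs_exp_ratio(pair_count, count_a, count_b, n_patients):
    """ln(observed / expected), with expected = count_a * count_b / n_patients."""
    expected = count_a * count_b / n_patients
    return math.log(pair_count / expected)


def drop_rare(counts, min_count=MIN_COUNT):
    """Keep only concepts whose count is above the rare-concept threshold."""
    return {concept: c for concept, c in counts.items() if c > min_count}


# Toy counts (invented for illustration, not taken from COHD).
n_patients = 100_000
concept_counts = {"type 2 diabetes": 8_000, "metformin": 5_000, "rare condition": 7}
pair_count = 3_200  # patients with both type 2 diabetes and metformin

kept = drop_rare(concept_counts)
rel_freq = relative_frequency(pair_count, kept["type 2 diabetes"])
ratio = log_obs_exp_ratio(
    pair_count, kept["type 2 diabetes"], kept["metformin"], n_patients
)
print(sorted(kept), round(rel_freq, 3), round(ratio, 3))
```

A ratio above zero means the pair occurs more often than independence would predict. The Poisson randomization mentioned in the abstract is not shown here.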
Affiliation(s)
- Casey N. Ta
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Michel Dumontier
  - Institute of Data Science, Maastricht University, Maastricht, The Netherlands
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, NY, USA
- Nicholas P. Tatonetti
  - Department of Biomedical Informatics, Columbia University, NY, USA
  - Department of Systems Biology, Columbia University, NY, USA
  - Department of Medicine, Columbia University, NY, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, NY, USA