1
Beaney T, Jha S, Alaa A, Smith A, Clarke J, Woodcock T, Majeed A, Aylin P, Barahona M. Comparing natural language processing representations of coded disease sequences for prediction in electronic health records. J Am Med Inform Assoc 2024; 31:1451-1462. [PMID: 38719204 PMCID: PMC11187492 DOI: 10.1093/jamia/ocae091] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/05/2024] [Revised: 04/02/2024] [Accepted: 04/12/2024] [Indexed: 06/21/2024] Open
Abstract
OBJECTIVE Natural language processing (NLP) algorithms are increasingly being applied to obtain unsupervised representations of electronic health record (EHR) data, but their comparative performance at predicting clinical endpoints remains unclear. Our objective was to compare the performance of unsupervised representations of sequences of disease codes generated by bag-of-words versus sequence-based NLP algorithms at predicting clinically relevant outcomes. MATERIALS AND METHODS This cohort study used primary care EHRs from 6 286 233 people with Multiple Long-Term Conditions in England. For each patient, an unsupervised vector representation of their time-ordered sequences of diseases was generated using 2 input strategies (212 disease categories versus 9462 diagnostic codes) and different NLP algorithms (Latent Dirichlet Allocation, doc2vec, and 2 transformer models designed for EHRs). We also developed a transformer architecture, named EHR-BERT, incorporating sociodemographic information. We compared the performance of each of these representations (without fine-tuning) as inputs into a logistic classifier to predict 1-year mortality, healthcare use, and new disease diagnosis. RESULTS Patient representations generated by sequence-based algorithms performed consistently better than bag-of-words methods in predicting clinical endpoints, with the highest performance for EHR-BERT across all tasks, although the absolute improvement was small. Representations generated using disease categories performed similarly to those using diagnostic codes as inputs, suggesting models can equally manage smaller or larger vocabularies for prediction of these outcomes. DISCUSSION AND CONCLUSION Patient representations produced by sequence-based NLP algorithms from sequences of disease codes demonstrate improved predictive content for patient outcomes compared with representations generated by co-occurrence-based algorithms. This suggests transformer models may be useful for generating multi-purpose representations, even without fine-tuning.
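The distinction between bag-of-words and sequence-based inputs drawn in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and does not reproduce Latent Dirichlet Allocation, doc2vec, or any transformer; the four-item vocabulary, the two example patients, and the function names are all hypothetical. It only shows that a count vector discards the time ordering that a sequence model receives.

```python
from collections import Counter

# Hypothetical disease categories, for illustration only (the study used 212).
VOCAB = ["asthma", "depression", "diabetes", "hypertension"]

def bag_of_words(sequence, vocab=VOCAB):
    """Count vector over the vocabulary; the order of the sequence is discarded."""
    counts = Counter(sequence)
    return [counts[code] for code in vocab]

def as_token_ids(sequence, vocab=VOCAB):
    """Integer token IDs in time order, the kind of input a sequence model consumes."""
    index = {code: i for i, code in enumerate(vocab)}
    return [index[code] for code in sequence]

# Two hypothetical patients with the same diseases diagnosed in reverse order.
patient_a = ["hypertension", "diabetes", "depression"]
patient_b = ["depression", "diabetes", "hypertension"]

print(bag_of_words(patient_a) == bag_of_words(patient_b))  # → True (same counts)
print(as_token_ids(patient_a) == as_token_ids(patient_b))  # → False (order differs)
```

Either kind of representation could then be passed to a downstream classifier; the abstract reports that a logistic classifier was used for that step.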
Affiliation(s)
- Thomas Beaney
  - Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom
  - Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom
- Sneha Jha
  - Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom
- Asem Alaa
  - Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom
- Alexander Smith
  - Department of Epidemiology and Biostatistics, Imperial College London, London, W2 1PG, United Kingdom
- Jonathan Clarke
  - Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom
- Thomas Woodcock
  - Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom
- Azeem Majeed
  - Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom
- Paul Aylin
  - Department of Primary Care and Public Health, Imperial College London, London, W12 0BZ, United Kingdom
- Mauricio Barahona
  - Department of Mathematics, Centre for Mathematics of Precision Healthcare, Imperial College London, London, SW7 2AZ, United Kingdom
2
Jain H, Odat RM, Goyal A, Jain J, Dey D, Ahmed M, Wasir AS, Passey S, Gole S. Association between psoriasis and atrial fibrillation: A Systematic review and meta-analysis. Curr Probl Cardiol 2024; 49:102538. [PMID: 38521291 DOI: 10.1016/j.cpcardiol.2024.102538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Accepted: 03/20/2024] [Indexed: 03/25/2024]
Abstract
INTRODUCTION Psoriasis is a prevalent inflammatory skin condition characterized by erythematous plaques with scaling. Recent research has demonstrated an increased risk of cardiovascular diseases in patients with psoriasis; however, current evidence on atrial fibrillation (AF) risk in psoriasis is limited. MATERIALS AND METHODS A systematic literature search was performed on major bibliographic databases to retrieve studies that evaluated AF risk in patients with psoriasis. The DerSimonian and Laird random effects model was used to pool the hazard ratios (HR) with 95% confidence intervals (CI). Subgroup analysis was conducted by dividing the patients into mild and severe psoriasis groups. Publication bias was assessed by visual inspection and Egger's regression test. Statistical significance was set at p < 0.05. RESULTS Seven studies were included, with 10,974,668 participants (194,230 in the psoriasis group and 10,780,439 in the control group). Patients with psoriasis had a significantly higher risk of AF [Pooled HR: 1.28; 95% CI: 1.20, 1.36; p < 0.00001]. In subgroup analysis, patients with severe psoriasis [HR: 1.32; 95% CI: 1.23, 1.42; p < 0.00001] demonstrated a slightly higher risk of AF than the mild psoriasis group [HR: 1.21; 95% CI: 1.10, 1.33; p < 0.0001], although the difference between subgroups was not statistically significant (p = 0.17). Egger's regression test showed no statistically significant publication bias (p = 0.24). CONCLUSION Our analysis demonstrated that patients with psoriasis are at a significantly higher risk of AF and hence should be closely monitored for AF. Further large-scale and multicenter randomized trials are warranted to validate the robustness of our findings.
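The DerSimonian and Laird pooling named in the methods can be sketched as follows. This is a minimal illustration of the standard estimator, not the authors' analysis code: the three study results in the usage lines are hypothetical values, not data from the included studies, and the helper names are assumptions. Standard errors are recovered from the reported 95% CIs on the log scale, which assumes the intervals are symmetric in log HR.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Standard error of log(HR), recovered from a 95% CI (assumes symmetry on the log scale)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def dersimonian_laird(log_hrs, ses):
    """Pool log hazard ratios with the DerSimonian-Laird random-effects estimator."""
    w = [1.0 / se ** 2 for se in ses]                  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_hrs)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_hrs))  # Cochran's Q
    df = len(log_hrs) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance, truncated at 0
    w_star = [1.0 / (se ** 2 + tau2) for se in ses]   # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, log_hrs)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return pooled, se_pooled, tau2

# Hypothetical studies as (HR, lower 95% CI, upper 95% CI); NOT taken from the review.
studies = [(1.20, 1.10, 1.31), (1.35, 1.20, 1.52), (1.25, 1.05, 1.49)]
log_hrs = [math.log(hr) for hr, _, _ in studies]
ses = [se_from_ci(lo, hi) for _, lo, hi in studies]

pooled, se, tau2 = dersimonian_laird(log_hrs, ses)
print(f"Pooled HR: {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(pooled - 1.96 * se):.2f}, {math.exp(pooled + 1.96 * se):.2f}); "
      f"tau^2 = {tau2:.4f}")
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and the two pooled estimates coincide.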
Affiliation(s)
- Hritvik Jain
  - Department of Internal Medicine, All India Institute of Medical Sciences (AIIMS), Jodhpur, India
- Ramez M Odat
  - Department of Internal Medicine, Faculty of Medicine, Jordan University of Science and Technology, Irbid, Jordan
- Aman Goyal
  - Department of Internal Medicine, Seth GS Medical College and KEM Hospital, Mumbai, India
- Jyoti Jain
  - Department of Internal Medicine, All India Institute of Medical Sciences (AIIMS), Jodhpur, India
- Debankur Dey
  - Department of Internal Medicine, Medical College Kolkata, Kolkata, West Bengal, India
- Mushood Ahmed
  - Department of Internal Medicine, Rawalpindi Medical University, Rawalpindi, Pakistan
- Amanpreet Singh Wasir
  - Department of Internal Medicine, Bharati Vidyapeeth (Deemed to be) University Medical College, Pune, Maharashtra, India
- Siddhant Passey
  - Department of Internal Medicine, University of Connecticut Health Center, CT, USA
- Shrey Gole
  - Department of Immunology and Rheumatology, Stanford University, CA, USA
3
Beaney T, Clarke J, Salman D, Woodcock T, Majeed A, Aylin P, Barahona M. Identifying multi-resolution clusters of diseases in ten million patients with multimorbidity in primary care in England. COMMUNICATIONS MEDICINE 2024; 4:102. [PMID: 38811835 PMCID: PMC11137021 DOI: 10.1038/s43856-024-00529-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2023] [Accepted: 05/20/2024] [Indexed: 05/31/2024] Open
Abstract
BACKGROUND Identifying clusters of diseases may aid understanding of shared aetiology, management of co-morbidities, and the discovery of new disease associations. Our study aims to identify disease clusters using a large set of long-term conditions and to compare methods that use the co-occurrence of diseases with methods that use the sequence of disease development in a person over time. METHODS We use electronic health records from over ten million people with multimorbidity registered to primary care in England. First, we extract data-driven representations of 212 diseases from patient records employing (i) co-occurrence-based methods and (ii) sequence-based natural language processing methods. Second, we apply the graph-based Markov Multiscale Community Detection (MMCD) to identify clusters based on disease similarity at multiple resolutions. We evaluate the representations and clusters using a clinically curated set of 253 known disease association pairs, and qualitatively assess the interpretability of the clusters. RESULTS Both co-occurrence and sequence-based algorithms generate interpretable disease representations, with the best performance from the skip-gram algorithm. MMCD outperforms k-means and hierarchical clustering in explaining known disease associations. We find that diseases display an almost-hierarchical structure across resolutions from closely to more loosely similar co-occurrence patterns and identify interpretable clusters corresponding to both established and novel patterns. CONCLUSIONS Our method provides a tool for clustering diseases at different levels of resolution from co-occurrence patterns in high-dimensional electronic health records, which could be used to facilitate discovery of associations between diseases in the future.
Affiliation(s)
- Thomas Beaney
  - Department of Primary Care and Public Health, Imperial College London, London, W6 8RP, UK
  - Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
- Jonathan Clarke
  - Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
- David Salman
  - Department of Primary Care and Public Health, Imperial College London, London, W6 8RP, UK
  - MSk Lab, Department of Surgery and Cancer, Imperial College London, London, W12 0BZ, UK
- Thomas Woodcock
  - Department of Primary Care and Public Health, Imperial College London, London, W6 8RP, UK
- Azeem Majeed
  - Department of Primary Care and Public Health, Imperial College London, London, W6 8RP, UK
- Paul Aylin
  - Department of Primary Care and Public Health, Imperial College London, London, W6 8RP, UK
- Mauricio Barahona
  - Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
4
Beaney T, Clarke J, Woodcock T, Majeed A, Barahona M, Aylin P. Effect of timeframes to define long term conditions and sociodemographic factors on prevalence of multimorbidity using disease code frequency in primary care electronic health records: retrospective study. BMJ MEDICINE 2024; 3:e000474. [PMID: 38361663 PMCID: PMC10868275 DOI: 10.1136/bmjmed-2022-000474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/12/2023] [Indexed: 02/17/2024]
Abstract
Objective To determine the extent to which the choice of timeframe used to define a long term condition affects the prevalence of multimorbidity and whether this varies with sociodemographic factors. Design Retrospective study of disease code frequency in primary care electronic health records. Data sources Routinely collected, general practice, electronic health record data from the Clinical Practice Research Datalink Aurum were used. Main outcome measures Adults (≥18 years) in England who were registered in the database on 1 January 2020 were included. Multimorbidity was defined as the presence of two or more conditions from a set of 212 long term conditions. Multimorbidity prevalence was compared using five definitions. Any disease code recorded in the electronic health records for 212 conditions was used as the reference definition. Additionally, alternative definitions for 41 conditions requiring multiple codes (where a single disease code could indicate an acute condition) or a single code for the remaining 171 conditions were as follows: two codes at least three months apart; two codes at least 12 months apart; three codes within any 12 month period; and any code in the past 12 months. Mixed effects regression was used to calculate the expected change in multimorbidity status and number of long term conditions according to each definition and associations with patient age, gender, ethnic group, and socioeconomic deprivation. Results 9 718 573 people were included in the study, of whom 7 183 662 (73.9%) met the definition of multimorbidity where a single code was sufficient to define a long term condition. Variation was substantial in the prevalence according to timeframe used, ranging from 41.4% (n=4 023 023) for three codes in any 12 month period, to 55.2% (n=5 366 285) for two codes at least three months apart. Younger people (eg, 50-75% probability for 18-29 years v 1-10% for ≥80 years), people of some minority ethnic groups (eg, people in the Other ethnic group had higher probability than the South Asian ethnic group), and people living in areas of lower socioeconomic deprivation were more likely to be re-classified as not multimorbid when using definitions requiring multiple codes. Conclusions Choice of timeframe to define long term conditions has a substantial effect on the prevalence of multimorbidity in this nationally representative sample. Different timeframes affect prevalence for some people more than others, highlighting the need to consider the impact of bias in the choice of method when defining multimorbidity.
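The five timeframe definitions listed in this abstract can be expressed as simple predicates over the dates on which a condition's codes were recorded. The sketch below is an illustration only, not the study's code: the day counts used for "three months" (91 days) and "12 months" (365 days), the inclusive boundaries, and all function names are assumptions. The index date of 1 January 2020 is taken from the abstract.

```python
from datetime import date

INDEX_DATE = date(2020, 1, 1)   # registration date stated in the abstract
THREE_MONTHS = 91               # assumption: "three months" approximated in days
TWELVE_MONTHS = 365             # assumption: "12 months" approximated in days

def any_code(dates):
    """Reference definition: at least one code ever recorded."""
    return len(dates) >= 1

def two_codes_apart(dates, min_days):
    """Two codes at least min_days apart; the earliest and latest codes decide this."""
    return len(dates) >= 2 and (max(dates) - min(dates)).days >= min_days

def three_codes_within(dates, window_days=TWELVE_MONTHS):
    """Three codes inside any window of window_days (checked on sorted dates)."""
    d = sorted(dates)
    return any((d[i + 2] - d[i]).days <= window_days for i in range(len(d) - 2))

def any_code_recent(dates, index_date=INDEX_DATE, window_days=TWELVE_MONTHS):
    """Any code in the window_days before (or on) the index date."""
    return any(0 <= (index_date - x).days <= window_days for x in dates)

def classify(dates):
    """Apply all five definitions to one condition's code dates for one patient."""
    return {
        "any code": any_code(dates),
        "two codes >= 3 months apart": two_codes_apart(dates, THREE_MONTHS),
        "two codes >= 12 months apart": two_codes_apart(dates, TWELVE_MONTHS),
        "three codes within 12 months": three_codes_within(dates),
        "any code in past 12 months": any_code_recent(dates),
    }

# Hypothetical code dates for a single condition.
print(classify([date(2015, 1, 1), date(2015, 2, 1), date(2019, 6, 1)]))
```

Under each definition, a patient would then count as multimorbid when two or more conditions satisfy it, following the abstract's definition of multimorbidity.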
Affiliation(s)
- Thomas Beaney
- Department of Primary Care and Public Health, Imperial College London, London, UK
- Department of Mathematics, Imperial College London, London, UK
| | - Jonathan Clarke
- Department of Mathematics, Imperial College London, London, UK
| | - Thomas Woodcock
- Department of Primary Care and Public Health, Imperial College London, London, UK
| | - Azeem Majeed
- Department of Primary Care and Public Health, Imperial College London, London, UK
| | | | - Paul Aylin
- Department of Primary Care and Public Health, Imperial College London, London, UK
| |
5
Beaney T, Clarke J, Salman D, Woodcock T, Majeed A, Barahona M, Aylin P. Assigning disease clusters to people: A cohort study of the implications for understanding health outcomes in people with multiple long-term conditions. JOURNAL OF MULTIMORBIDITY AND COMORBIDITY 2024; 14:26335565241247430. [PMID: 38638408 PMCID: PMC11025432 DOI: 10.1177/26335565241247430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2023] [Accepted: 03/25/2024] [Indexed: 04/20/2024]
Abstract
Background Identifying clusters of co-occurring diseases may help characterise distinct phenotypes of Multiple Long-Term Conditions (MLTC). Understanding the associations of disease clusters with health-related outcomes requires a strategy to assign clusters to people, but it is unclear how the performance of different strategies compares. Aims First, to compare the performance of methods of assigning disease clusters to people at explaining mortality, emergency department attendances and hospital admissions over one year. Second, to identify the extent of variation in the associations with each outcome between and within clusters. Methods We conducted a cohort study of primary care electronic health records in England, including adults with MLTC. Seven strategies were tested to assign patients to fifteen disease clusters representing 212 LTCs, identified from our previous work. We tested the performance of each strategy at explaining associations with the three outcomes over 1 year using logistic regression and compared it with a strategy using the individual LTCs. Results 6,286,233 patients with MLTC were included. Of the seven strategies tested, a strategy assigning the count of conditions within each cluster performed best at explaining all three outcomes but was inferior to using information on the individual LTCs. There was a larger range of effect sizes for the individual LTCs within the same cluster than there was between the clusters. Conclusion Strategies of assigning clusters of co-occurring diseases to people were less effective at explaining health-related outcomes than a person's individual diseases. Furthermore, clusters did not represent consistent relationships of the LTCs within them, which might limit their application in clinical research.
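The best-performing assignment strategy described here, the count of a person's conditions within each cluster, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the disease-to-cluster mapping below is hypothetical (the study used fifteen clusters covering 212 LTCs), as are the function and variable names.

```python
from collections import Counter

# Hypothetical mapping from long-term condition to disease cluster.
DISEASE_TO_CLUSTER = {
    "hypertension": "cardiovascular",
    "atrial fibrillation": "cardiovascular",
    "psoriasis": "skin",
    "depression": "mental health",
}

def cluster_counts(conditions, mapping=DISEASE_TO_CLUSTER):
    """Number of a patient's distinct conditions falling in each cluster.

    Every cluster appears in the result, so patients share one feature layout
    that could be passed to a regression model.
    """
    counts = Counter(mapping[c] for c in set(conditions) if c in mapping)
    return {cluster: counts.get(cluster, 0) for cluster in sorted(set(mapping.values()))}

patient = ["hypertension", "atrial fibrillation", "depression"]
print(cluster_counts(patient))
# → {'cardiovascular': 2, 'mental health': 1, 'skin': 0}
```

The comparison strategy in the abstract instead uses one indicator per individual LTC, which retains information that the per-cluster counts aggregate away.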
Affiliation(s)
- Thomas Beaney
  - Department of Primary Care and Public Health, Imperial College London, London, UK
  - Centre for Mathematics of Precision Healthcare, Department of Mathematics, Imperial College London, London, UK
- Jonathan Clarke
  - Centre for Mathematics of Precision Healthcare, Department of Mathematics, Imperial College London, London, UK
- David Salman
  - Department of Primary Care and Public Health, Imperial College London, London, UK
- Thomas Woodcock
  - Department of Primary Care and Public Health, Imperial College London, London, UK
- Azeem Majeed
  - Department of Primary Care and Public Health, Imperial College London, London, UK
- Mauricio Barahona
  - Centre for Mathematics of Precision Healthcare, Department of Mathematics, Imperial College London, London, UK
- Paul Aylin
  - Department of Primary Care and Public Health, Imperial College London, London, UK