51
Castro VM, Gainer V, Wattanasin N, Benoit B, Cagan A, Ghosh B, Goryachev S, Metta R, Park H, Wang D, Mendis M, Rees M, Herrick C, Murphy SN. The Mass General Brigham Biobank Portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. J Am Med Inform Assoc 2022; 29:643-651. [PMID: 34849976 PMCID: PMC8922162 DOI: 10.1093/jamia/ocab264] [Received: 07/15/2021] [Revised: 10/20/2021] [Accepted: 11/16/2021] [Indexed: 01/07/2023]
Abstract
OBJECTIVE: Integrating and harmonizing disparate patient data sources into one consolidated data portal enables researchers to conduct analysis efficiently and effectively. MATERIALS AND METHODS: We describe an implementation of Informatics for Integrating Biology and the Bedside (i2b2) to create the Mass General Brigham (MGB) Biobank Portal data repository. The repository integrates data from primary and curated data sources and is updated weekly. The data are made readily available to investigators in a data portal where they can easily construct and export customized datasets for analysis. RESULTS: As of July 2021, there are 125 645 consented patients enrolled in the MGB Biobank. 88 527 (70.5%) have a biospecimen, 55 121 (43.9%) have completed the health information survey, 43 552 (34.7%) have genomic data and 124 760 (99.3%) have EHR data. Twenty machine learning computed phenotypes are calculated on a weekly basis. There are currently 1220 active investigators who have run 58 793 patient queries and exported 10 257 analysis files. DISCUSSION: The Biobank Portal allows noninformatics researchers to conduct study feasibility by querying across many data sources and then extract data that are most useful to them for clinical studies. While institutions require substantial informatics resources to establish and maintain integrated data repositories, they yield significant research value to a wide range of investigators. CONCLUSION: The Biobank Portal and other patient data portals that integrate complex and simple datasets enable diverse research use cases. i2b2 tools to implement these registries and make the data interoperable are open source and freely available.
Affiliation(s)
- Victor M Castro
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Vivian Gainer
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Nich Wattanasin
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Barbara Benoit
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Andrew Cagan
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Bhaswati Ghosh
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Sergey Goryachev
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Reeta Metta
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Heekyong Park
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- David Wang
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Michael Mendis
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Martin Rees
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Christopher Herrick
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
- Shawn N Murphy
  - Research Information Science and Computing, Mass General Brigham, Somerville, Massachusetts, USA
  - Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
52
Wang DD, Li Y, Nguyen XMT, Song RJ, Ho YL, Hu FB, Willett WC, Wilson PWF, Cho K, Gaziano JM, Djoussé L. Dietary Sodium and Potassium Intake and Risk of Non-Fatal Cardiovascular Diseases: The Million Veteran Program. Nutrients 2022; 14:1121. [PMID: 35268096 PMCID: PMC8912456 DOI: 10.3390/nu14051121] [Received: 01/25/2022] [Revised: 03/02/2022] [Accepted: 03/03/2022] [Indexed: 11/16/2022]
Abstract
Objective: To examine the association between intakes of sodium and potassium and the ratio of sodium to potassium and incident myocardial infarction and stroke. Design, Setting and Participants: Prospective cohort study of 180,156 Veterans aged 19 to 107 years with plausible dietary intake measured by food frequency questionnaire (FFQ) who were free of cardiovascular disease (CVD) and cancer at baseline in the VA Million Veteran Program (MVP). Main outcome measures: CVD defined as non-fatal myocardial infarction (MI) or acute ischemic stroke (AIS) ascertained using high-throughput phenotyping algorithms applied to electronic health records. Results: During up to 8 years of follow-up, we documented 4090 CVD cases (2499 MI and 1712 AIS). After adjustment for confounding factors, a higher sodium intake was associated with a higher risk of CVD, whereas potassium intake was inversely associated with the risk of CVD [hazard ratio (HR) comparing extreme quintiles, 95% confidence interval (CI): 1.09 (95% CI: 0.99−1.21, p trend = 0.01) for sodium and 0.87 (95% CI: 0.79−0.96, p trend = 0.005) for potassium]. In addition, the ratio of sodium to potassium (Na/K ratio) was positively associated with the risk of CVD (HR comparing extreme quintiles = 1.26, 95% CI: 1.14−1.39, p trend < 0.0001). The associations of Na/K ratio were consistent for two subtypes of CVD; one standard deviation increment in the ratio was associated with HRs (95% CI) of 1.12 (1.06−1.19) for MI and 1.11 (1.03−1.19) for AIS. In secondary analyses, the observed associations were consistent across race and status for diabetes, hypertension, and high cholesterol at baseline. Associations appeared to be more pronounced among participants with poor dietary quality. Conclusions: A high sodium intake and a low potassium intake were associated with a higher risk of CVD in this large population of US veterans.
Affiliation(s)
- Dong D Wang
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - The Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
  - Department of Nutrition, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Yanping Li
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Department of Nutrition, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Xuan-Mai T Nguyen
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Division of Aging, Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Rebecca J Song
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Department of Epidemiology, Boston University School of Public Health, Boston, MA 02115, USA
- Yuk-Lam Ho
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
- Frank B Hu
  - The Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
  - Department of Nutrition, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Walter C Willett
  - The Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
  - Department of Nutrition, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
- Peter W F Wilson
  - Atlanta VA Medical Center, Atlanta, GA 30033, USA
  - Emory Clinical Cardiovascular Research Institute, Atlanta, GA 30033, USA
- Kelly Cho
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Division of Aging, Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
- J Michael Gaziano
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Division of Aging, Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Luc Djoussé
  - Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA 02111, USA
  - Department of Nutrition, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA
  - Division of Aging, Department of Medicine, Brigham and Women's Hospital, Boston, MA 02115, USA
  - Harvard Medical School, Boston, MA 02115, USA
53
Cade BE, Hassan SM, Dashti HS, Kiernan M, Pavlova MK, Redline S, Karlson EW. Sleep apnea phenotyping and relationship to disease in a large clinical biobank. JAMIA Open 2022; 5:ooab117. [PMID: 35156000 PMCID: PMC8826997 DOI: 10.1093/jamiaopen/ooab117] [Received: 10/09/2021] [Revised: 12/08/2021] [Accepted: 12/28/2021] [Indexed: 11/14/2022]
Abstract
Objective: Sleep apnea is associated with a broad range of pathophysiology. While electronic health record (EHR) information has the potential for revealing relationships between sleep apnea and associated risk factors and outcomes, practical challenges hinder its use. Our objectives were to develop a sleep apnea phenotyping algorithm that improves the precision of EHR case/control information using natural language processing (NLP); identify novel associations between sleep apnea and comorbidities in a large clinical biobank; and investigate the relationship between polysomnography statistics and comorbid disease using NLP phenotyping. Materials and Methods: We performed clinical chart reviews on 300 participants putatively diagnosed with sleep apnea and applied International Classification of Sleep Disorders criteria to classify true cases and noncases. We evaluated 2 NLP and diagnosis code-only methods for their abilities to maximize phenotyping precision. The lead algorithm was used to identify incident and cross-sectional associations between sleep apnea and common comorbidities using 4876 NLP-defined sleep apnea cases and 3× matched controls. Results: The optimal NLP phenotyping strategy had improved model precision (≥0.943) compared to the use of one diagnosis code (≤0.733). Of the tested diseases, 170 disorders had significant incidence odds ratios (ORs) between cases and controls, 8 of which were confirmed using polysomnography (n = 4544), and 281 disorders had significant prevalence OR between sleep apnea cases versus controls, 41 of which were confirmed using polysomnography data. Discussion and Conclusion: An NLP-informed algorithm can improve the accuracy of case-control sleep apnea ascertainment and thus improve the performance of phenome-wide, genetic, and other EHR analyses of a highly prevalent disorder.
Sleep apnea is a common disease in which breathing partially or completely pauses during sleep, leading to less oxygen in the blood, repeated awakenings, and increased risk of developing multiple diseases. Current studies of sleep apnea often have relatively few participants due to the challenge of performing overnight sleep recordings. Electronic health record (EHR) billing code diagnoses of sleep apnea could be repurposed to increase the size of research studies, but the accuracy of the diagnoses is reduced. We developed a reusable algorithm that improves the accuracy of EHR sleep apnea diagnoses using natural language processing to extract information from clinical notes. As a proof of concept, we used the algorithm to identify hundreds of diseases that are increased among participants with sleep apnea compared to similar patients without sleep apnea. Many of these disease relationships with sleep apnea have not been previously recognized. This improved algorithm will help to accelerate future large-scale investigations of the causes and consequences of sleep apnea.
Affiliation(s)
- Brian E Cade
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - Division of Sleep Medicine, Harvard Medical School, Boston, Massachusetts, USA
  - Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA
- Syed Moin Hassan
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - Division of Sleep Medicine, Harvard Medical School, Boston, Massachusetts, USA
  - Division of Pulmonary Disease and Critical Care Medicine, University of Vermont, Burlington, Vermont, USA
- Hassan S Dashti
  - Program in Medical and Population Genetics, Broad Institute, Cambridge, Massachusetts, USA
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
  - Department of Anesthesia, Pain, and Critical Care Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA
- Melissa Kiernan
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - NeuroCare Center for Sleep, Newton, Massachusetts, USA
- Milena K Pavlova
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - Division of Sleep Medicine, Harvard Medical School, Boston, Massachusetts, USA
- Susan Redline
  - Division of Sleep and Circadian Disorders, Brigham and Women’s Hospital, Boston, Massachusetts, USA
  - Division of Sleep Medicine, Harvard Medical School, Boston, Massachusetts, USA
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
- Elizabeth W Karlson
  - Center for Genomic Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
  - Division of Rheumatology, Inflammation and Immunity, Brigham and Women's Hospital, Boston, Massachusetts, USA
54
Zhang Y, Liu M, Neykov M, Cai T. Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping. J Mach Learn Res 2022; 23:83. [PMID: 37974910 PMCID: PMC10653017] [Indexed: 11/19/2023]
Abstract
Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.
Affiliation(s)
- Yichi Zhang
  - Department of Computer Science and Statistics, University of Rhode Island
- Molei Liu
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Matey Neykov
  - Department of Statistics and Data Science, Carnegie Mellon University
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health
55
Artificial Intelligence in Clinical Immunology. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_83] [Indexed: 11/25/2022]
56
Liang L, Kim N, Hou J, Cai T, Dahal K, Lin C, Finan S, Savovoa G, Rosso M, Polgar-Tucsanyi M, Weiner H, Chitnis T, Cai T, Xia Z. Temporal trends of multiple sclerosis disease activity: Electronic health records indicators. Mult Scler Relat Disord 2022; 57:103333. [PMID: 35158446 PMCID: PMC8849591 DOI: 10.1016/j.msard.2021.103333] [Received: 03/16/2021] [Revised: 10/03/2021] [Accepted: 10/14/2021] [Indexed: 01/03/2023]
Abstract
BACKGROUND: Long-term data on multiple sclerosis (MS) inflammatory disease activity are limited. We examined electronic health records (EHR) indicators of disease activity in people with MS. METHODS: We analyzed prospectively collected research registry data and linked EHR data in a clinic-based cohort from 2000 to 2016. We used the trend of the yearly incident relapse rate from the registry data as benchmark. We then calculated the temporal trends of potentially relevant EHR measures, including mean count of the MS diagnostic code, mentions of MS-related concepts, MS-related health utilizations and selected prescriptions. RESULTS: 1,555 MS patients had both registry and EHR data. Between 2000 and 2016, the registry data showed a declining trend in the yearly incident relapse rate, parallel to an increasing trend of disease-modifying therapy (DMT) usage. Among the EHR measures, covariate-adjusted frequency of diagnostic code of MS, procedure codes of MS-related imaging studies and emergency room visits, and electronic prescription for steroids declined over time, mirroring the temporal trend of the benchmark yearly incident relapse rate. CONCLUSION: This study highlights EHR indicators of MS relapse that could enable large-scale examination of long-term disease activities or inform individual patient monitoring in clinical settings where EHR data are available.
Affiliation(s)
- Liang Liang
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Nicole Kim
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Jue Hou
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Tianrun Cai
  - Division of Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Kumar Dahal
  - Division of Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Chen Lin
  - Clinical Natural Language Processing Program, Boston Children’s Hospital, Boston, MA, USA
- Sean Finan
  - Clinical Natural Language Processing Program, Boston Children’s Hospital, Boston, MA, USA
- Guergana Savovoa
  - Clinical Natural Language Processing Program, Boston Children’s Hospital, Boston, MA, USA
- Mattia Rosso
  - Department of Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Howard Weiner
  - Department of Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Tanuja Chitnis
  - Department of Neurology, Brigham and Women’s Hospital, Boston, MA, USA
- Tianxi Cai
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Zongqi Xia
  - Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
57
Zaccaria GM, Colella V, Colucci S, Clemente F, Pavone F, Vegliante MC, Esposito F, Opinto G, Scattone A, Loseto G, Minoia C, Rossini B, Quinto AM, Angiulli V, Grieco LA, Fama A, Ferrero S, Moia R, Di Rocco A, Quaglia FM, Tabanelli V, Guarini A, Ciavarella S. Electronic case report forms generation from pathology reports by ARGO, automatic record generator for onco-hematology. Sci Rep 2021; 11:23823. [PMID: 34893665 PMCID: PMC8664934 DOI: 10.1038/s41598-021-03204-z] [Received: 07/01/2021] [Accepted: 11/23/2021] [Indexed: 12/04/2022]
Abstract
The unstructured nature of Real-World (RW) data from onco-hematological patients and the scarce accessibility to integrated systems restrain the use of RW information for research purposes. Natural Language Processing (NLP) might help in transposing unstructured reports into standardized electronic health records. We exploited NLP to develop an automated tool, named ARGO (Automatic Record Generator for Onco-hematology) to recognize information from pathology reports and populate electronic case report forms (eCRFs) pre-implemented by REDCap. ARGO was applied to hemo-lymphopathology reports of diffuse large B-cell, follicular, and mantle cell lymphomas, and assessed for accuracy (A), precision (P), recall (R) and F1-score (F) on internal (n = 239) and external (n = 93) report series. 326 (98.2%) reports were converted into corresponding eCRFs. Overall, ARGO showed high performance in capturing (1) identification report number (all metrics > 90%), (2) biopsy date (all metrics > 90% in both series), (3) specimen type (86.6% and 91.4% of A, 98.5% and 100.0% of P, 92.5% and 95.5% of F, and 87.2% and 91.4% of R for internal and external series, respectively), (4) diagnosis (100% of P with A, R and F of 90% in both series). We developed and validated a generalizable tool that generates structured eCRFs from real-life pathology reports.
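The specimen-type figures in this abstract can be cross-checked, because the F1-score is the harmonic mean of precision and recall. The sketch below is not part of ARGO. The function name `f1_score` and the `reported` dictionary are introduced here for illustration only, and the inputs are the P and R percentages quoted above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # convention when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Specimen-type precision and recall quoted in the abstract (hypothetical container).
reported = {
    "internal": (0.985, 0.872),
    "external": (1.000, 0.914),
}

for series, (p, r) in reported.items():
    print(f"{series}: F1 = {100 * f1_score(p, r):.1f}%")
```

Under these inputs the computed values round to 92.5% and 95.5%, which agrees with the F-scores the abstract reports for specimen type.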
Affiliation(s)
- Gian Maria Zaccaria
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Vito Colella
  - Department of Electrical and Information Engineering, Politecnico of Bari, Bari, Italy
- Simona Colucci
  - Department of Electrical and Information Engineering, Politecnico of Bari, Bari, Italy
- Felice Clemente
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Fabio Pavone
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Maria Carmela Vegliante
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Flavia Esposito
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
  - Department of Mathematics, University of Bari Aldo Moro, Bari, Italy
- Giuseppina Opinto
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Anna Scattone
  - Pathology Department, IRCCS Istituto Tumori 'Giovanni Paolo II', Bari, Italy
- Giacomo Loseto
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Carla Minoia
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Bernardo Rossini
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Angela Maria Quinto
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Vito Angiulli
  - Clinical Engineering Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Bari, Italy
- Luigi Alfredo Grieco
  - Department of Electrical and Information Engineering, Politecnico of Bari, Bari, Italy
- Angelo Fama
  - Hematology, Azienda USL - IRCCS Di Reggio Emilia, Reggio Emilia, Italy
- Simone Ferrero
  - Division of Hematology 1, AOU "Città Della Salute e Della Scienza di Torino", Torino, Italy
  - Department of Molecular Biotechnologies and Health Sciences, University of Torino, Torino, Italy
- Riccardo Moia
  - Division of Hematology, Azienda Ospedaliero-Universitaria Maggiore Della Carità Di Novara, Novara, Italy
- Alice Di Rocco
  - Unit of Hematology, Azienda Ospedaliero-Universitaria Policlinico Umberto I, Roma, Italy
- Valentina Tabanelli
  - Division of Diagnostic Haematopathology, European Institute of Oncology, IRCCS, Milano, Italy
- Attilio Guarini
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
- Sabino Ciavarella
  - Hematology and Cell Therapy Unit, IRCCS Istituto Tumori 'Giovanni Paolo II', Viale Orazio Flacco, 65, Bari, Italy
58
Hou J, Kim N, Cai T, Dahal K, Weiner H, Chitnis T, Cai T, Xia Z. Comparison of Dimethyl Fumarate vs Fingolimod and Rituximab vs Natalizumab for Treatment of Multiple Sclerosis. JAMA Netw Open 2021; 4:e2134627. [PMID: 34783826 PMCID: PMC8596196 DOI: 10.1001/jamanetworkopen.2021.34627] [Received: 04/28/2021] [Accepted: 09/20/2021] [Indexed: 01/17/2023]
Abstract
Importance: As disease-modifying treatment options for multiple sclerosis increase, comparisons of the options based on real-world evidence may guide clinical decision-making. Objective: To compare the relapse outcomes between 2 pairs of disease-modifying treatments: dimethyl fumarate vs fingolimod and natalizumab vs rituximab. Design, Setting, and Participants: This comparative effectiveness study integrated data from a clinic-based multiple sclerosis research registry and its linked electronic health records (EHR) system between January 1, 2006, and December 31, 2016, and built treatment groups for each pairwise disease-modifying treatment comparison according to both registry records and electronic prescriptions. Parallel analyses were conducted from October 11, 2019, to July 7, 2021. Main Outcomes and Measures: The main outcomes were the 1-year and 2-year relapse rates as well as the time to relapse. To compare relapse outcomes, the study adjusted for covariates from 2 sources (registry and EHR) and corrected for confounding biases among the covariates by the doubly robust estimation. Results: The study included 4 treatment groups: dimethyl fumarate (n = 260; 198 women [76.2%]; 227 non-Hispanic White individuals [87.3%]; mean [SD] age at diagnosis, 41.7 [10.4] years), fingolimod (n = 267; 190 women [71.2%]; 222 non-Hispanic White individuals [83.1%]; mean [SD] age at diagnosis, 37.9 [9.9] years), natalizumab (n = 204; 160 women [78.4%]; 172 non-Hispanic White individuals [84.3%]; mean [SD] age at diagnosis, 37.2 [10.6] years), and rituximab (n = 115; 83 women [72.2%]; 99 non-Hispanic White individuals [86.1%]; mean [SD] age at diagnosis, 44.1 [11.1] years).
No significant differences were found in the relapse outcomes between dimethyl fumarate and fingolimod after correcting for confounding biases and multiple testing (difference in 1-year relapse rate, 0.028 [95% CI, -0.031 to 0.084]; difference in 2-year relapse rate, 0.071 [95% CI, 0.008-0.128]; relative risk of 2-year non-relapse, 0.957 [95% CI, 0.884-1.035] with dimethyl fumarate as reference). When compared with rituximab, natalizumab was associated with a higher relapse rate for all 3 outcomes after bias correction and multiple testing (difference in 1-year relapse rate, 0.080 [95% CI, 0.013-0.137]; difference in 2-year relapse rate, 0.132 [95% CI, 0.043-0.189]; relative risk of 2-year non-relapse, 0.903 [95% CI, 0.822-0.944]). Confounders were identified from EHR data not recorded in the registry data through data-driven feature selection. Conclusions and Relevance: This study reports real-world evidence of equivalent relapse outcomes between dimethyl fumarate and fingolimod and relapse reduction in favor of rituximab relative to natalizumab. This approach illustrates the value of incorporating EHR data as high-dimensional covariates in real-world treatment comparison.
Affiliation(s)
- Jue Hou
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Nicole Kim
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Tianrun Cai
  - Division of Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts
- Kumar Dahal
  - Division of Rheumatology, Department of Medicine, Brigham and Women’s Hospital, Boston, Massachusetts
- Howard Weiner
  - Department of Neurology, Brigham and Women’s Hospital, Boston, Massachusetts
- Tanuja Chitnis
  - Department of Neurology, Brigham and Women’s Hospital, Boston, Massachusetts
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
- Zongqi Xia
  - Department of Neurology, University of Pittsburgh, Pittsburgh, Pennsylvania
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania
59
Hong C, Rush E, Liu M, Zhou D, Sun J, Sonabend A, Castro VM, Schubert P, Panickan VA, Cai T, Costa L, He Z, Link N, Hauser R, Gaziano JM, Murphy SN, Ostrouchov G, Ho YL, Begoli E, Lu J, Cho K, Liao KP, Cai T. Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digit Med 2021; 4:151. [PMID: 34707226 PMCID: PMC8551205 DOI: 10.1038/s41746-021-00519-z] [Received: 09/30/2020] [Accepted: 09/13/2021] [Indexed: 11/11/2022]
Abstract
The increasing availability of electronic health record (EHR) systems has created enormous potential for translational research. However, it is difficult to know all the relevant codes related to a phenotype due to the large number of codes available. Traditional data mining approaches often require the use of patient-level data, which hinders the ability to share data across institutions. In this project, we demonstrate that multi-center large-scale code embeddings can be used to efficiently identify relevant features related to a disease of interest. We constructed large-scale code embeddings for a wide range of codified concepts from EHRs from two large medical centers. We developed knowledge extraction via sparse embedding regression (KESER) for feature selection and integrative network analysis. We evaluated the quality of the code embeddings and assessed the performance of KESER in feature selection for eight diseases. In addition, we developed an integrated clinical knowledge map combining embedding data from both institutions. The features selected by KESER were comprehensive compared to lists of codified data generated by domain experts. Features identified via KESER resulted in comparable performance to those built upon features selected manually or with patient-level data. The knowledge map created using an integrative analysis identified disease-disease and disease-drug pairs more accurately compared to those identified using single institution data. Analysis of code embeddings via KESER can effectively reveal clinical knowledge and infer relatedness among codified concepts. KESER bypasses the need for patient-level data in individual analyses, providing a significant advance in enabling multi-center studies using EHR data.
Affiliation(s)
- Chuan Hong
  - Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, Boston, MA, USA
- Everett Rush
  - Department of Energy, Oak Ridge National Lab, Oak Ridge, TN, USA
- Molei Liu
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jiehuan Sun
  - University of Illinois at Chicago, Chicago, IL, USA
- Aaron Sonabend
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Tianrun Cai
  - VA Boston Healthcare System, Boston, MA, USA
  - Mass General Brigham, Boston, MA, USA
- Zeling He
  - Mass General Brigham, Boston, MA, USA
- J Michael Gaziano
  - Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, Boston, MA, USA
  - Brigham and Women's Hospital, Boston, MA, USA
- Yuk-Lam Ho
  - VA Boston Healthcare System, Boston, MA, USA
- Edmon Begoli
  - Department of Energy, Oak Ridge National Lab, Oak Ridge, TN, USA
- Junwei Lu
  - VA Boston Healthcare System, Boston, MA, USA
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Kelly Cho
  - Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, Boston, MA, USA
  - Brigham and Women's Hospital, Boston, MA, USA
- Katherine P Liao
  - Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, Boston, MA, USA
  - Brigham and Women's Hospital, Boston, MA, USA
- Tianxi Cai
  - Harvard Medical School, Boston, MA, USA
  - VA Boston Healthcare System, Boston, MA, USA
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
|
60
|
Le TT, Gutiérrez-Sacristán A, Son J, Hong C, South AM, Beaulieu-Jones BK, Loh NHW, Luo Y, Morris M, Ngiam KY, Patel LP, Samayamuthu MJ, Schriver E, Tan ALM, Moore J, Cai T, Omenn GS, Avillach P, Kohane IS, Visweswaran S, Mowery DL, Xia Z. Multinational characterization of neurological phenotypes in patients hospitalized with COVID-19. Sci Rep 2021; 11:20238. [PMID: 34642371 PMCID: PMC8510999 DOI: 10.1038/s41598-021-99481-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/22/2021] [Accepted: 09/23/2021] [Indexed: 01/08/2023] Open
Abstract
Neurological complications worsen outcomes in COVID-19. To define the prevalence of neurological conditions among hospitalized patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test in geographically diverse multinational populations during the early pandemic, we used electronic health records (EHR) from 338 participating hospitals across 6 countries and 3 continents (January-September 2020) for a cross-sectional analysis. We assessed the frequency of International Classification of Diseases codes for neurological conditions by country, healthcare system, time before and after admission for COVID-19, and COVID-19 severity. Among 35,177 hospitalized patients with SARS-CoV-2 infection, there was an increase in the proportion with disorders of consciousness (5.8%, 95% confidence interval [CI] 3.7-7.8%, pFDR < 0.001) and unspecified disorders of the brain (8.1%, 5.7-10.5%, pFDR < 0.001) when compared to the pre-admission proportion. During hospitalization, the relative risks of disorders of consciousness (22%, 19-25%), cerebrovascular diseases (24%, 13-35%), nontraumatic intracranial hemorrhage (34%, 20-50%), encephalitis and/or myelitis (37%, 17-60%) and myopathy (72%, 67-77%) were higher for patients with severe COVID-19 when compared to those who never experienced severe COVID-19. Leveraging a multinational network to capture standardized EHR data, we highlighted the increased prevalence of central and peripheral neurological phenotypes in patients hospitalized with COVID-19, particularly among those with severe disease.
Affiliation(s)
- Trang T Le
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Jiyeon Son
  - Department of Neurology, University of Pittsburgh, Biomedical Science Tower 3, Suite 7014, 3501 5th Avenue, Pittsburgh, PA, 15260, USA
- Chuan Hong
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Andrew M South
  - Department of Pediatrics, Wake Forest School of Medicine, Winston Salem, NC, USA
- Ne Hooi Will Loh
  - Department of Critical Care, National University Health Systems, Singapore, Singapore
- Yuan Luo
  - Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
- Michele Morris
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Kee Yuan Ngiam
  - Department of Surgery, National University Health Systems, Singapore, Singapore
- Lav P Patel
  - Department of Internal Medicine, University of Kansas Medical Center, Kansas City, KS, USA
- Emily Schriver
  - Data Analytics Center, University of Pennsylvania Health System, Philadelphia, PA, USA
- Amelia L M Tan
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Jason Moore
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Tianxi Cai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Gilbert S Omenn
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Isaac S Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Shyam Visweswaran
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Danielle L Mowery
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Zongqi Xia
  - Department of Neurology, University of Pittsburgh, Biomedical Science Tower 3, Suite 7014, 3501 5th Avenue, Pittsburgh, PA, 15260, USA
|
61
|
Cai T, Cai F, Dahal KP, Cremone G, Lam E, Golnik C, Seyok T, Hong C, Cai T, Liao KP. Improving the Efficiency of Clinical Trial Recruitment Using an Ensemble Machine Learning to Assist With Eligibility Screening. ACR Open Rheumatol 2021; 3:593-600. [PMID: 34296815 PMCID: PMC8449035 DOI: 10.1002/acr2.11289] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/30/2020] [Accepted: 05/18/2021] [Indexed: 11/22/2022] Open
Abstract
Objective Efficiently identifying eligible patients is a crucial first step for a successful clinical trial. The objective of this study was to test whether an approach using electronic health record (EHR) data and an ensemble machine learning algorithm incorporating billing codes and data from clinical notes processed by natural language processing (NLP) can improve the efficiency of eligibility screening. Methods We studied patients screened for a clinical trial of rheumatoid arthritis (RA) with one or more International Classification of Diseases (ICD) code for RA and age greater than 35 years, from a tertiary care center and a community hospital. The following three groups of EHR features were considered for the algorithm: 1) structured features, 2) the counts of NLP concepts from notes, 3) health care utilization. All features were linked to dates. We applied random forest and logistic regression with least absolute shrinkage and selection operator penalty against the following two standard approaches: 1) one or more RA ICD code and no ICD codes related to exclusion criteria (ScreenRAICD1+EX) and 2) two or more RA ICD codes (ScreenRAICD2). To test the portability, we trained the algorithm at one institution and tested it at the other. Results In total, 3359 patients at Brigham and Women’s Hospital (BWH) and 642 patients at Faulkner Hospital (FH) were studied, with 461 (13.7%) eligible patients at BWH and 84 (13.4%) at FH. The application of the algorithm reduced ineligible patients from chart review by 40.5% at the tertiary care center and by 57.0% at the community hospital. In contrast, ScreenRAICD2 reduced patients for chart review by 2.7% to 11.3%; ScreenRAICD1+EX reduced patients for chart review by 63% to 65% but excluded 22% to 27% of eligible patients. 
Conclusion The ensemble machine learning algorithm incorporating billing codes and NLP data increased the efficiency of eligibility screening by reducing the number of patients requiring chart review while not excluding eligible patients. Moreover, this approach can be trained at one institution and applied at another for multicenter clinical trials.
Affiliation(s)
- Tianrun Cai
  - Brigham and Women's Hospital, Boston, Massachusetts, United States
- Fiona Cai
  - Massachusetts Institute of Technology, Cambridge, Massachusetts, United States
- Kumar P Dahal
  - Brigham and Women's Hospital, Boston, Massachusetts, United States
- Ethan Lam
  - Brigham and Women's Hospital, Boston, Massachusetts, United States
- Charlotte Golnik
  - Brigham and Women's Hospital, Boston, Massachusetts, United States
- Thany Seyok
  - Brigham and Women's Hospital, Boston, Massachusetts, United States
- Chuan Hong
  - Harvard University, Boston, Massachusetts, United States
- Tianxi Cai
  - Harvard University, Boston, Massachusetts, United States
- Katherine P Liao
  - Brigham and Women's Hospital, Harvard University, and Veterans Affairs Boston Healthcare System, Boston, Massachusetts, United States
|
62
|
Abstract
Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be beneficial, including calibration and missing data. Beyond predictive models, we also investigate the utility of probabilistic machine learning models in phenotyping, in generative models for clinical use cases, and in reinforcement learning.
Affiliation(s)
- Irene Y Chen
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Marzyeh Ghassemi
  - Vector Institute, Toronto, Ontario M5G 1M1, Canada
  - Institute for Medical and Evaluative Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Rajesh Ranganath
  - Department of Computer Science, Courant Institute, New York University, New York, NY 10012, USA
  - Center for Data Science, New York University, New York, NY 10012, USA
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY 10016, USA
|
63
|
Abstract
Electronic health records (EHRs) are a rich source of data for researchers, but extracting meaningful information out of this highly complex data source is challenging. Phecodes represent one strategy for defining phenotypes for research using EHR data. They are a high-throughput phenotyping tool based on ICD (International Classification of Diseases) codes that can be used to rapidly define the case/control status of thousands of clinically meaningful diseases and conditions. Phecodes were originally developed to conduct phenome-wide association studies to scan for phenotypic associations with common genetic variants. Since then, phecodes have been used to support a wide range of EHR-based phenotyping methods, including the phenotype risk score. This review aims to comprehensively describe the development, validation, and applications of phecodes and suggest some future directions for phecodes and high-throughput phenotyping.
Affiliation(s)
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
|
64
|
Yuan Q, Cai T, Hong C, Du M, Johnson BE, Lanuti M, Cai T, Christiani DC. Performance of a Machine Learning Algorithm Using Electronic Health Record Data to Identify and Estimate Survival in a Longitudinal Cohort of Patients With Lung Cancer. JAMA Netw Open 2021; 4:e2114723. [PMID: 34232304 PMCID: PMC8264641 DOI: 10.1001/jamanetworkopen.2021.14723] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Indexed: 01/21/2023] Open
Abstract
IMPORTANCE Electronic health records (EHRs) provide a low-cost means of accessing detailed longitudinal clinical data for large populations. A lung cancer cohort assembled from EHR data would be a powerful platform for clinical outcome studies. OBJECTIVE To investigate whether a clinical cohort assembled from EHRs could be used in a lung cancer prognosis study. DESIGN, SETTING, AND PARTICIPANTS In this cohort study, patients with lung cancer were identified among 76 643 patients with at least 1 lung cancer diagnostic code deposited in an EHR in Mass General Brigham health care system from July 1988 to October 2018. Patients were identified via a semisupervised machine learning algorithm, for which clinical information was extracted from structured and unstructured data via natural language processing tools. Data completeness and accuracy were assessed by comparing with the Boston Lung Cancer Study and against criterion standard EHR review results. A prognostic model for non-small cell lung cancer (NSCLC) overall survival was further developed for clinical application. Data were analyzed from March 2019 through July 2020. EXPOSURES Clinical data deposited in EHRs for cohort construction and variables of interest for the prognostic model were collected. MAIN OUTCOMES AND MEASURES The primary outcomes were the performance of the lung cancer classification model and the quality of the extracted variables; the secondary outcome was the performance of the prognostic model. RESULTS Among 76 643 patients with at least 1 lung cancer diagnostic code, 42 069 patients were identified as having lung cancer, with a positive predictive value of 94.4%. The study cohort consisted of 35 375 patients (16 613 men [47.0%] and 18 756 women [53.0%]; 30 140 White individuals [85.2%], 1040 Black individuals [2.9%], and 857 Asian individuals [2.4%]) after excluding patients with lung cancer history and less than 14 days of follow-up after initial diagnosis. 
The median (interquartile range) age at diagnosis was 66.7 (58.4-74.1) years. The area under the receiver operating characteristic curves of the prognostic model for overall survival with NSCLC were 0.828 (95% CI, 0.815-0.842) for 1-year prediction, 0.825 (95% CI, 0.812-0.836) for 2-year prediction, 0.814 (95% CI, 0.800-0.826) for 3-year prediction, 0.814 (95% CI, 0.799-0.828) for 4-year prediction, and 0.812 (95% CI, 0.798-0.825) for 5-year prediction. CONCLUSIONS AND RELEVANCE These findings suggest the feasibility of assembling a large-scale EHR-based lung cancer cohort with detailed longitudinal clinical measurements and that EHR data may be applied in cancer progression with a set of generalizable approaches.
Affiliation(s)
- Qianyu Yuan
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- Tianrun Cai
  - Division of Rheumatology, Immunology, and Allergy, Brigham and Women’s Hospital, Boston, Massachusetts
- Chuan Hong
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
- Mulong Du
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Department of Biostatistics, Center for Global Health, School of Public Health, Nanjing Medical University, Nanjing, China
- Bruce E. Johnson
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts
  - Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, Massachusetts
- Michael Lanuti
  - Center for Thoracic Cancers, Division of Thoracic Surgery, Massachusetts General Hospital Cancer Center, Boston, Massachusetts
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts
- David C. Christiani
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  - Department of Medicine, Massachusetts General Hospital/Harvard Medical School, Boston, Massachusetts
|
65
|
Lee J, Liu C, Kim JH, Butler A, Shang N, Pang C, Natarajan K, Ryan P, Ta C, Weng C. Comparative effectiveness of medical concept embedding for feature engineering in phenotyping. JAMIA Open 2021; 4:ooab028. [PMID: 34142015 PMCID: PMC8206403 DOI: 10.1093/jamiaopen/ooab028] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/17/2020] [Revised: 02/23/2021] [Accepted: 05/03/2021] [Indexed: 01/20/2023] Open
Abstract
Objective Feature engineering is a major bottleneck in phenotyping. Properly learned medical concept embeddings (MCEs) capture the semantics of medical concepts and are thus useful for retrieving relevant medical features in phenotyping tasks. We compared the effectiveness of MCEs learned from knowledge graphs and electronic healthcare records (EHR) data in retrieving relevant medical features for phenotyping tasks. Materials and Methods We implemented 5 embedding methods including node2vec, singular value decomposition (SVD), LINE, skip-gram, and GloVe with 2 data sources: (1) knowledge graphs obtained from the observational medical outcomes partnership (OMOP) common data model; and (2) patient-level data obtained from the OMOP compatible electronic health records (EHR) from Columbia University Irving Medical Center (CUIMC). We used phenotypes with their relevant concepts developed and validated by the electronic medical records and genomics (eMERGE) network to evaluate the performance of learned MCEs in retrieving phenotype-relevant concepts. Hits@k% in retrieving phenotype-relevant concepts based on a single and multiple seed concept(s) was used to evaluate MCEs. Results Among all MCEs, MCEs learned by using node2vec with knowledge graphs showed the best performance. Of MCEs based on knowledge graphs and EHR data, MCEs learned by using node2vec with knowledge graphs and MCEs learned by using GloVe with EHR data outperformed the other MCEs, respectively. Conclusion MCE enables scalable feature engineering tasks, thereby facilitating phenotyping. Based on current phenotyping practices, MCEs learned by using knowledge graphs constructed by hierarchical relationships among medical concepts outperformed MCEs learned by using EHR data.
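The abstract above names Hits@k% as the evaluation metric but does not spell out its definition. The following is a minimal sketch under one plausible reading, which is an assumption: rank every candidate concept by cosine similarity to a single seed concept, keep the top k% of the ranking, and report the fraction of phenotype-relevant concepts that land in that slice. The function names, the toy embeddings, and the concept labels are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hits_at_k_percent(embeddings, seed, relevant, k_percent):
    """Fraction of relevant concepts found in the top k% of concepts
    ranked by similarity to the seed (assumed definition, see lead-in)."""
    candidates = [c for c in embeddings if c != seed]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(embeddings[seed], embeddings[c]),
        reverse=True,
    )
    top_n = max(1, math.ceil(len(ranked) * k_percent / 100))
    top = set(ranked[:top_n])
    return len(top & set(relevant)) / len(relevant)


# Toy 2-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "seed": (1.0, 0.0),
    "B": (0.9, 0.1),
    "C": (0.8, 0.3),
    "D": (0.0, 1.0),
    "E": (-1.0, 0.0),
}
toy_score = hits_at_k_percent(toy_embeddings, "seed", {"B", "D"}, 50)
```

With four candidates and k = 50, the top slice holds the two nearest concepts (B and C), so one of the two relevant concepts is retrieved. The multiple-seed variant mentioned in the abstract would need a rule for combining seeds, which is not specified here.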
Affiliation(s)
- Junghwan Lee
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Cong Liu
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Jae Hyun Kim
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Alex Butler
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Ning Shang
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Chao Pang
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Patrick Ryan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Casey Ta
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, New York 10032, USA
|
66
|
Wen A, Rasmussen LV, Stone D, Liu S, Kiefer R, Adekkanattu P, Brandt PS, Pacheco JA, Luo Y, Wang F, Pathak J, Liu H, Jiang G. CQL4NLP: Development and Integration of FHIR NLP Extensions in Clinical Quality Language for EHR-driven Phenotyping. AMIA Jt Summits Transl Sci Proc 2021; 2021:624-633. [PMID: 34457178 PMCID: PMC8378647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Lack of standardized representation of natural language processing (NLP) components in phenotyping algorithms hinders portability of the phenotyping algorithms and their execution in a high-throughput and reproducible manner. The objective of the study is to develop and evaluate a standard-driven approach - CQL4NLP - that integrates a collection of NLP extensions represented in the HL7 Fast Healthcare Interoperability Resources (FHIR) standard into the clinical quality language (CQL). A minimal NLP data model with 11 NLP-specific data elements was created, including six FHIR NLP extensions. All 11 data elements were identified from their usage in real-world phenotyping algorithms. An NLP ruleset generation mechanism was integrated into the NLP2FHIR pipeline and the NLP rulesets enabled comparable performance for a case study with the identification of obesity comorbidities. The NLP ruleset generation mechanism created a reproducible process for defining the NLP components of a phenotyping algorithm and its execution.
Affiliation(s)
- Yuan Luo
  - Northwestern University, Chicago, IL
- Fei Wang
  - Weill Cornell Medicine, New York, NY
|
67
|
Wang H, Goodman MO, Sofer T, Redline S. Cutting the fat: advances and challenges in sleep apnoea genetics. Eur Respir J 2021; 57:2004644. [PMID: 33958377 DOI: 10.1183/13993003.04644-2020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/28/2020] [Accepted: 02/10/2021] [Indexed: 01/25/2023]
Affiliation(s)
- Heming Wang
  - Division of Sleep and Circadian Disorders, Dept of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Matthew O Goodman
  - Division of Sleep and Circadian Disorders, Dept of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Tamar Sofer
  - Division of Sleep and Circadian Disorders, Dept of Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Susan Redline
  - Division of Sleep and Circadian Disorders, Dept of Medicine, Brigham and Women's Hospital, Boston, MA, USA
|
68
|
Veiga RV, Schuler-Faccini L, França GVA, Andrade RFS, Teixeira MG, Costa LC, Paixão ES, Costa MDCN, Barreto ML, Oliveira JF, Oliveira WK, Cardim LL, Rodrigues MS. Classification algorithm for congenital Zika Syndrome: characterizations, diagnosis and validation. Sci Rep 2021; 11:6770. [PMID: 33762667 PMCID: PMC7990918 DOI: 10.1038/s41598-021-86361-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/19/2020] [Accepted: 03/09/2021] [Indexed: 11/09/2022] Open
Abstract
Zika virus was responsible for the microcephaly epidemic in Brazil which began in October 2015 and brought great challenges to the scientific community and health professionals in terms of diagnosis and classification. Due to the difficulties in correctly identifying Zika cases, it is necessary to develop an automatic procedure to classify the probability of a congenital Zika syndrome (CZS) case from the clinical data. This work presents a machine learning algorithm capable of achieving this from structured and unstructured available data. The proposed algorithm reached 83% accuracy with textual information in medical records and image reports and 76% accuracy in classifying data without textual information. Therefore, the proposed algorithm has the potential to classify CZS cases in order to clarify the real effects of this epidemic, as well as to contribute to health surveillance in monitoring possible future epidemics.
Affiliation(s)
- Rafael V Veiga
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - Instituto de Ciências da Saúde, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Roberto F S Andrade
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - Instituto de Física, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Maria Glória Teixeira
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Larissa C Costa
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
- Enny S Paixão
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - London School of Hygiene and Tropical Medicine, London, England, United Kingdom
- Maria da Conceição N Costa
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - Instituto de Saúde Coletiva, Universidade Federal da Bahia, Salvador, Bahia, Brazil
- Maurício L Barreto
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
- Juliane F Oliveira
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
  - Department of Mathematics, Centre of Mathematics of the University of Porto (CMUP), Porto, Portugal
- Wanderson K Oliveira
  - Hospital das Forças Armadas, Ministério da Defesa, Distrito Federal, Brasília, Brazil
- Luciana L Cardim
  - Center of Data and Knowledge Integration for Health (CIDACS), Instituto Gonçalo Moniz, Fundação Oswaldo Cruz, Salvador, Bahia, Brazil
|
69
|
Tedeschi SK, Cai T, He Z, Ahuja Y, Hong C, Yates KA, Dahal K, Xu C, Lyu H, Yoshida K, Solomon DH, Cai T, Liao KP. Classifying Pseudogout Using Machine Learning Approaches With Electronic Health Record Data. Arthritis Care Res (Hoboken) 2021; 73:442-448. [PMID: 31910317 DOI: 10.1002/acr.24132] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/26/2019] [Accepted: 12/31/2019] [Indexed: 12/19/2022]
Abstract
OBJECTIVE Identifying pseudogout in large data sets is difficult due to its episodic nature and a lack of billing codes specific to this acute subtype of calcium pyrophosphate (CPP) deposition disease. The objective of this study was to evaluate a novel machine learning approach for classifying pseudogout using electronic health record (EHR) data. METHODS We created an EHR data mart of patients with ≥1 relevant billing code or ≥2 natural language processing (NLP) mentions of pseudogout or chondrocalcinosis, 1991-2017. We selected 900 subjects for gold standard chart review for definite pseudogout (synovitis + synovial fluid CPP crystals), probable pseudogout (synovitis + chondrocalcinosis), or not pseudogout. We applied a topic modeling approach to identify definite/probable pseudogout. A combined algorithm included topic modeling plus manually reviewed CPP crystal results. We compared algorithm performance and cohorts identified by billing codes, the presence of CPP crystals, topic modeling, and a combined algorithm. RESULTS Among 900 subjects, 123 (13.7%) had pseudogout by chart review (68 definite, 55 probable). Billing codes had a sensitivity of 65% and a positive predictive value (PPV) of 22% for pseudogout. The presence of CPP crystals had a sensitivity of 29% and a PPV of 92%. Without using CPP crystal results, topic modeling had a sensitivity of 29% and a PPV of 79%. The combined algorithm yielded a sensitivity of 42% and a PPV of 81%. The combined algorithm identified 50% more patients than the presence of CPP crystals; the latter captured a portion of definite pseudogout and missed probable pseudogout. CONCLUSION For pseudogout, an episodic disease with no specific billing code, combining NLP, machine learning methods, and synovial fluid laboratory results yielded an algorithm that significantly boosted the PPV compared to billing codes.
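The sensitivity and positive predictive value (PPV) figures quoted in this abstract follow the standard confusion-matrix definitions. As a minimal sketch, with function names that are illustrative and counts that are hypothetical rather than taken from the study, they can be computed like this:

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of chart-review-confirmed cases that the algorithm flags."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Share of algorithm-flagged patients who are confirmed cases."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (not data from the study).
tp, fp, fn = 40, 10, 60
example_sensitivity = sensitivity(tp, fn)        # 40 / (40 + 60)
example_ppv = positive_predictive_value(tp, fp)  # 40 / (40 + 10)
```

A classifier with high PPV and low sensitivity, the pattern the abstract reports for CPP crystal results alone, flags few false positives but misses many true cases.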
Affiliation(s)
- Sara K Tedeschi
  - Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
- Tianrun Cai
  - Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
- Zeling He
  - Brigham and Women's Hospital and Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Yuri Ahuja
  - Harvard Medical School, Boston, Massachusetts
- Chuan Hong
  - Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Kumar Dahal
  - Brigham and Women's Hospital, Boston, Massachusetts
- Chang Xu
  - Brigham and Women's Hospital, Boston, Massachusetts
- Houchen Lyu
  - Brigham and Women's Hospital, Boston, Massachusetts
- Kazuki Yoshida
  - Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
- Daniel H Solomon
  - Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
- Tianxi Cai
  - Harvard T. H. Chan School of Public Health and Harvard Medical School, Boston, Massachusetts
- Katherine P Liao
  - Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts
|
70
|
Kohane IS, Aronow BJ, Avillach P, Beaulieu-Jones BK, Bellazzi R, Bradford RL, Brat GA, Cannataro M, Cimino JJ, García-Barrio N, Gehlenborg N, Ghassemi M, Gutiérrez-Sacristán A, Hanauer DA, Holmes JH, Hong C, Klann JG, Loh NHW, Luo Y, Mandl KD, Daniar M, Moore JH, Murphy SN, Neuraz A, Ngiam KY, Omenn GS, Palmer N, Patel LP, Pedrera-Jiménez M, Sliz P, South AM, Tan ALM, Taylor DM, Taylor BW, Torti C, Vallejos AK, Wagholikar KB, Weber GM, Cai T. What Every Reader Should Know About Studies Using Electronic Health Record Data but May Be Afraid to Ask. J Med Internet Res 2021; 23:e22219. [PMID: 33600347 PMCID: PMC7927948 DOI: 10.2196/22219] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 07/13/2020] [Revised: 09/14/2020] [Accepted: 01/10/2021] [Indexed: 12/13/2022] Open
Abstract
Coincident with the tsunami of COVID-19–related publications, there has been a surge of studies using real-world data, including those obtained from the electronic health record (EHR). Unfortunately, several of these high-profile publications were retracted because of concerns regarding the soundness and quality of the studies and the EHR data they purported to analyze. These retractions highlight that although a small community of EHR informatics experts can readily identify strengths and flaws in EHR-derived studies, many medical editorial teams and otherwise sophisticated medical readers lack the framework to fully critically appraise these studies. In addition, conventional statistical analyses cannot overcome the need for an understanding of the opportunities and limitations of EHR-derived studies. We distill here from the broader informatics literature six key considerations that are crucial for appraising studies utilizing EHR data: data completeness, data collection and handling (eg, transformation), data type (ie, codified, textual), robustness of methods against EHR variability (within and across institutions, countries, and time), transparency of data and analytic code, and the multidisciplinary approach. These considerations will inform researchers, clinicians, and other stakeholders as to the recommended best practices in reviewing manuscripts, grants, and other outputs from EHR-data derived studies, and thereby promote and foster rigor, quality, and reliability of this rapidly growing field.
Affiliation(s)
- Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Bruce J Aronow
- Biomedical Informatics, Cincinnati Children's Hospital Medical Center, University of Cincinnati, Cincinnati, OH, United States
| | - Paul Avillach
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | | | - Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.,ICS Maugeri, Pavia, Italy
| | - Robert L Bradford
- North Carolina Translational and Clinical Sciences Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Gabriel A Brat
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Mario Cannataro
- Data Analytics Research Center, University Magna Graecia of Catanzaro, Catanzaro, Italy.,Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, Italy
| | - James J Cimino
- Informatics Institute, University of Alabama at Birmingham, Birmingham, AL, United States
| | | | - Nils Gehlenborg
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Marzyeh Ghassemi
- Department of Computer Science and Medicine, University of Toronto, Toronto, ON, Canada
| | | | - David A Hanauer
- Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, United States
| | - John H Holmes
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
| | - Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Jeffrey G Klann
- Department of Medicine, Harvard Medical School, Boston, MA, United States.,Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA, United States
| | | | - Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, IL, United States
| | - Kenneth D Mandl
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
| | - Mohamad Daniar
- Clinical Research Informatics, Boston Children's Hospital, Boston, MA, United States
| | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States
| | - Shawn N Murphy
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.,Department of Neurology, Massachusetts General Hospital, Boston, MA, United States
| | - Antoine Neuraz
- Department of Biomedical Informatics, Necker-Enfant Malades Hospital, Assistance Publique - Hôpitaux de Paris, Paris, France.,Centre de Recherche des Cordeliers, INSERM UMRS 1138 Team 22, Université de Paris, Paris, France
| | - Kee Yuan Ngiam
- National University Health Systems, Singapore, Singapore
| | - Gilbert S Omenn
- Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, MI, United States
| | - Nathan Palmer
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Lav P Patel
- Department of Internal Medicine, Division of Medical Informatics, University of Kansas Medical Center, Kansas City, KS, United States
| | | | - Piotr Sliz
- Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
| | - Andrew M South
- Section of Nephrology, Department of Pediatrics, Brenner Children's Hospital, Wake Forest School of Medicine, Winston Salem, NC, United States
| | - Amelia Li Min Tan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States.,Department of Biomedical Informatics, National University of Singapore, Singapore, Singapore
| | - Deanne M Taylor
- Department of Biomedical and Health Informatics, The Children's Hospital of Philadelphia, Philadelphia, PA, United States.,Department of Pediatrics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, United States
| | - Bradley W Taylor
- Clinical and Translational Science Institute, Medical College of Wisconsin, Milwaukee, WI, United States
| | - Carlo Torti
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, Italy
| | - Andrew K Vallejos
- Clinical and Translational Science Institute, Medical College of Wisconsin, Milwaukee, WI, United States
| | - Kavishwar B Wagholikar
- Department of Medicine, Harvard Medical School, Boston, MA, United States.,Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA, United States
| | | | - Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
|
71
|
Ahuja Y, Kim N, Liang L, Cai T, Dahal K, Seyok T, Lin C, Finan S, Liao K, Savovoa G, Chitnis T, Cai T, Xia Z. Leveraging electronic health records data to predict multiple sclerosis disease activity. Ann Clin Transl Neurol 2021; 8:800-810. [PMID: 33626237 PMCID: PMC8045951 DOI: 10.1002/acn3.51324] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 11/04/2020] [Revised: 12/26/2020] [Accepted: 02/01/2021] [Indexed: 12/26/2022] Open
Abstract
Objective No relapse risk prediction tool is currently available to guide treatment selection for multiple sclerosis (MS). Leveraging electronic health record (EHR) data readily available at the point of care, we developed a clinical tool for predicting MS relapse risk. Methods Using data from a clinic‐based research registry and linked EHR system between 2006 and 2016, we developed models predicting relapse events from the registry in a training set (n = 1435) and tested the model performance in an independent validation set of MS patients (n = 186). This iterative process identified prior 1‐year relapse history as a key predictor of future relapse but ascertaining relapse history through the labor‐intensive chart review is impractical. We pursued two‐stage algorithm development: (1) L1‐regularized logistic regression (LASSO) to phenotype past 1‐year relapse status from contemporaneous EHR data, (2) LASSO to predict future 1‐year relapse risk using imputed prior 1‐year relapse status and other algorithm‐selected features. Results The final model, comprising age, disease duration, and imputed prior 1‐year relapse history, achieved a predictive AUC and F score of 0.707 and 0.307, respectively. The performance was significantly better than the baseline model (age, sex, race/ethnicity, and disease duration) and noninferior to a model containing actual prior 1‐year relapse history. The predicted risk probability declined with disease duration and age. Conclusion Our novel machine‐learning algorithm predicts 1‐year MS relapse with accuracy comparable to other clinical prediction tools and has applicability at the point of care. This EHR‐based two‐stage approach of outcome prediction may have application to neurological disease beyond MS.
Affiliation(s)
- Yuri Ahuja
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Nicole Kim
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Liang Liang
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Tianrun Cai
- Division of Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Kumar Dahal
- Division of Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Thany Seyok
- Division of Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Chen Lin
- Clinical Natural Language Processing Program, Boston Children's Hospital, Boston, MA, USA
| | - Sean Finan
- Clinical Natural Language Processing Program, Boston Children's Hospital, Boston, MA, USA
| | - Katherine Liao
- Division of Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Guergana Savovoa
- Clinical Natural Language Processing Program, Boston Children's Hospital, Boston, MA, USA
| | - Tanuja Chitnis
- Department of Neurology, Brigham and Women's Hospital, Boston, MA, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Zongqi Xia
- Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
|
72
|
Geva A, Liu M, Panickan VA, Avillach P, Cai T, Mandl KD. A high-throughput phenotyping algorithm is portable from adult to pediatric populations. J Am Med Inform Assoc 2021; 28:1265-1269. [PMID: 33594412 DOI: 10.1093/jamia/ocaa343] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/12/2020] [Revised: 11/27/2020] [Accepted: 12/28/2020] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Multimodal automated phenotyping (MAP) is a scalable, high-throughput phenotyping method, developed using electronic health record (EHR) data from an adult population. We tested transportability of MAP to a pediatric population. MATERIALS AND METHODS Without additional feature engineering or supervised training, we applied MAP to a pediatric population enrolled in a biobank and evaluated performance against physician-reviewed medical records. We also compared performance of MAP at the pediatric institution and the original adult institution where MAP was developed, including for 6 phenotypes validated at both institutions against physician-reviewed medical records. RESULTS MAP performed equally well in the pediatric setting (average AUC 0.98) as it did at the general adult hospital system (average AUC 0.96). MAP's performance in the pediatric sample was similar across the 6 specific phenotypes also validated against gold-standard labels in the adult biobank. CONCLUSIONS MAP is highly transportable across diverse populations and has potential for wide-scale use.
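The AUC values quoted above can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. A minimal sketch of that pairwise definition follows; the function name and toy scores are assumptions for illustration and are not taken from MAP or the study.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (case, non-case) pairs ranked correctly; ties count as half."""
    case_scores = [s for s, y in zip(scores, labels) if y]
    control_scores = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical phenotype scores: three chart-confirmed cases, three non-cases.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [True, True, True, False, False, False]
auc_value = pairwise_auc(scores, labels)
print(round(auc_value, 3))
```

The quadratic loop is fine for a validation sample of a few hundred reviewed charts; rank-based formulas give the same value faster on large cohorts.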
Affiliation(s)
- Alon Geva
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA.,Division of Critical Care Medicine, Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, Massachusetts, USA.,Department of Anaesthesia, Harvard Medical School, Boston, Massachusetts, USA
| | - Molei Liu
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA
| | - Vidul A Panickan
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
| | - Paul Avillach
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA.,Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
| | - Tianxi Cai
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
| | - Kenneth D Mandl
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts, USA.,Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA.,Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, USA
|
73
|
Le TT, Gutiérrez-Sacristán A, Son J, Hong C, South AM, Beaulieu-Jones BK, Loh NHW, Luo Y, Morris M, Ngiam KY, Patel LP, Samayamuthu MJ, Schriver E, Tan AL, Moore J, Cai T, Omenn GS, Avillach P, Kohane IS, Visweswaran S, Mowery DL, Xia Z. Multinational Prevalence of Neurological Phenotypes in Patients Hospitalized with COVID-19. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2021. [PMID: 33655281 PMCID: PMC7924306 DOI: 10.1101/2021.01.27.21249817] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/08/2023]
Abstract
OBJECTIVE: Neurological complications can worsen outcomes in COVID-19. We defined the prevalence of a wide range of neurological conditions among patients hospitalized with COVID-19 in geographically diverse multinational populations. METHODS: Using electronic health record (EHR) data from 348 participating hospitals across 6 countries and 3 continents between January and September 2020, we performed a cross-sectional study of hospitalized adult and pediatric patients with a positive SARS-CoV-2 reverse transcription polymerase chain reaction test, both with and without severe COVID-19. We assessed the frequency of each disease category and 3-character International Classification of Disease (ICD) code of neurological diseases by countries, sites, time before and after admission for COVID-19, and COVID-19 severity. RESULTS: Among the 35,177 hospitalized patients with SARS-CoV-2 infection, there was increased prevalence of disorders of consciousness (5.8%, 95% confidence interval [CI]: 3.7%−7.8%, pFDR<.001) and unspecified disorders of the brain (8.1%, 95%CI: 5.7%−10.5%, pFDR<.001), compared to pre-admission prevalence. During hospitalization, patients who experienced severe COVID-19 status had 22% (95%CI: 19%−25%) increase in the relative risk (RR) of disorders of consciousness, 24% (95%CI: 13%−35%) increase in other cerebrovascular diseases, 34% (95%CI: 20%−50%) increase in nontraumatic intracranial hemorrhage, 37% (95%CI: 17%−60%) increase in encephalitis and/or myelitis, and 72% (95%CI: 67%−77%) increase in myopathy compared to those who never experienced severe disease. INTERPRETATION: Using an international network and common EHR data elements, we highlight an increase in the prevalence of central and peripheral neurological phenotypes in patients hospitalized with SARS-CoV-2 infection, particularly among those with severe disease.
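The abstract does not state how the relative risks and intervals were estimated across sites. As an illustration only, a relative risk with a 95% CI for a single two-group comparison can be sketched with the standard log-transformation (Katz) approach; all counts below are hypothetical.

```python
import math


def relative_risk(events_a, total_a, events_b, total_b, z=1.96):
    """Relative risk of group A versus group B with a log-scale confidence interval."""
    risk_a = events_a / total_a
    risk_b = events_b / total_b
    rr = risk_a / risk_b
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30 of 100 severe patients versus 20 of 100 non-severe patients
# with a given neurological code.
rr, ci_low, ci_high = relative_risk(30, 100, 20, 100)
print(f"RR={rr:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
```

An RR of 1.22 corresponds to the "22% increase" phrasing used in the abstract.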
Affiliation(s)
- Trang T Le
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | | | - Jiyeon Son
- Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Chuan Hong
- Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Andrew M South
- Department of Pediatrics, Wake Forest School of Medicine, Winston Salem, NC, USA
| | | | - Ne Hooi Will Loh
- Department of Critical Care, National University Health Systems, Singapore
| | - Yuan Luo
- Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
| | - Michele Morris
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Kee Yuan Ngiam
- Department of Surgery, National University Health Systems, Singapore
| | - Lav P Patel
- Department of Internal Medicine, University of Kansas Medical Center, Kansas City, KS, USA
| | | | - Emily Schriver
- Data Analytics Center, University of Pennsylvania Health System, Philadelphia, PA, USA
| | - Amelia Lm Tan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Jason Moore
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Paul Avillach
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | | | - Shyam Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
| | - Danielle L Mowery
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Zongqi Xia
- Department of Neurology, University of Pittsburgh, PA, USA
|
74
|
Artificial Intelligence in Clinical Immunology. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_83-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
75
|
Huang S, Huang J, Cai T, Dahal KP, Cagan A, He Z, Stratton J, Gorelik I, Hong C, Cai T, Liao KP. Impact of ICD10 and secular changes on electronic medical record rheumatoid arthritis algorithms. Rheumatology (Oxford) 2020; 59:3759-3766. [PMID: 32413107 DOI: 10.1093/rheumatology/keaa198] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/22/2019] [Accepted: 03/17/2020] [Indexed: 12/18/2022] Open
Abstract
OBJECTIVE The objective of this study was to compare the performance of an RA algorithm developed and trained in 2010 utilizing natural language processing and machine learning, using updated data containing ICD10, new RA treatments, and a new electronic medical records (EMR) system. METHODS We extracted data from subjects with ≥1 RA International Classification of Diseases (ICD) codes from the EMR of two large academic centres to create a data mart. Gold standard RA cases were identified from reviewing a random 200 subjects from the data mart, and a random 100 subjects who only have RA ICD10 codes. We compared the performance of the following algorithms using the original 2010 data with updated data: (i) a published 2010 RA algorithm; (ii) updated algorithm, incorporating ICD10 RA codes and new DMARDs; and (iii) published algorithm using ICD codes only, ICD RA code ≥3. RESULTS The gold standard RA cases had mean age 65.5 years, 78.7% female, 74.1% RF or antibodies to cyclic citrullinated peptide (anti-CCP) positive. The positive predictive value (PPV) for ≥3 RA ICD was 54%, compared with 56% in 2010. At a specificity of 95%, the PPV of the 2010 algorithm and the updated version were both 91%, compared with 94% (95% CI: 91, 96%) in 2010. In subjects with ICD10 data only, the PPV for the updated 2010 RA algorithm was 93%. CONCLUSION The 2010 RA algorithm validated with the updated data with similar performance characteristics as the 2010 data. While the 2010 algorithm continued to perform better than the rule-based approach, the PPV of the latter also remained stable over time.
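Reporting PPV "at a specificity of 95%" implies choosing a score cutoff on gold-standard non-cases and then measuring PPV at that cutoff. The sketch below shows one way to do this; it assumes higher scores mean more likely RA, and the function name and toy scores are hypothetical, not the published algorithm.

```python
def ppv_at_specificity(scores, labels, target_specificity=0.95):
    """Return (cutoff, PPV) for the lowest cutoff whose specificity meets the target."""
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not negatives:
        return None, float("nan")
    # Try each observed score as a cutoff, lowest first, so sensitivity stays as high as possible.
    for cutoff in sorted(set(scores)):
        true_neg = sum(1 for s in negatives if s < cutoff)
        if true_neg / len(negatives) >= target_specificity:
            true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
            false_pos = len(negatives) - true_neg
            flagged = true_pos + false_pos
            return cutoff, (true_pos / flagged if flagged else float("nan"))
    return None, float("nan")


# Hypothetical scores on a 0-100 scale: 20 non-cases followed by 5 cases.
toy_scores = list(range(0, 100, 5)) + [60, 80, 92, 97, 99]
toy_labels = [False] * 20 + [True] * 5
cutoff, ppv_95 = ppv_at_specificity(toy_scores, toy_labels)
print(cutoff, ppv_95)
```

Fixing specificity first makes PPV comparable across algorithm versions evaluated on the same reviewed sample, which is the comparison the abstract draws between 2010 and updated data.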
Affiliation(s)
- Sicong Huang
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital.,Department of Medicine, Harvard Medical School
| | - Jie Huang
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital
| | - Tianrun Cai
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital.,Department of Medicine, Harvard Medical School
| | - Kumar P Dahal
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital
| | - Andrew Cagan
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital.,Research Information Science and Computing, Partners Healthcare
| | - Zeling He
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital
| | - Jacklyn Stratton
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital
| | - Isaac Gorelik
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital
| | - Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School.,Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Tianxi Cai
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital.,Department of Biomedical Informatics, Harvard Medical School.,Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Katherine P Liao
- Division of Rheumatology, Inflammation, and Immunity, Brigham and Women's Hospital.,Department of Biomedical Informatics, Harvard Medical School
|
76
|
Dligach D, Afshar M, Miller T. Pre-training phenotyping classifiers. J Biomed Inform 2020; 113:103626. [PMID: 33259943 DOI: 10.1016/j.jbi.2020.103626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2020] [Revised: 09/09/2020] [Accepted: 11/14/2020] [Indexed: 11/17/2022]
Abstract
Recent transformer-based pre-trained language models have become a de facto standard for many text classification tasks. Nevertheless, their utility in the clinical domain, where classification is often performed at encounter or patient level, is still uncertain due to the limitation on the maximum length of input. In this work, we introduce a self-supervised method for pre-training that relies on a masked token objective and is free from the limitation on the maximum input length. We compare the proposed method with supervised pre-training that uses billing codes as a source of supervision. We evaluate the proposed method on one publicly-available and three in-house datasets using the standard evaluation metrics such as the area under the ROC curve and F1 score. We find that, surprisingly, even though self-supervised pre-training performs slightly worse than supervised, it still preserves most of the gains from pre-training.
Affiliation(s)
- Dmitriy Dligach
- Loyola University Chicago, Department of Computer Science, Chicago, IL, United States.
| | - Majid Afshar
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin Madison, Madison, WI, United States.
| | - Timothy Miller
- Computational Health Informatics Program (CHIP), Boston Children's Hospital and Harvard Medical School, Boston, MA, United States.
|
77
|
Statistical Physics for Medical Diagnostics: Learning, Inference, and Optimization Algorithms. Diagnostics (Basel) 2020; 10:diagnostics10110972. [PMID: 33228143 PMCID: PMC7699346 DOI: 10.3390/diagnostics10110972] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/06/2020] [Revised: 11/16/2020] [Accepted: 11/17/2020] [Indexed: 02/03/2023] Open
Abstract
It is widely believed that cooperation between clinicians and machines may address many of the decisional fragilities intrinsic to current medical practice. However, the realization of this potential will require more precise definitions of disease states as well as their dynamics and interactions. A careful probabilistic examination of symptoms and signs, including the molecular profiles of the relevant biochemical networks, will often be required for building an unbiased and efficient diagnostic approach. Analogous problems have been studied for years by physicists extracting macroscopic states of various physical systems by examining microscopic elements and their interactions. These valuable experiences are now being extended to the medical field. From this perspective, we discuss how recent developments in statistical physics, machine learning and inference algorithms are coming together to improve current medical diagnostic approaches.
|
78
|
Raghavan S, Ho YL, Vassy JL, Posner D, Honerlaw J, Costa L, Phillips LS, Gagnon DR, Wilson PWF, Cho K. Optimizing Atherosclerotic Cardiovascular Disease Risk Estimation for Veterans With Diabetes Mellitus. Circ Cardiovasc Qual Outcomes 2020; 13:e006528. [PMID: 32862698 PMCID: PMC7914289 DOI: 10.1161/circoutcomes.120.006528] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/14/2023]
Abstract
BACKGROUND Estimated 10-year atherosclerotic cardiovascular disease (ASCVD) risk in diabetes mellitus patients is used to guide primary prevention, but the performance of risk estimators (2013 Pooled Cohort Equations [PCE] and Risk Equations for Complications of Diabetes [RECODe]) varies across populations. Data from electronic health records could be used to improve risk estimation for a health system's patients. We aimed to evaluate risk equations for initial ASCVD events in US veterans with diabetes mellitus and improve model performance in this population. METHODS AND RESULTS We studied 183 096 adults with diabetes mellitus and without prior ASCVD who received care in the Veterans Affairs Healthcare System (VA) from 2002 to 2016 with mean follow-up of 4.6 years. We evaluated model discrimination, using Harrell's C statistic, and calibration, using the reclassification χ2 test, of the PCE and RECODe equations to predict fatal or nonfatal myocardial infarction or stroke and cardiovascular mortality. We then tested whether model performance was affected by deriving VA-specific β-coefficients. Discrimination of ASCVD events by the PCE was improved by deriving VA-specific β-coefficients (C statistic increased from 0.560 to 0.597) and improved further by including measures of glycemia, renal function, and diabetes mellitus treatment (C statistic, 0.632). Discrimination by the RECODe equations was improved by substituting VA-specific coefficients (C statistic increased from 0.604 to 0.621). Absolute risk estimation by PCE and RECODe equations also improved with VA-specific coefficients; the calibration P increased from <0.001 to 0.08 for PCE and from <0.001 to 0.005 for RECODe, where higher P indicates better calibration. Approximately two-thirds of veterans would meet a guideline indication for high-intensity statin therapy based on the PCE versus only 10% to 15% using VA-fitted models. 
CONCLUSIONS Existing ASCVD risk equations overestimate risk in veterans with diabetes mellitus, potentially impacting guideline-indicated statin therapy. Prediction model performance can be improved for a health system's patients using readily available electronic health record data.
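Harrell's C statistic, used above for discrimination, is the share of comparable patient pairs in which the patient with the earlier event was assigned the higher predicted risk. A minimal sketch under simplifying assumptions (no special handling of tied event times; hypothetical toy data) is:

```python
def harrell_c(times, events, risk_scores):
    """Concordance between predicted risk and observed time to event."""
    concordant = 0.0
    usable_pairs = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            # Comparable pair: subject i had an observed event before subject j's time.
            if times[i] < times[j]:
                usable_pairs += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    return concordant / usable_pairs if usable_pairs else float("nan")


# Hypothetical follow-up in years, event indicator (1 = ASCVD event, 0 = censored),
# and predicted 10-year risk for four patients.
follow_up = [2, 4, 6, 8]
had_event = [1, 1, 0, 1]
predicted_risk = [0.9, 0.4, 0.5, 0.1]
c_stat = harrell_c(follow_up, had_event, predicted_risk)
print(c_stat)
```

A value of 0.5 means the ranking is no better than chance, which puts the reported range of 0.560 to 0.632 in context.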
Affiliation(s)
- Sridharan Raghavan
- Veterans Affairs Eastern Colorado Healthcare System, Aurora, CO
- Division of Hospital Medicine, University of Colorado School of Medicine, Aurora, CO
- Colorado Cardiovascular Outcomes Research Consortium, Aurora, CO
| | - Yuk-Lam Ho
- Veterans Affairs Boston Healthcare System, Boston, MA
| | - Jason L. Vassy
- Veterans Affairs Boston Healthcare System, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA
| | - Daniel Posner
- Veterans Affairs Boston Healthcare System, Boston, MA
| | | | - Lauren Costa
- Veterans Affairs Boston Healthcare System, Boston, MA
| | - Lawrence S. Phillips
- Atlanta Veterans Affairs Medical Center, Decatur, GA
- Division of Endocrinology, Emory University School of Medicine, Atlanta, GA
| | - David R. Gagnon
- Veterans Affairs Boston Healthcare System, Boston, MA
- Department of Biostatistics, Boston University School of Public Health, Boston, MA
| | - Peter W. F. Wilson
- Atlanta Veterans Affairs Medical Center, Decatur, GA
- Division of Cardiology, Emory University School of Medicine, Atlanta, GA
| | - Kelly Cho
- Veterans Affairs Boston Healthcare System, Boston, MA
- Department of Medicine, Harvard Medical School, Boston, MA
- Division of Aging, Brigham and Women’s Hospital, Boston, MA
|
79
|
Daniel C, Kalra D. Clinical Research Informatics. Yearb Med Inform 2020; 29:203-207. [PMID: 32823317 PMCID: PMC7442510 DOI: 10.1055/s-0040-1702007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
Abstract
Objectives: To summarize key contributions to current research in the field of Clinical Research Informatics (CRI) and to select best papers published in 2019.
Method: A bibliographic search using a combination of MeSH descriptors and free-text terms on CRI was performed using PubMed, followed by a double-blind review in order to select a list of candidate best papers to be then peer-reviewed by external reviewers. After peer-review ranking, a consensus meeting between the two section editors and the editorial team was organized to finally conclude on the selected three best papers.
Results: Among the 517 papers, published in 2019, returned by the search, that were in the scope of the various areas of CRI, the full review process selected three best papers. The first best paper describes the use of a homomorphic encryption technique to enable federated analysis of real-world data while complying more easily with data protection requirements. The authors of the second best paper demonstrate the evidence value of federated data networks reporting a large real world data study related to the first line treatment for hypertension. The third best paper reports the migration of the US Food and Drug Administration (FDA) adverse event reporting system database to the OMOP common data model. This work opens the combined analysis of both spontaneous reporting system and electronic health record (EHR) data for pharmacovigilance.
Conclusions: The most significant research efforts in the CRI field are currently focusing on real world evidence generation and especially the reuse of EHR data. With the progress achieved this year in the areas of phenotyping, data integration, semantic interoperability, and data quality assessment, real world data is becoming more accessible and reusable. High quality data sets are key assets not only for large scale observational studies or for changing the way clinical trials are conducted but also for developing or evaluating artificial intelligence algorithms guiding clinical decision for more personalized care. And lastly, security and confidentiality, ethical and regulatory issues, and more generally speaking data governance are still active research areas this year.
Collapse
Affiliation(s)
- Christel Daniel
  - Information Technology Department, AP-HP, Paris, France
  - Sorbonne University, University Paris 13, Sorbonne Paris Cité, INSERM UMR_S 1142, LIMICS, Paris, France
|
80
|
Vassy JL, Lu B, Ho YL, Galloway A, Raghavan S, Honerlaw J, Tarko L, Russo J, Qazi S, Orkaby AR, Tanukonda V, Djousse L, Gaziano JM, Gagnon DR, Cho K, Wilson PWF. Estimation of Atherosclerotic Cardiovascular Disease Risk Among Patients in the Veterans Affairs Health Care System. JAMA Netw Open 2020; 3:e208236. [PMID: 32662843 PMCID: PMC7361654 DOI: 10.1001/jamanetworkopen.2020.8236] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/06/2023] Open
Abstract
IMPORTANCE Current guidelines recommend statin therapy for millions of US residents for the primary prevention of atherosclerotic cardiovascular disease (ASCVD). It is unclear whether traditional prediction models that do not account for current widespread statin use are sufficient for risk assessment. OBJECTIVES To examine the performance of the Pooled Cohort Equations (PCE) for 5-year ASCVD risk estimation in a contemporary cohort and to test the hypothesis that inclusion of statin therapy improves model performance. DESIGN, SETTING, AND PARTICIPANTS This cohort study included adult patients in the Veterans Affairs health care system without baseline ASCVD. Using national electronic health record data, 3 Cox proportional hazards models were developed to estimate 5-year ASCVD risk, as follows: the variables and published β coefficients from the PCE (model 1), the PCE variables with cohort-derived β coefficients (model 2), and model 2 plus baseline statin use (model 3). Data were collected from January 2002 to December 2012 and analyzed from June 2016 to March 2020. EXPOSURES Traditional ASCVD risk factors from the PCE plus baseline statin use. MAIN OUTCOMES AND MEASURES Incident ASCVD and ASCVD mortality. RESULTS Of 1 672 336 patients in the cohort (mean [SD] baseline age 58.0 [13.8] years, 1 575 163 [94.2%] men, 1 383 993 [82.8%] white), 312 155 (18.7%) were receiving statin therapy at baseline. During 5 years of follow-up, 66 605 (4.0%) experienced an ASCVD event, and 31 878 (1.9%) experienced ASCVD death. Compared with the original PCE, the cohort-derived model did not improve model discrimination in any of the 4 age-sex strata but did improve model calibration. 
The PCE overestimated ASCVD risk compared with the cohort-derived model; 211 237 of 1 136 161 white men (18.6%), 29 634 of 218 463 black men (13.6%), 1741 of 44 399 white women (3.9%), and 836 of 16 034 black women (5.2%) would be potentially eligible for statin therapy under the PCE but not the cohort-derived model. When added to the cohort-derived model, baseline statin therapy was associated with a 7% (95% CI, 5%-9%) lower relative risk of ASCVD and a 25% (95% CI, 23%-28%) lower relative risk for ASCVD death. CONCLUSIONS AND RELEVANCE In this study, lower than expected rates of incident ASCVD events in a contemporary national cohort were observed. The PCE overestimated ASCVD risk, and more than 15% of patients would be potentially eligible for statin therapy based on the PCE but not on a cohort-derived model. In the statin era, health care professionals and systems should base ASCVD risk assessment on models calibrated to their patient populations.
Affiliation(s)
- Jason L. Vassy
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- Bing Lu
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- Yuk-Lam Ho
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Ashley Galloway
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- Sridharan Raghavan
  - Veterans Affairs Eastern Colorado Healthcare System, Aurora
  - Division of Hospital Medicine, University of Colorado School of Medicine, Aurora
  - Colorado Cardiovascular Outcomes Research Consortium, Aurora
- Laura Tarko
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
- John Russo
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Landmark College, Putney, Vermont
- Saadia Qazi
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- Ariela R. Orkaby
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- Vidisha Tanukonda
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts
- Luc Djousse
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- J. Michael Gaziano
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- David R. Gagnon
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts
- Kelly Cho
  - Veterans Affairs Boston Healthcare System, Boston, Massachusetts
  - Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, Massachusetts
- Peter W. F. Wilson
  - Atlanta Veterans Affairs Medical Center, Decatur, Georgia
  - Division of Cardiology, Emory University School of Medicine, Atlanta, Georgia
  - Rollins School of Public Health, Department of Epidemiology, Emory University, Atlanta, Georgia
|
81
|
Siontis KC, Yao X, Pirruccello JP, Philippakis AA, Noseworthy PA. How Will Machine Learning Inform the Clinical Care of Atrial Fibrillation? Circ Res 2020; 127:155-169. [DOI: 10.1161/circresaha.120.316401] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 11/16/2022]
Abstract
Machine learning applications in cardiology have rapidly evolved in the past decade. With the availability of machine learning tools coupled with vast data sources, the management of atrial fibrillation (AF), a common chronic disease with significant associated morbidity and socioeconomic impact, is undergoing a knowledge and practice transformation in the increasingly complex healthcare environment. Among other advances, deep-learning machine learning methods, including convolutional neural networks, have enabled the development of AF screening pathways using the ubiquitous 12-lead ECG to detect asymptomatic paroxysmal AF in at-risk populations (such as those with cryptogenic stroke), the refinement of AF and stroke prediction schemes through comprehensive digital phenotyping using structured and unstructured data abstraction from the electronic health record or wearable monitoring technologies, and the optimization of treatment strategies, ranging from stroke prophylaxis to monitoring of antiarrhythmic drug (AAD) therapy. Although the clinical and population-wide impact of these tools continues to be elucidated, such transformative progress does not come without challenges, such as the concerns about adopting black box technologies, assessing input data quality for training such models, and the risk of perpetuating rather than alleviating health disparities. This review critically appraises the advances of machine learning related to the care of AF thus far, their potential future directions, and its potential limitations and challenges.
Affiliation(s)
- Xiaoxi Yao
  - Robert D and Patricia E Kern Center for the Science of Health Care Delivery (X.Y.), Mayo Clinic, Rochester, MN
  - Division of Health Care Policy and Research, Department of Health Sciences Research (X.Y.), Mayo Clinic, Rochester, MN
- James P. Pirruccello
  - Broad Institute, Cambridge, MA (J.P.P., A.A.P.)
  - Division of Cardiology, Massachusetts General Hospital, Boston (J.P.P.)
- Peter A. Noseworthy
  - Department of Cardiovascular Medicine (K.C.S., P.A.N.), Mayo Clinic, Rochester, MN
|
82
|
Zhao SS, Hong C, Cai T, Xu C, Huang J, Ermann J, Goodson NJ, Solomon DH, Cai T, Liao KP. Incorporating natural language processing to improve classification of axial spondyloarthritis using electronic health records. Rheumatology (Oxford) 2020; 59:1059-1065. [PMID: 31535693 DOI: 10.1093/rheumatology/kez375] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/16/2019] [Revised: 07/22/2019] [Indexed: 12/13/2022] Open
Abstract
OBJECTIVES To develop classification algorithms that accurately identify axial SpA (axSpA) patients in electronic health records, and compare the performance of algorithms incorporating free-text data against approaches using only International Classification of Diseases (ICD) codes. METHODS An enriched cohort of 7853 eligible patients was created from electronic health records of two large hospitals using automated searches (≥1 ICD codes combined with simple text searches). Key disease concepts from free-text data were extracted using NLP and combined with ICD codes to develop algorithms. We created both supervised regression-based algorithms (on a training set of 127 axSpA cases and 423 non-cases) and unsupervised algorithms to identify patients with high probability of having axSpA from the enriched cohort. Their performance was compared against classifications using ICD codes only. RESULTS NLP extracted four disease concepts of high predictive value: ankylosing spondylitis, sacroiliitis, HLA-B27 and spondylitis. The unsupervised algorithm, incorporating both the NLP concept and ICD code for AS, identified the greatest number of patients. By setting the probability threshold to attain 80% positive predictive value, it identified 1509 axSpA patients (mean age 53 years, 71% male). Sensitivity was 0.78, specificity 0.94 and area under the curve 0.93. The two supervised algorithms performed similarly but identified fewer patients. All three outperformed traditional approaches using ICD codes alone (area under the curve 0.80-0.87). CONCLUSION Algorithms incorporating free-text data can accurately identify axSpA patients in electronic health records. Large cohorts identified using these novel methods offer exciting opportunities for future clinical research.
Affiliation(s)
- Sizheng Steven Zhao
  - Institute of Ageing and Chronic Disease, University of Liverpool
  - Department of Academic Rheumatology, Aintree University Hospital, Liverpool, UK
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
- Tianrun Cai
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
  - Harvard Medical School
- Chang Xu
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
- Jie Huang
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
- Joerg Ermann
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
  - Harvard Medical School
- Nicola J Goodson
  - Institute of Ageing and Chronic Disease, University of Liverpool
  - Department of Academic Rheumatology, Aintree University Hospital, Liverpool, UK
- Daniel H Solomon
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
  - Harvard Medical School
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Brigham and Women's Hospital
- Tianxi Cai
  - Harvard Medical School
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Katherine P Liao
  - Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital
  - Harvard Medical School
|
83
|
Raghavan S, Ho YL, Kini V, Rhee MK, Vassy JL, Gagnon DR, Cho K, Wilson PWF, Phillips LS. Association Between Early Hypertension Control and Cardiovascular Disease Incidence in Veterans With Diabetes. Diabetes Care 2019; 42:1995-2003. [PMID: 31515207 PMCID: PMC6754236 DOI: 10.2337/dc19-0686] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/04/2019] [Accepted: 07/26/2019] [Indexed: 02/03/2023]
Abstract
OBJECTIVE Guidelines for hypertension treatment in patients with diabetes diverge regarding the systolic blood pressure (SBP) threshold at which treatment should be initiated and treatment goal. We examined associations of early SBP treatment with atherosclerotic cardiovascular disease (ASCVD) events in U.S. adults with diabetes. RESEARCH DESIGN AND METHODS We studied 43,986 patients with diabetes who newly initiated antihypertensive therapy between 2002 and 2007. Patients were classified into categories based on SBP at treatment initiation (130-139 or ≥140 mmHg) and after 2 years of treatment (100-119, 120-129, 130-139, 140-159, and ≥160 mmHg). The primary outcome was composite ASCVD events (fatal and nonfatal myocardial infarction and stroke), estimated using inverse probability of treatment-weighted Poisson regression and multivariable Cox proportional hazards regression. RESULTS Relative to individuals who initiated treatment when SBP was 130-139 mmHg, those with pretreatment SBP ≥140 mmHg had higher ASCVD risk (hazard ratio 1.10 [95% CI 1.02, 1.19]). Relative to those with pretreatment SBP of 130-139 mmHg and on-treatment SBP of 120-129 mmHg (reference group), ASCVD incidence was higher in those with pretreatment SBP ≥140 mmHg and on-treatment SBP 120-129 mmHg (adjusted incidence rate difference [IRD] 1.0 [-0.2 to 2.1] events/1,000 person-years) and in those who achieved on-treatment SBP 130-139 mmHg (IRD 1.9 [0.6, 3.2] and 1.1 [0.04, 2.2] events/1,000 person-years for those with pretreatment SBP 130-139 mmHg and ≥140 mmHg, respectively). CONCLUSIONS In this observational study, patients with diabetes initiating antihypertensive therapy when SBP was 130-139 mmHg and those achieving on-treatment SBP <130 mmHg had better outcomes than those with higher SBP levels when initiating or after 2 years on treatment.
Affiliation(s)
- Sridharan Raghavan
  - Veterans Affairs Eastern Colorado Healthcare System, Aurora, CO
  - Division of Hospital Medicine, University of Colorado School of Medicine, Aurora, CO
  - Colorado Cardiovascular Outcomes Research Consortium, Aurora, CO
- Yuk-Lam Ho
  - Veterans Affairs Boston Healthcare System, Boston, MA
- Vinay Kini
  - Colorado Cardiovascular Outcomes Research Consortium, Aurora, CO
  - Division of Cardiology, University of Colorado School of Medicine, Aurora, CO
- Mary K Rhee
  - Atlanta Veterans Affairs Medical Center, Decatur, GA
  - Division of Endocrinology, Metabolism, and Lipids, Emory University School of Medicine, Atlanta, GA
- Jason L Vassy
  - Veterans Affairs Boston Healthcare System, Boston, MA
  - Department of Medicine, Harvard Medical School, Boston, MA
  - Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, MA
- David R Gagnon
  - Veterans Affairs Boston Healthcare System, Boston, MA
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA
- Kelly Cho
  - Veterans Affairs Boston Healthcare System, Boston, MA
  - Department of Medicine, Harvard Medical School, Boston, MA
- Peter W F Wilson
  - Atlanta Veterans Affairs Medical Center, Decatur, GA
  - Division of Cardiology, Emory University School of Medicine, Atlanta, GA
- Lawrence S Phillips
  - Atlanta Veterans Affairs Medical Center, Decatur, GA
  - Division of Endocrinology, Metabolism, and Lipids, Emory University School of Medicine, Atlanta, GA
|