1
McGuire D, Markus H, Yang L, Xu J, Montgomery A, Berg A, Li Q, Carrel L, Liu DJ, Jiang B. Dissecting heritability, environmental risk, and air pollution causal effects using > 50 million individuals in MarketScan. Nat Commun 2024; 15:5357. [PMID: 38918381 PMCID: PMC11199552 DOI: 10.1038/s41467-024-49566-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Accepted: 06/10/2024] [Indexed: 06/27/2024]
Abstract
Large national-level electronic health record (EHR) datasets offer new opportunities for disentangling the role of genes and environment through deep phenotype information and approximate pedigree structures. Here we use the approximate geographical locations of patients as a proxy for spatially correlated community-level environmental risk factors. We develop a spatial mixed linear effect (SMILE) model that incorporates both genetics and environmental contribution. We extract EHR and geographical locations from 257,620 nuclear families and compile 1083 disease outcome measurements from the MarketScan dataset. We augment the EHR with publicly available environmental data, including levels of particulate matter 2.5 (PM2.5), nitrogen dioxide (NO2), climate, and sociodemographic data. We refine the estimates of genetic heritability and quantify community-level environmental contributions. We also use wind speed and direction as instrumental variables to assess the causal effects of air pollution. In total, we find PM2.5 or NO2 have statistically significant causal effects on 135 diseases, including respiratory, musculoskeletal, digestive, metabolic, and sleep disorders, where PM2.5 and NO2 tend to affect biologically distinct disease categories. These analyses showcase several robust strategies for jointly modeling genetic and environmental effects on disease risk using large EHR datasets and will benefit upcoming biobank studies in the era of precision medicine.
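The instrumental-variable idea mentioned in this abstract (wind as an instrument for air pollution) can be illustrated with a minimal single-instrument Wald ratio. This is only a sketch under stated assumptions: the toy data, the variable names, and the single binary instrument are hypothetical and are not the SMILE model or the authors' estimator.

```python
def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def ols_slope(exposure, outcome):
    """Naive regression slope; biased when a confounder drives both variables."""
    return covariance(exposure, outcome) / covariance(exposure, exposure)


def iv_wald_estimate(instrument, exposure, outcome):
    """Wald ratio: cov(instrument, outcome) / cov(instrument, exposure)."""
    return covariance(instrument, outcome) / covariance(instrument, exposure)


# Hypothetical toy data: the instrument is uncorrelated with the confounder.
wind = [0, 1, 0, 1]            # instrument
confounder = [0, 0, 1, 1]      # unobserved in practice
pollution = [w + c for w, c in zip(wind, confounder)]
# The true causal effect of pollution on the outcome is 2 by construction.
disease = [2 * p + 5 * c for p, c in zip(pollution, confounder)]

print(ols_slope(pollution, disease))               # confounded estimate
print(iv_wald_estimate(wind, pollution, disease))  # instrumented estimate
```

In this toy setup the naive slope absorbs the confounder's effect, while the instrumented ratio recovers the effect built into the data, because the instrument influences the outcome only through the exposure.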
Affiliation(s)
- Daniel McGuire
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA
- Havell Markus
- MD/PhD Program, Penn State College of Medicine, Hershey, PA, 17033, USA
- Bioinformatics and Genomics PhD Program, Penn State College of Medicine, Hershey, PA, 17033, USA
- Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, 17033, USA
- Lina Yang
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA
- Jingyu Xu
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA
- Austin Montgomery
- MD/PhD Program, Penn State College of Medicine, Hershey, PA, 17033, USA
- Arthur Berg
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA
- Qunhua Li
- Department of Statistics, Penn State University, University Park, PA, USA
- Laura Carrel
- Department of Biochemistry and Molecular Biology, Penn State College of Medicine, Hershey, PA, 17033, USA
- Dajiang J Liu
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA.
- Bibo Jiang
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, PA, 17033, USA.
2
Bazoge A, Morin E, Daille B, Gourraud PA. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review. JMIR Med Inform 2023; 11:e42477. [PMID: 38100200 PMCID: PMC10757232 DOI: 10.2196/42477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Revised: 01/16/2023] [Accepted: 09/07/2023] [Indexed: 12/17/2023]
Abstract
BACKGROUND In recent years, health data collected during the clinical care process have been often repurposed for secondary use through clinical data warehouses (CDWs), which interconnect disparate data from different sources. A large amount of information of high clinical value is stored in unstructured text format. Natural language processing (NLP), which implements algorithms that can operate on massive unstructured textual data, has the potential to structure the data and make clinical information more accessible. OBJECTIVE The aim of this review was to provide an overview of studies applying NLP to textual data from CDWs. It focuses on identifying the (1) NLP tasks applied to data from CDWs and (2) NLP methods used to tackle these tasks. METHODS This review was performed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched for relevant articles in 3 bibliographic databases: PubMed, Google Scholar, and ACL Anthology. We reviewed the titles and abstracts and included articles according to the following inclusion criteria: (1) focus on NLP applied to textual data from CDWs, (2) articles published between 1995 and 2021, and (3) written in English. RESULTS We identified 1353 articles, of which 194 (14.34%) met the inclusion criteria. Among all identified NLP tasks in the included papers, information extraction from clinical text (112/194, 57.7%) and the identification of patients (51/194, 26.3%) were the most frequent tasks. To address the various tasks, symbolic methods were the most common NLP methods (124/232, 53.4%), showing that some tasks can be partially achieved with classical NLP techniques, such as regular expressions or pattern matching that exploit specialized lexica, such as drug lists and terminologies. Machine learning (70/232, 30.2%) and deep learning (38/232, 16.4%) have been increasingly used in recent years, including the most recent approaches based on transformers. NLP methods were mostly applied to English language data (153/194, 78.9%). CONCLUSIONS CDWs are central to the secondary use of clinical texts for research purposes. Although the use of NLP on data from CDWs is growing, there remain challenges in this field, especially with regard to languages other than English. Clinical NLP is an effective strategy for accessing, extracting, and transforming data from CDWs. Information retrieved with NLP can assist in clinical research and have an impact on clinical practice.
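The symbolic approach described in this abstract (regular expressions driven by a specialized lexicon such as a drug list) might look roughly like the sketch below. The lexicon entries, function names, and sample note are hypothetical illustrations, not taken from any of the reviewed studies.

```python
import re

# Hypothetical drug lexicon; a real system would load a curated terminology.
DRUG_LEXICON = ["metformin", "insulin glargine", "gliclazide"]


def build_pattern(terms):
    """Compile one case-insensitive pattern matching any lexicon term as a whole word."""
    # Longest terms first so multi-word entries win over shorter overlapping ones.
    alternatives = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_mentions(text, pattern):
    """Return lexicon mentions in order of appearance, lowercased."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


note = "Patient started on Metformin; insulin glargine held."
print(find_mentions(note, build_pattern(DRUG_LEXICON)))
```

A rule set like this is cheap to build and audit, which is one plausible reason symbolic methods remain common, but it will miss misspellings and paraphrases that statistical models can sometimes catch.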
Affiliation(s)
- Adrien Bazoge
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
- Emmanuel Morin
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Béatrice Daille
- Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004, F-44000 Nantes, France
- Pierre-Antoine Gourraud
- Nantes Université, CHU de Nantes, Pôle Hospitalo-Universitaire 11: Santé Publique, Clinique des données, INSERM, CIC 1413, F-44000 Nantes, France
- Nantes Université, INSERM, CHU de Nantes, École Centrale Nantes, Centre de Recherche Translationnelle en Transplantation et Immunologie, CR2TI, F-44000 Nantes, France
3
Cromer SJ, Chen V, Han C, Marshall W, Emongo S, Greaux E, Majarian T, Florez JC, Mercader J, Udler MS. Algorithmic identification of atypical diabetes in electronic health record (EHR) systems. PLoS One 2022; 17:e0278759. [PMID: 36508462 PMCID: PMC9744270 DOI: 10.1371/journal.pone.0278759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Accepted: 11/22/2022] [Indexed: 12/14/2022]
Abstract
AIMS Understanding atypical forms of diabetes (AD) may advance precision medicine, but methods to identify such patients are needed. We propose an electronic health record (EHR)-based algorithmic approach to identify patients who may have AD, specifically those with insulin-sufficient, non-metabolic diabetes, in order to improve feasibility of identifying these patients through detailed chart review. METHODS Patients with likely T2D were selected using a validated machine-learning (ML) algorithm applied to EHR data. "Typical" T2D cases were removed by excluding individuals with obesity, evidence of dyslipidemia, antibody-positive diabetes, or cystic fibrosis. To filter out likely type 1 diabetes (T1D) cases, we applied six additional "branch algorithms," relying on various clinical characteristics, which resulted in six overlapping cohorts. Diabetes type was classified by manual chart review as atypical, not atypical, or indeterminate due to missing information. RESULTS Of 114,975 biobank participants, the algorithms collectively identified 119 (0.1%) potential AD cases, of which 16 (0.014%) were confirmed after expert review. The branch algorithm that excluded T1D based on outpatient insulin use had the highest percentage yield of AD (13 of 27; 48.2% yield). Together, the 16 AD cases had significantly lower BMI and higher HDL than either unselected T1D or T2D cases identified by ML algorithms (P<0.05). Compared to the ML T1D group, the AD group had a significantly higher T2D polygenic score (P<0.01) and lower hemoglobin A1c (P<0.01). CONCLUSION Our EHR-based algorithms followed by manual chart review identified collectively 16 individuals with AD, representing 0.22% of biobank enrollees with T2D. With a maximum yield of 48% cases after manual chart review, our algorithms have the potential to drastically improve efficiency of AD identification. Recognizing patients with AD may inform on the heterogeneity of T2D and facilitate enrollment in studies like the Rare and Atypical Diabetes Network (RADIANT).
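The exclusion step described in this abstract (start from likely T2D, then drop "typical" presentations) can be sketched as a simple rule filter. The record fields and function names below are hypothetical; the abstract does not specify the actual variable definitions or the six branch algorithms, so none of them are reproduced here.

```python
# Hypothetical boolean flags; the study's real definitions are not given in the abstract.
EXCLUSION_FLAGS = ("obesity", "dyslipidemia", "antibody_positive", "cystic_fibrosis")


def is_candidate_atypical(patient):
    """Keep likely-T2D patients who carry none of the 'typical' exclusion flags."""
    if not patient.get("likely_t2d", False):
        return False
    return not any(patient.get(flag, False) for flag in EXCLUSION_FLAGS)


def percent_yield(confirmed, reviewed):
    """Share of chart-reviewed candidates confirmed as atypical, in percent."""
    return 100.0 * confirmed / reviewed


cohort = [
    {"id": "a", "likely_t2d": True, "obesity": True},
    {"id": "b", "likely_t2d": True},
    {"id": "c", "likely_t2d": False},
]
candidates = [p["id"] for p in cohort if is_candidate_atypical(p)]
print(candidates)
```

Candidates surviving a filter like this would still need manual chart review, which is where the yield figure in the abstract comes from.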
Affiliation(s)
- Sara J. Cromer
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Northeastern University, Boston, Massachusetts, United States of America
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Victoria Chen
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- Christopher Han
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- William Marshall
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Shekina Emongo
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Evelyn Greaux
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Tim Majarian
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Northeastern University, Boston, Massachusetts, United States of America
- Jose C. Florez
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Northeastern University, Boston, Massachusetts, United States of America
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Josep Mercader
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Northeastern University, Boston, Massachusetts, United States of America
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Miriam S. Udler
- Diabetes Unit, Endocrine Division, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Department of Medicine, Harvard Medical School, Boston, Massachusetts, United States of America
- Programs in Metabolism and Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Northeastern University, Boston, Massachusetts, United States of America
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America
4
Olusanya MO, Ogunsakin RE, Ghai M, Adeleke MA. Accuracy of Machine Learning Classification Models for the Prediction of Type 2 Diabetes Mellitus: A Systematic Survey and Meta-Analysis Approach. International Journal of Environmental Research and Public Health 2022; 19:14280. [PMID: 36361161 PMCID: PMC9655196 DOI: 10.3390/ijerph192114280] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/27/2022] [Revised: 10/22/2022] [Accepted: 10/25/2022] [Indexed: 05/13/2023]
Abstract
Soft-computing and statistical learning models have gained substantial momentum in predicting type 2 diabetes mellitus (T2DM) disease. This paper reviews recent soft-computing and statistical learning models in T2DM using a meta-analysis approach. We searched for papers using soft-computing and statistical learning models focused on T2DM published between 2010 and 2021 on three different search engines. Of 1215 studies identified, 34 with 136952 patients met our inclusion criteria. The pooled algorithm's performance was able to predict T2DM with an overall accuracy of 0.86 (95% confidence interval [CI] of [0.82, 0.89]). The classification of diabetes prediction was significantly greater in models with a screening and diagnosis (pooled proportion [95% CI] = 0.91 [0.74, 0.97]) when compared to models with nephropathy (pooled proportion = 0.48 [0.76, 0.89] to 0.88 [0.83, 0.91]). For the prediction of T2DM, the decision trees (DT) models had a pooled accuracy of 0.88 [95% CI: 0.82, 0.92], and the neural network (NN) models had a pooled accuracy of 0.85 [95% CI: 0.79, 0.89]. Meta-regression did not provide any statistically significant findings for the heterogeneous accuracy in studies with different diabetes predictions, sample sizes, and impact factors. Additionally, ML models showed high accuracy for the prediction of T2DM. The predictive accuracy of ML algorithms in T2DM is promising, mainly through DT and NN models. However, there is heterogeneity among ML models. We compared the results and models and concluded that this evidence might help clinicians interpret data and implement optimum models for their dataset for T2DM prediction.
Affiliation(s)
- Micheal O. Olusanya
- Department of Computer Science and Information Technology, Sol Plaatje University, Kimberley 8300, South Africa
- Correspondence:
- Ropo Ebenezer Ogunsakin
- Biostatistics Unit, Discipline of Public Health Medicine, School of Nursing & Public Health, College of Health Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Meenu Ghai
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
- Matthew Adekunle Adeleke
- Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Durban 4000, South Africa
5
Application of machine learning methods for the prediction of true fasting status in patients performing blood tests. Sci Rep 2022; 12:11929. [PMID: 35831336 PMCID: PMC9279373 DOI: 10.1038/s41598-022-15161-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2021] [Accepted: 06/20/2022] [Indexed: 11/28/2022]
Abstract
The fasting blood glucose (FBG) values extracted from electronic medical records (EMR) are assumed valid in existing research, which may cause diagnostic bias due to misclassification of fasting status. We proposed a machine learning (ML) algorithm to predict the fasting status of blood samples. This cross-sectional study was conducted using the EMR of a medical center from 2003 to 2018, and a total of 2,196,833 ontological FBGs from the outpatient service were enrolled. The theoretical true fasting status was identified by comparing the values of ontological FBG with average glucose levels derived from concomitantly tested HbA1c, based on multiple criteria. In addition to multiple logistic regression, we extracted 67 features to predict the fasting status by eXtreme Gradient Boosting (XGBoost). The discrimination and calibration of the prediction models were also assessed. Real-world performance was gauged by the prevalence of ineffective glucose measurement (IGM). Of the 784,340 ontologically labeled fasting samples, 77.1% were considered theoretical FBGs. The median (IQR) glucose and HbA1c levels of ontological and theoretical fasting samples in patients without diabetes mellitus (DM) were 94.0 (87.0, 102.0) mg/dL and 5.6 (5.4, 5.9)%, and 92.0 (86.0, 99.0) mg/dL and 5.6 (5.4, 5.9)%, respectively. In the parsimonious approach, XGBoost showed comparable calibration and a higher AUROC (0.887) than multiple logistic regression (0.868), and identified glucose level, home-to-hospital distance, age, and concomitant serum creatinine and lipid testing as important predictors. The prevalence of IGM dropped from 27.8% based on ontological FBGs to 0.48% by using algorithm-verified FBGs. The proposed ML algorithm or multiple logistic regression model aids in verification of the fasting status.
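The comparison of a recorded fasting glucose with an HbA1c-derived average glucose can be sketched as follows. The conversion uses the widely cited ADAG linear relation, eAG (mg/dL) = 28.7 × HbA1c (%) − 46.7. The decision rule and its margin are hypothetical stand-ins, since the abstract does not spell out the study's criteria.

```python
def estimated_average_glucose(hba1c_percent):
    """Estimated average glucose in mg/dL from HbA1c in percent (ADAG relation)."""
    return 28.7 * hba1c_percent - 46.7


def plausibly_fasting(measured_glucose_mg_dl, hba1c_percent, margin_mg_dl=0.0):
    """Hypothetical rule: a fasting value should not exceed the HbA1c-derived average.

    The margin parameter is an assumption for illustration, not a published threshold.
    """
    return measured_glucose_mg_dl <= estimated_average_glucose(hba1c_percent) + margin_mg_dl


print(estimated_average_glucose(6.0))
print(plausibly_fasting(94.0, 5.6))
print(plausibly_fasting(180.0, 5.6))
```

A sample flagged as implausible by a rule of this kind would be a candidate for relabeling, which is the sort of label the study's classifiers were then trained to predict.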
6
Wang S, Song F, Qiao Q, Liu Y, Chen J, Ma J. A Comparative Study of Natural Language Processing Algorithms Based on Cities Changing Diabetes Vulnerability Data. Healthcare (Basel) 2022; 10:1119. [PMID: 35742169 PMCID: PMC9223144 DOI: 10.3390/healthcare10061119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Revised: 06/08/2022] [Accepted: 06/13/2022] [Indexed: 11/16/2022]
Abstract
(1) Background: Poor adherence to management behaviors in Chinese Type 2 diabetes mellitus (T2DM) patients leads to an uncontrolled prognosis of diabetes, which results in significant economic costs for China. It is imperative to quickly locate vulnerability factors in the management behavior of patients with T2DM. (2) Methods: In this study, a thematic analysis of the collected interview materials was conducted to construct the themes of T2DM management vulnerability. We explored the applicability of the pre-trained models based on the evaluation metrics in text classification. (3) Results: We constructed 12 themes of vulnerability related to the health and well-being of people with T2DM in Tianjin. We considered that Bidirectional Encoder Representation from Transformers (BERT) performed better in this Natural Language Processing (NLP) task with a shorter completion time. With the splitting ratio of 6:3:1 and batch size of 64 for BERT, the test accuracy was 97.71%, the completion time was 10 min 24 s, and the macro-F1 score was 0.9752. (4) Conclusions: Our results proved the applicability of NLP techniques in this specific Chinese-language medical environment. We filled the knowledge gap in the application of NLP technologies in diabetes management. Our study provided strong support for using NLP techniques to rapidly locate vulnerability factors in T2DM management.
7
Li J, Xu Z, Xu T, Lin S. Predicting Diabetes in Patients with Metabolic Syndrome Using Machine-Learning Model Based on Multiple Years' Data. Diabetes Metab Syndr Obes 2022; 15:2951-2961. [PMID: 36186938 PMCID: PMC9525025 DOI: 10.2147/dmso.s381146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Accepted: 09/16/2022] [Indexed: 11/23/2022]
Abstract
PURPOSE To evaluate the performance of machine-learning models based on multiple years of continuous data to predict incident diabetes among patients with metabolic syndrome. PATIENTS AND METHODS The dataset comprises the health records from 2008 to 2020 including 4510 nondiabetic participants with metabolic syndrome (MetS) at baseline and with at least 6 years of records. MetS was defined according to the International Diabetes Federation (IDF) criteria. Overall, 332 patients developed incident diabetes during the 7±1.4 years of follow-up. Three popular classification algorithms were evaluated on the dataset: logistic regression, random forest, and Xgboost. Five models including single-year models (year 1, year 2, and year 3) and multiple-year models (year 1-2 and year 1-3) were developed for each algorithm. RESULTS The model performances improved with the increasing longitudinal dataset as the area under the receiver operating characteristic curve (AUROC) was boosted for both random forest (year 1-3: AUROC=0.893; year 3: AUROC=0.862; year 1-2: AUROC=0.847; year 2: AUROC=0.838) and Xgboost (year 1-3: AUROC=0.897; year 3: AUROC=0.833; year 1-2: AUROC=0.856; year 2: AUROC=0.823) model. In the multiple-year models, the highest fasting plasma glucose, followed by the mean or lowest level of HbA1c and BMI had the most important predictive value for the onset of diabetes. In the "1-3" year model, "delta weight" which reflects the fluctuations of yearly change of weight was the fourth-most important feature. CONCLUSION This study demonstrated improved performance with the accumulation of longitudinal data when using machine learning for diabetes prediction in MetS patients. For individuals with similar clinical parameters, the variation trends of these parameters could change the risk of future diabetes. This result indicated that models based on longitudinal multiple years' data may provide more personalized assessment tools for risk evaluation.
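The multiple-year features named in this abstract (highest fasting plasma glucose, mean or lowest HbA1c, and a "delta weight" term) could be derived from yearly records along the lines below. The record layout and, in particular, the definition of delta weight as the mean absolute year-to-year change are assumptions made for illustration; the abstract does not give the exact formula.

```python
def multi_year_features(yearly_records):
    """Summarize consecutive yearly check-up records into one feature dict."""
    fpg = [r["fpg"] for r in yearly_records]
    hba1c = [r["hba1c"] for r in yearly_records]
    weight = [r["weight"] for r in yearly_records]
    # Year-to-year weight changes; empty when only one year is available.
    changes = [abs(b - a) for a, b in zip(weight, weight[1:])]
    return {
        "fpg_max": max(fpg),
        "hba1c_mean": sum(hba1c) / len(hba1c),
        "hba1c_min": min(hba1c),
        # Assumed definition: mean absolute yearly change in weight.
        "delta_weight": sum(changes) / len(changes) if changes else 0.0,
    }


records = [
    {"fpg": 5.8, "hba1c": 5.5, "weight": 70.0},
    {"fpg": 6.1, "hba1c": 5.7, "weight": 72.0},
    {"fpg": 6.0, "hba1c": 5.9, "weight": 71.0},
]
print(multi_year_features(records))
```

Features of this kind let two patients with the same latest values differ in predicted risk when their trajectories differ, which is the point the conclusion of the abstract makes.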
Affiliation(s)
- Jing Li
- Department of Health Management, Peking Union Medical College Hospital, Beijing, People’s Republic of China
- Zheng Xu
- Department of AI Research, Digital Health China Technologies Co. Ltd, Beijing, People’s Republic of China
- Tengda Xu
- Department of Health Management, Peking Union Medical College Hospital, Beijing, People’s Republic of China
- Songbai Lin
- Department of Health Management, Peking Union Medical College Hospital, Beijing, People’s Republic of China
- Correspondence: Songbai Lin, Department of Health Management, Peking Union Medical College Hospital, 1# Shuaifuyuan, Dongcheng District, Beijing, 100730, People’s Republic of China, Tel +86 10 6915 9901, Fax +86 10 6915 9901, Email
8
Brady V, Whisenant M, Wang X, Ly VK, Zhu G, Aguilar D, Wu H. Characterization of Symptoms and Symptom Clusters for Type 2 Diabetes Using a Large Nationwide Electronic Health Record Database. Diabetes Spectr 2022; 35:159-170. [PMID: 35668892 PMCID: PMC9160545 DOI: 10.2337/ds21-0064] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
Abstract
OBJECTIVE A variety of symptoms may be associated with type 2 diabetes and its complications. Symptoms in chronic diseases may be described in terms of prevalence, severity, and trajectory and often co-occur in groups, known as symptom clusters, which may be representative of a common etiology. The purpose of this study was to characterize type 2 diabetes-related symptoms using a large nationwide electronic health record (EHR) database. METHODS We acquired Cerner Health Facts, a nationwide EHR database. The type 2 diabetes cohort (n = 1,136,301 patients) was identified using a rule-based phenotype method. A multistep procedure was then used to identify type 2 diabetes-related symptoms based on International Classification of Diseases, 9th and 10th revisions, diagnosis codes. Type 2 diabetes-related symptoms and co-occurring symptom clusters, including their temporal patterns, were characterized based on the longitudinal EHR data. RESULTS Patients had a mean age of 61.4 years, 51.2% were female, and 70.0% were White. Among 1,136,301 patients, there were 8,008,276 occurrences of 59 symptoms. The most frequently reported symptoms included pain, heartburn, shortness of breath, fatigue, and swelling, which occurred in 21-60% of the patients. We also observed over-represented type 2 diabetes symptoms, including difficulty speaking, feeling confused, trouble remembering, weakness, and drowsiness/sleepiness. Some of these are rare and difficult to detect by traditional patient-reported outcomes studies. CONCLUSION To the best of our knowledge, this is the first study to use a nationwide EHR database to characterize type 2 diabetes-related symptoms and their temporal patterns. Fifty-nine symptoms, including both over-represented and rare diabetes-related symptoms, were identified.
Affiliation(s)
- Veronica Brady
- Cizik School of Nursing, The University of Texas Health Science Center at Houston, Houston, TX
- Meagan Whisenant
- Cizik School of Nursing, The University of Texas Health Science Center at Houston, Houston, TX
- Xueying Wang
- School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX
- Vi K. Ly
- School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX
- Gen Zhu
- School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX
- David Aguilar
- McGovern School of Medicine, The University of Texas Health Science Center at Houston, Houston, TX
- Hulin Wu
- School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX
- Corresponding author: Hulin Wu,
9
Lenoir KM, Wagenknecht LE, Divers J, Casanova R, Dabelea D, Saydah S, Pihoker C, Liese AD, Standiford D, Hamman R, Wells BJ. Determining diagnosis date of diabetes using structured electronic health record (EHR) data: the SEARCH for diabetes in youth study. BMC Med Res Methodol 2021; 21:210. [PMID: 34629073 PMCID: PMC8502379 DOI: 10.1186/s12874-021-01394-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2021] [Accepted: 09/07/2021] [Indexed: 12/01/2022]
Abstract
BACKGROUND Disease surveillance of diabetes among youth has relied mainly upon manual chart review. However, increasingly available structured electronic health record (EHR) data have been shown to yield accurate determinations of diabetes status and type. Validated algorithms to determine date of diabetes diagnosis are lacking. The objective of this work is to validate two EHR-based algorithms to determine date of diagnosis of diabetes. METHODS A rule-based ICD-10 algorithm identified youth with diabetes from structured EHR data over the period of 2009 through 2017 within three children's hospitals that participate in the SEARCH for Diabetes in Youth Study: Cincinnati Children's Hospital, Cincinnati, OH, Seattle Children's Hospital, Seattle, WA, and Children's Hospital Colorado, Denver, CO. Previous research and a multidisciplinary team informed the creation of two algorithms based upon structured EHR data to determine date of diagnosis among diabetes cases. An ICD-code algorithm was defined by the year of occurrence of a second ICD-9 or ICD-10 diabetes code. A multiple-criteria algorithm consisted of the year of first occurrence of any of the following: diabetes-related ICD code, elevated glucose, elevated HbA1c, or diabetes medication. We assessed algorithm performance by percent agreement with a gold standard date of diagnosis determined by chart review. RESULTS Among 3777 cases, both algorithms demonstrated high agreement with true diagnosis year and differed in classification (p = 0.006): 86.5% agreement for the ICD code algorithm and 85.9% agreement for the multiple-criteria algorithm. Agreement was high for both type 1 and type 2 cases for the ICD code algorithm. Performance improved over time. CONCLUSIONS Year of occurrence of the second ICD diabetes-related code in the EHR yields an accurate diagnosis date within these pediatric hospital systems. This may lead to increased efficiency and sustainability of surveillance methods for incidence of diabetes among youth.
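The two diagnosis-year rules described in this abstract, and the percent-agreement metric used to score them, can be sketched as follows. Function names and the input layout are hypothetical, and the sketch assumes the caller has already filtered each list down to qualifying events (diabetes ICD codes, elevated glucose, elevated HbA1c, diabetes medication); the thresholds for "elevated" are not given in the abstract.

```python
from datetime import date


def icd_code_year(icd_dates):
    """Year of the second diabetes ICD code, or None with fewer than two codes."""
    ordered = sorted(icd_dates)
    return ordered[1].year if len(ordered) >= 2 else None


def multiple_criteria_year(icd_dates, glucose_dates, hba1c_dates, medication_dates):
    """Year of the earliest qualifying event of any kind, or None if there are none."""
    all_dates = [*icd_dates, *glucose_dates, *hba1c_dates, *medication_dates]
    return min(all_dates).year if all_dates else None


def percent_agreement(predicted_years, gold_years):
    """Percentage of cases where the algorithm's year equals the chart-review year."""
    matches = sum(p == g for p, g in zip(predicted_years, gold_years))
    return 100.0 * matches / len(gold_years)


icd = [date(2012, 5, 1), date(2011, 3, 1)]
print(icd_code_year(icd))
print(multiple_criteria_year(icd, [date(2010, 12, 20)], [], []))
```

Requiring a second code trades a possible delay in the assigned year for protection against a single rule-out or miscoded encounter, while the multiple-criteria rule reacts to the earliest signal of any type.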
Affiliation(s)
- Kristin M Lenoir
- Department of Biostatistics and Data Science, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA.
- Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA.
- Lynne E Wagenknecht
- Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Jasmin Divers
- Division of Health Services Research, NYU Winthrop Research Institute, NYU Long Island School of Medicine, Mineola, NY, USA
- Ramon Casanova
- Department of Biostatistics and Data Science, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Dana Dabelea
- Department of Epidemiology, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Sharon Saydah
- Division of Diabetes Translation, National Center for Chronic Disease Prevention and Health Promotion, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Catherine Pihoker
- Department of Pediatrics, University of Washington, Seattle, WA, USA
- Angela D Liese
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Debra Standiford
- Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Richard Hamman
- Department of Epidemiology, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Brian J Wells
- Department of Biostatistics and Data Science, Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
- Division of Public Health Sciences, Wake Forest School of Medicine, Winston-Salem, NC, USA
10
Turchin A, Florez Builes LF. Using Natural Language Processing to Measure and Improve Quality of Diabetes Care: A Systematic Review. J Diabetes Sci Technol 2021; 15:553-560. [PMID: 33736486 PMCID: PMC8120048 DOI: 10.1177/19322968211000831] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/14/2022]
Abstract
BACKGROUND Real-world evidence research plays an increasingly important role in diabetes care. However, a large fraction of real-world data are "locked" in narrative format. Natural language processing (NLP) technology offers a solution for analysis of narrative electronic data. METHODS We conducted a systematic review of studies of NLP technology focused on diabetes. Articles published prior to June 2020 were included. RESULTS We included 38 studies in the analysis. The majority (24; 63.2%) described only development of NLP tools; the remainder used NLP tools to conduct clinical research. A large fraction (17; 44.7%) of studies focused on identification of patients with diabetes; the rest covered a broad range of subjects that included hypoglycemia, lifestyle counseling, diabetic kidney disease, insulin therapy and others. The mean F1 score for all studies where it was available was 0.882. It tended to be lower (0.817) in studies of more linguistically complex concepts. Seven studies reported findings with potential implications for improving delivery of diabetes care. CONCLUSION Research in NLP technology to study diabetes is growing quickly, although challenges (e.g. in analysis of more linguistically complex concepts) remain. Its potential to deliver evidence on treatment and improving quality of diabetes care is demonstrated by a number of studies. Further growth in this area would be aided by deeper collaboration between developers and end-users of natural language processing tools as well as by broader sharing of the tools themselves and related resources.
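As a reading aid for the F1 scores summarized above: F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name and the counts in the usage note are illustrative assumptions and are not taken from any of the reviewed studies.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true-positive, false-positive and false-negative counts.

    Returns 0.0 when precision and recall are both zero or undefined.
    """
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

With hypothetical counts of 80 true positives, 20 false positives, and 20 false negatives, precision and recall are both 0.8, so `f1_score(80, 20, 20)` is also approximately 0.8.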
Affiliation(s)
- Alexander Turchin
  - Brigham and Women’s Hospital, Boston, MA, USA
  - Alexander Turchin, MD, MS, Brigham and Women’s Hospital, 221 Longwood Avenue, Boston, MA 02115, USA.
11
Lee S, Doktorchik C, Martin EA, D'Souza AG, Eastwood C, Shaheen AA, Naugler C, Lee J, Quan H. Electronic Medical Record-Based Case Phenotyping for the Charlson Conditions: Scoping Review. JMIR Med Inform 2021; 9:e23934. [PMID: 33522976 PMCID: PMC7884219 DOI: 10.2196/23934] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/28/2020] [Revised: 11/20/2020] [Accepted: 12/05/2020] [Indexed: 12/16/2022] Open
Abstract
Background Electronic medical records (EMRs) contain large amounts of rich clinical information. Developing EMR-based case definitions, also known as EMR phenotyping, is an active area of research that has implications for epidemiology, clinical care, and health services research. Objective This review aims to describe and assess the present landscape of EMR-based case phenotyping for the Charlson conditions. Methods A scoping review of EMR-based algorithms for defining the Charlson comorbidity index conditions was completed. This study covered articles published between January 2000 and April 2020, both inclusive. Embase (Excerpta Medica database) and MEDLINE (Medical Literature Analysis and Retrieval System Online) were searched using keywords developed in the following 3 domains: terms related to EMR, terms related to case finding, and disease-specific terms. The manuscript follows the Preferred Reporting Items for Systematic reviews and Meta-analyses extension for Scoping Reviews (PRISMA-ScR) guidelines. Results A total of 274 articles representing 299 algorithms were assessed and summarized. Most studies were undertaken in the United States (181/299, 60.5%), followed by the United Kingdom (42/299, 14.0%) and Canada (15/299, 5.0%). These algorithms were mostly developed either in primary care (103/299, 34.4%) or inpatient (168/299, 56.2%) settings. Diabetes, congestive heart failure, myocardial infarction, and rheumatology had the highest number of developed algorithms. Data-driven and clinical rule–based approaches have been identified. EMR-based phenotype and algorithm development reflect the data access allowed by respective health systems, and algorithms vary in their performance. Conclusions Recognizing similarities and differences in health systems, data collection strategies, extraction, data release protocols, and existing clinical pathways is critical to algorithm development strategies. Several strategies to assist with phenotype-based case definitions have been proposed.
Affiliation(s)
- Seungwon Lee
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Alberta Health Services, Calgary, AB, Canada
  - Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Chelsea Doktorchik
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Elliot Asher Martin
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Alberta Health Services, Calgary, AB, Canada
- Adam Giles D'Souza
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Alberta Health Services, Calgary, AB, Canada
- Cathy Eastwood
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Abdel Aziz Shaheen
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Medicine, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Christopher Naugler
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Pathology and Laboratory Medicine, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Joon Lee
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Data Intelligence for Health Lab, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Cardiac Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Hude Quan
  - Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
  - Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
12
Weber C, Röschke L, Modersohn L, Lohr C, Kolditz T, Hahn U, Ammon D, Betz B, Kiehntopf M. Optimized Identification of Advanced Chronic Kidney Disease and Absence of Kidney Disease by Combining Different Electronic Health Data Resources and by Applying Machine Learning Strategies. J Clin Med 2020; 9:2955. [PMID: 32932685 PMCID: PMC7563476 DOI: 10.3390/jcm9092955] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 06/25/2020] [Revised: 08/26/2020] [Accepted: 08/28/2020] [Indexed: 12/31/2022] Open
Abstract
Automated identification of advanced chronic kidney disease (CKD ≥ III) and of no known kidney disease (NKD) can support both clinicians and researchers. We hypothesized that identification of CKD and NKD can be improved by combining information from different electronic health record (EHR) resources, comprising laboratory values, discharge summaries and ICD-10 billing codes, compared to using each component alone. We included EHRs from 785 elderly multimorbid patients, hospitalized between 2010 and 2015, that were divided into a training and a test (n = 156) dataset. We used both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUCPR) with a 95% confidence interval for evaluation of different classification models. In the test dataset, the combination of EHR components as a simple classifier identified CKD ≥ III (AUROC 0.96[0.93-0.98]) and NKD (AUROC 0.94[0.91-0.97]) better than laboratory values (AUROC CKD 0.85[0.79-0.90], NKD 0.91[0.87-0.94]), discharge summaries (AUROC CKD 0.87[0.82-0.92], NKD 0.84[0.79-0.89]) or ICD-10 billing codes (AUROC CKD 0.85[0.80-0.91], NKD 0.77[0.72-0.83]) alone. Logistic regression and machine learning models improved recognition of CKD ≥ III compared to the simple classifier if only laboratory values were used (AUROC 0.96[0.92-0.99] vs. 0.86[0.81-0.91], p < 0.05) and improved recognition of NKD if information from previous hospital stays was used (AUROC 0.99[0.98-1.00] vs. 0.95[0.92-0.97], p < 0.05). Depending on the availability of data, correct automated identification of CKD ≥ III and NKD from EHRs can be improved by generating classification models based on the combination of different EHR components.
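To make the AUROC figures above easier to interpret, here is a minimal sketch of how an AUROC can be computed from binary labels and classifier scores. It uses the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. This is not the authors' implementation; the function name is an assumption, and the 95% confidence intervals reported in the abstract are not reproduced here.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of classifier scores (higher means more likely positive).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # Count positive/negative pairs ranked correctly; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a test set of the size mentioned above (n = 156) but would need a sort-based variant for large cohorts.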
Affiliation(s)
- Christoph Weber
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
- Lena Röschke
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
- Luise Modersohn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Christina Lohr
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Tobias Kolditz
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Udo Hahn
  - Jena University Language & Information Engineering (JULIE) Lab, Friedrich Schiller University Jena, 07743 Jena, Germany
- Danny Ammon
  - Data Integration Center, Jena University Hospital, 07743 Jena, Germany
- Boris Betz
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
  - Correspondence: Tel.: +49-3641-9-325074 (B.B.); +49-3641-9-325001 (M.K.)
- Michael Kiehntopf
  - Department of Clinical Chemistry and Laboratory Diagnostics and Integrated Biobank Jena (IBBJ), Jena University Hospital, 07747 Jena, Germany
  - Correspondence: Tel.: +49-3641-9-325074 (B.B.); +49-3641-9-325001 (M.K.)
13
Kuo KM, Talley P, Kao Y, Huang CH. A multi-class classification model for supporting the diagnosis of type II diabetes mellitus. PeerJ 2020; 8:e9920. [PMID: 32974105 PMCID: PMC7487151 DOI: 10.7717/peerj.9920] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/21/2020] [Accepted: 08/20/2020] [Indexed: 12/21/2022] Open
Abstract
Background Numerous studies have utilized machine-learning techniques to predict the early onset of type 2 diabetes mellitus. However, fewer studies have been conducted to predict an appropriate diagnosis code for the type 2 diabetes mellitus condition. Further, ensemble techniques such as bagging and boosting have likewise been utilized to an even lesser extent. The present study aims to identify appropriate diagnosis codes for type 2 diabetes mellitus patients by building a multi-class prediction model that is parsimonious and uses a minimal set of features. In addition, the importance of features for predicting the diagnosis code is provided. Methods This study included 149 patients with type 2 diabetes mellitus. The sample was collected from a large hospital in Taiwan from November 2017 to May 2018. Machine learning algorithms including instance-based, decision tree, deep neural network, and ensemble algorithms were used to build the predictive models. Average accuracy, area under the receiver operating characteristic curve, Matthews correlation coefficient, macro-precision, recall, weighted average of precision and recall, and model process time were used to assess the performance of the built models. Information gain and gain ratio were used to demonstrate feature importance. Results The results showed that most algorithms, except for the deep neural network, performed well on all performance indices regardless of whether the training or the testing dataset was used. Ten features and their importance in determining the diagnosis code of type 2 diabetes mellitus were identified. Our proposed predictive model can be further developed into a clinical diagnosis support system or integrated into existing healthcare information systems. Both methods of application can effectively support physicians whenever they are diagnosing type 2 diabetes mellitus patients in order to foster better patient-care planning.
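The abstract above names information gain and gain ratio as its feature-importance measures. As a hedged illustration of those two quantities for a single categorical feature, the sketch below uses the standard entropy-based definitions. The function names are assumptions, the code is not the study's implementation, and it does not reproduce the ten features the study identified.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself (split information)."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info else 0.0
```

Gain ratio divides by the split information so that features with many distinct values are not favored merely for fragmenting the data, which is the usual reason for reporting both measures together.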
Affiliation(s)
- Kuang-Ming Kuo
  - Department of Healthcare Administration, I-Shou University, Kaohsiung City, Taiwan, Republic of China
- Paul Talley
  - Department of Applied English, I-Shou University, Kaohsiung City, Taiwan, Republic of China
- YuHsi Kao
  - Department of Endocrinology, E-Da Hospital, Kaohsiung City, Taiwan, Republic of China
- Chi Hsien Huang
  - Department of Family Medicine, E-Da Hospital, I-Shou University, Kaohsiung City, Taiwan, Republic of China
  - Department of Community Healthcare and Geriatrics, Nagoya University Graduate School of Medicine, Nagoya, Japan
14
Siontis KC, Yao X, Pirruccello JP, Philippakis AA, Noseworthy PA. How Will Machine Learning Inform the Clinical Care of Atrial Fibrillation? Circ Res 2020; 127:155-169. [DOI: 10.1161/circresaha.120.316401] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 11/16/2022]
Abstract
Machine learning applications in cardiology have rapidly evolved in the past decade. With the availability of machine learning tools coupled with vast data sources, the management of atrial fibrillation (AF), a common chronic disease with significant associated morbidity and socioeconomic impact, is undergoing a knowledge and practice transformation in the increasingly complex healthcare environment. Among other advances, deep learning methods, including convolutional neural networks, have enabled the development of AF screening pathways using the ubiquitous 12-lead ECG to detect asymptomatic paroxysmal AF in at-risk populations (such as those with cryptogenic stroke), the refinement of AF and stroke prediction schemes through comprehensive digital phenotyping using structured and unstructured data abstraction from the electronic health record or wearable monitoring technologies, and the optimization of treatment strategies, ranging from stroke prophylaxis to monitoring of antiarrhythmic drug (AAD) therapy. Although the clinical and population-wide impact of these tools continues to be elucidated, such transformative progress does not come without challenges, such as the concerns about adopting black box technologies, assessing input data quality for training such models, and the risk of perpetuating rather than alleviating health disparities. This review critically appraises the advances of machine learning related to the care of AF thus far, their potential future directions, and their potential limitations and challenges.
Affiliation(s)
- Xiaoxi Yao
  - Robert D and Patricia E Kern Center for the Science of Health Care Delivery (X.Y.), Mayo Clinic, Rochester, MN
  - Division of Health Care Policy and Research, Department of Health Sciences Research (X.Y.), Mayo Clinic, Rochester, MN
- James P. Pirruccello
  - Broad Institute, Cambridge, MA (J.P.P., A.A.P.)
  - Division of Cardiology, Massachusetts General Hospital, Boston (J.P.P.)
- Peter A. Noseworthy
  - From the Department of Cardiovascular Medicine (K.C.S., P.A.N.), Mayo Clinic, Rochester, MN