1
van Assen M, Tariq A, Razavi AC, Yang C, Banerjee I, De Cecco CN. Fusion Modeling: Combining Clinical and Imaging Data to Advance Cardiac Care. Circ Cardiovasc Imaging 2023; 16:e014533. [PMID: 38073535 PMCID: PMC10754220 DOI: 10.1161/circimaging.122.014533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2023]
Abstract
In addition to the traditional clinical risk factors, an increasing number of imaging biomarkers have shown value for cardiovascular risk prediction. Clinical and imaging data are captured from a variety of data sources during multiple patient encounters and are often analyzed independently. Initial studies showed that fusion of both clinical and imaging features results in superior prognostic performance compared with traditional scores. There are different approaches to fusion modeling, combining multiple data resources to optimize predictions, each with its own advantages and disadvantages. However, manual extraction of clinical and imaging data is time- and labor-intensive and often not feasible in clinical practice. An automated approach for clinical and imaging data extraction is highly desirable. Convolutional neural networks and natural language processing can be utilized for the extraction of electronic medical record data, imaging studies, and free-text data. This review outlines the current status of cardiovascular risk prediction and fusion modeling, and gives an overview of different artificial intelligence approaches to automatically extract data from images and electronic medical records for this purpose.
Affiliation(s)
- Marly van Assen
  - Translational Laboratory for Cardiothoracic Imaging and Artificial Intelligence, Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA, USA
- Amara Tariq
  - Machine Intelligence in Medicine and Imaging (MI-2) Lab, Mayo Clinic, AZ, USA
- Alexander C. Razavi
  - Translational Laboratory for Cardiothoracic Imaging and Artificial Intelligence, Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA, USA
  - Emory Clinical Cardiovascular Research Institute, Emory University, Atlanta, GA, USA
- Carl Yang
  - Computer Science, Emory University, Atlanta, GA, USA
- Imon Banerjee
  - Machine Intelligence in Medicine and Imaging (MI-2) Lab, Mayo Clinic, AZ, USA
- Carlo N. De Cecco
  - Translational Laboratory for Cardiothoracic Imaging and Artificial Intelligence, Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA, USA
  - Division of Cardiothoracic Imaging, Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA, USA
2
Cai J, Chen S, Guo S, Wang S, Li L, Liu X, Zheng K, Liu Y, Chen S. RegEMR: a natural language processing system to automatically identify premature ovarian decline from Chinese electronic medical records. BMC Med Inform Decis Mak 2023; 23:126. [PMID: 37464410 PMCID: PMC10353087 DOI: 10.1186/s12911-023-02239-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2023] [Accepted: 07/13/2023] [Indexed: 07/20/2023] Open
Abstract
BACKGROUND The ovarian reserve is a reservoir for reproductive potential. In clinical practice, early detection and treatment of premature ovarian decline characterized by abnormal ovarian reserve tests is regarded as a critical measure to prevent infertility. However, the relevant data are typically stored in an unstructured format in a hospital's electronic medical record (EMR) system, and their retrieval requires tedious manual abstraction by domain experts. Computational tools are therefore needed to reduce the workload. METHODS We presented RegEMR, an artificial intelligence tool composed of a rule-based natural language processing (NLP) extractor and a knowledge-based disease scoring model, to automate the screening procedure of premature ovarian decline using Chinese reproductive EMRs. We used regular expressions (REs) as a text mining method and explored whether REs automatically synthesized by the genetic programming-based online platform RegexGenerator++ could be as effective as manually formulated REs. We also investigated how the representativeness of the learning corpus affected the performance of machine-generated REs. Additionally, we translated the clinical diagnostic criteria into a programmable disease diagnostic model for disease scoring and risk stratification. Four hundred outpatient medical records were collected from a Chinese fertility center. Manual review served as the gold standard, and fivefold cross-validation was used for evaluation. RESULTS The overall F-score of manually built REs was 0.9444 (95% CI 0.9373 to 0.9515), with no significant difference (paired t test p > 0.05) compared with machine-generated REs that could be affected by training set sizes and annotation portions. The extractor performed effectively in automatically tracing the dynamic changes in hormone levels (F-score 0.9518-0.9884) and ultrasonographic measures (F-score 0.9472-0.9822). Applying the extracted information to the proposed diagnostic model, the program obtained an accuracy of 0.98 and a sensitivity of 0.93 in risk screening. For each specific disease, the automatic diagnosis in 76% of patients was consistent with that of the clinical diagnosis, and the kappa coefficient was 0.63. CONCLUSION A Chinese NLP system named RegEMR was developed to automatically identify high risk of early ovarian aging and diagnose related diseases from Chinese reproductive EMRs. We hope that this system can aid EMR-based data collection and clinical decision support in fertility centers.
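As an illustration of the rule-based extraction and F-score evaluation described in this abstract, the following is a minimal sketch only. The test names, units, pattern, and English sample text are hypothetical assumptions chosen for readability; they are not the regular expressions used in RegEMR, which operates on Chinese records.

```python
import re

# Hypothetical pattern (an assumption, not taken from RegEMR): a hormone test
# name, an optional ":" or "=", a numeric value, and an optional unit.
LAB_PATTERN = re.compile(
    r"\b(?P<test>FSH|LH|AMH|E2)\s*[:=]?\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mIU/mL|ng/mL|pg/mL)?",
    re.IGNORECASE,
)


def extract_labs(note: str) -> list[dict]:
    """Return one record per test/value mention found in a free-text note."""
    return [
        {
            "test": m.group("test").upper(),
            "value": float(m.group("value")),
            "unit": m.group("unit"),  # None when no unit follows the value
        }
        for m in LAB_PATTERN.finditer(note)
    ]


def f_score(predicted: set, gold: set) -> float:
    """F1: harmonic mean of precision and recall over extracted items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In an evaluation of this kind, the set of extracted items would be compared against the manually reviewed gold standard for each cross-validation fold.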
Affiliation(s)
- Jie Cai
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Shenglin Chen
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Siyun Guo
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Suidong Wang
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Lintong Li
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Xiaotong Liu
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Keming Zheng
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Yudong Liu
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
- Shiling Chen
  - Center for Reproductive Medicine, Department of Gynecology and Obstetrics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China
3
Epstein RH, Jean YK, Dudaryk R, Freundlich RE, Walco JP, Mueller DA, Banks SE. Natural Language Mapping of Electrocardiogram Interpretations to a Standardized Ontology. Methods Inf Med 2021; 60:104-109. [PMID: 34610644 DOI: 10.1055/s-0041-1736312] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Abstract
BACKGROUND Interpretations of the electrocardiogram (ECG) are often prepared using software outside the electronic health record (EHR) and imported via an interface as a narrative note. Thus, natural language processing is required to create a computable representation of the findings. Challenges include misspellings, nonstandard abbreviations, jargon, and equivocation in diagnostic interpretations. OBJECTIVES Our objective was to develop an algorithm to reliably and efficiently extract such information and map it to the standardized ECG ontology developed jointly by the American Heart Association, the American College of Cardiology Foundation, and the Heart Rhythm Society. The algorithm was to be designed to be easily modifiable for use with EHRs and ECG reporting systems other than the ones studied. METHODS An algorithm using natural language processing techniques was developed in structured query language to extract and map quantitative and diagnostic information from ECG narrative reports to the cardiology societies' standardized ECG ontology. The algorithm was developed using a training dataset of 43,861 ECG reports and applied to a test dataset of 46,873 reports. RESULTS Accuracy, precision, recall, and the F1-measure were all 100% in the test dataset for the extraction of quantitative data (e.g., PR and QTc interval, atrial and ventricular heart rate). Performances for matches in each diagnostic category in the standardized ECG ontology were all above 99% in the test dataset. The processing speed was approximately 20,000 reports per minute. We externally validated the algorithm at another institution that used a different ECG reporting system and found similar performance. CONCLUSION The developed algorithm had high performance for creating a computable representation of ECG interpretations. Software and lookup tables are provided that can easily be modified for local customization and for use with other EHR and ECG reporting systems. This algorithm has utility for research and in clinical decision support where incorporation of ECG findings is desired.
Affiliation(s)
- Richard H Epstein
  - Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, Miami, Florida, United States
- Yuel-Kai Jean
  - Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, Miami, Florida, United States
- Roman Dudaryk
  - Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, Miami, Florida, United States
- Robert E Freundlich
  - Anesthesiology Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States
- Jeremy P Walco
  - Anesthesiology Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States
- Dorothee A Mueller
  - Anesthesiology Critical Care Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States
- Shawn E Banks
  - Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, Miami, Florida, United States
4
Robinson JR, Wei WQ, Roden DM, Denny JC. Defining Phenotypes from Clinical Data to Drive Genomic Research. Annu Rev Biomed Data Sci 2018; 1:69-92. [PMID: 34109303 DOI: 10.1146/annurev-biodatasci-080917-013335] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 12/16/2022]
Abstract
The rise in available longitudinal patient information in electronic health records (EHRs) and their coupling to DNA biobanks has resulted in a dramatic increase in genomic research using EHR data for phenotypic information. EHRs have the benefit of providing a deep and broad data source of health-related phenotypes, including drug response traits, expanding the phenome available to researchers for discovery. The earliest efforts at repurposing EHR data for research involved manual chart review of limited numbers of patients but now typically involve applications of rule-based and machine learning algorithms operating on sometimes huge corpora for both genome-wide and phenome-wide approaches. We highlight here the current methods, impact, challenges, and opportunities for repurposing clinical data to define patient phenotypes for genomics discovery. Use of EHR data has proven a powerful method for elucidation of genomic influences on diseases, traits, and drug-response phenotypes and will continue to have increasing applications in large cohort studies.
Affiliation(s)
- Jamie R Robinson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of General Surgery, Vanderbilt University Medical Center, Nashville, TN
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
- Dan M Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
  - Department of Pharmacology, Vanderbilt University Medical Center
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
5
Ye Y, Larrat EP, Caffrey AR. Algorithms used to identify ventricular arrhythmias and sudden cardiac death in retrospective studies: a systematic literature review. Ther Adv Cardiovasc Dis 2017; 12:39-51. [PMID: 29224509 DOI: 10.1177/1753944717745493] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 12/29/2022] Open
Abstract
Drug-induced QT interval prolongation may increase the risk of sudden cardiac death or ventricular arrhythmias (SCD/VA), and therefore affects the safety profile of medications. Administrative databases can be used to inform pharmacoepidemiologic drug safety studies for such rare events. In order to compare event rates between studies, validated operational definitions of these events are needed. We conducted a systematic literature review in PubMed to identify algorithms for SCD/VA. Twenty-two studies were included in the review. Fifteen (68%) studies evaluated International Classification of Diseases, 9th revision (ICD-9) based medical data, of which six utilized a common, validated operational definition. This algorithm was based on principal hospitalization discharge diagnosis or first-listed emergency department visit diagnosis, with an average positive predictive value (PPV) of 85%. Four studies evaluated ICD-9 based death data, of which three utilized a common algorithm with an average PPV of 88%. Further validation of ICD, 10th revision algorithms is needed. In conclusion, we identified a validated algorithm for SCD/VA in medical data, as well as in death data. As such, to ensure comparability between new research and the existing literature, pharmacoepidemiologic research in this area should utilize common, validated algorithms, such as the ones identified in our review, to operationally define these events.
Affiliation(s)
- Yizhou Ye
  - University of Rhode Island, Kingston, RI, USA
- Aisling R Caffrey
  - University of Rhode Island, 7 Greenhouse Rd Room 265B, Kingston, RI 02881, USA
6
Karnes JH, Shaffer CM, Cronin R, Bastarache L, Gaudieri S, James I, Pavlos R, Steiner H, Mosley JD, Mallal S, Denny JC, Phillips EJ, Roden DM. Influence of Human Leukocyte Antigen (HLA) Alleles and Killer Cell Immunoglobulin-Like Receptors (KIR) Types on Heparin-Induced Thrombocytopenia (HIT). Pharmacotherapy 2017; 37:1164-1171. [PMID: 28688202 PMCID: PMC5600645 DOI: 10.1002/phar.1983] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 12/17/2022]
Abstract
Heparin-induced thrombocytopenia (HIT) is an unpredictable, life-threatening, immune-mediated reaction to heparin. Variation in human leukocyte antigen (HLA) genes is now used to prevent immune-mediated adverse drug reactions. Combinations of HLA alleles and killer cell immunoglobulin-like receptors (KIR) are associated with multiple autoimmune diseases and infections. The objective of this study is to evaluate the association of HLA alleles and KIR types, alone or in the presence of different HLA ligands, with HIT. HIT cases and heparin-exposed controls were identified in BioVU, an electronic health record coupled to a DNA biobank. HLA sequencing and KIR type imputation using Illumina OMNI-Quad data were performed. Odds ratios for HLA alleles and KIR types and HLA*KIR interactions using conditional logistic regressions were determined in the overall population and by race/ethnicity. Analysis was restricted to KIR types and HLA alleles with a frequency greater than 0.01. The p values for HLA and KIR association were corrected by using a false discovery rate q<0.05 and HLA*KIR interactions were considered significant at p<0.05. Sixty-five HIT cases and 350 matched controls were identified. No statistical differences in baseline characteristics were observed between cases and controls. The HLA-DRB3*01:01 allele was significantly associated with HIT in the overall population (odds ratio 2.81 [1.57-5.02], p=2.1×10⁻⁴, q=0.02) and in individuals with European ancestry, independent of other alleles. No KIR types were associated with HIT, although a significant interaction was observed between KIR2DS5 and the HLA-C1 KIR binding group (p=0.03). The HLA-DRB3*01:01 allele was identified as a potential risk factor for HIT. This class II HLA gene and allele represent biologically plausible candidates for influencing HIT pathogenesis. We found limited evidence of the role of KIR types in HIT pathogenesis. Replication and further study of the HLA-DRB3*01:01 association is necessary.
Affiliation(s)
- Jason H Karnes
  - Department of Pharmacy Practice and Science, University of Arizona College of Pharmacy, Tucson, AZ
  - Sarver Heart Center, Tucson, AZ
- Christian M Shaffer
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Robert Cronin
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN
- Silvana Gaudieri
  - School of Anatomy, Physiology and Human Biology, University of Western Australia, Nedlands, Western Australia, Australia
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN
  - Institute for Immunology and Infectious Diseases, Murdoch University, Murdoch, Western Australia, Australia
- Ian James
  - Institute for Immunology and Infectious Diseases, Murdoch University, Murdoch, Western Australia, Australia
- Rebecca Pavlos
  - Institute for Immunology and Infectious Diseases, Murdoch University, Murdoch, Western Australia, Australia
- Heidi Steiner
  - Department of Pharmacy Practice and Science, University of Arizona College of Pharmacy, Tucson, AZ
- Jonathan D Mosley
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Simon Mallal
  - Division of Infectious Diseases, Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN
  - Institute for Immunology and Infectious Diseases, Murdoch University, Murdoch, Western Australia, Australia
  - Department of Pathology, Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, TN
- Joshua C Denny
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN
- Elizabeth J Phillips
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
  - Department of Pathology, Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, TN
- Dan M Roden
  - Division of Clinical Pharmacology, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN
  - Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN
7
Cronin RM, Fabbri D, Denny JC, Rosenbloom ST, Jackson GP. A comparison of rule-based and machine learning approaches for classifying patient portal messages. Int J Med Inform 2017; 105:110-120. [PMID: 28750904 DOI: 10.1016/j.ijmedinf.2017.06.004] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 02/15/2017] [Revised: 06/13/2017] [Accepted: 06/20/2017] [Indexed: 12/28/2022]
Abstract
OBJECTIVE Secure messaging through patient portals is an increasingly popular way that consumers interact with healthcare providers. The increasing burden of secure messaging can affect clinic staffing and workflows. Manual management of portal messages is costly and time consuming. Automated classification of portal messages could potentially expedite message triage and delivery of care. MATERIALS AND METHODS We developed automated patient portal message classifiers with rule-based and machine learning techniques using bag of words and natural language processing (NLP) approaches. To evaluate classifier performance, we used a gold standard of 3253 portal messages manually categorized using a taxonomy of communication types (i.e., main categories of informational, medical, logistical, social, and other communications, and subcategories including prescriptions, appointments, problems, tests, follow-up, contact information, and acknowledgement). We evaluated our classifiers' accuracies in identifying individual communication types within portal messages with area under the receiver operating characteristic curve (AUC). Portal messages often contain more than one type of communication. To predict all communication types within single messages, we used the Jaccard Index. We extracted the variables of importance for the random forest classifiers. RESULTS The best-performing approaches to classification for the major communication types were: logistic regression for medical communications (AUC: 0.899); basic (rule-based) for informational communications (AUC: 0.842); and random forests for social communications and logistical communications (AUCs: 0.875 and 0.925, respectively). The best-performing classification approach for individual communication subtypes was random forests for Logistical-Contact Information (AUC: 0.963). The Jaccard Indices by approach were: basic classifier, Jaccard Index: 0.674; Naïve Bayes, Jaccard Index: 0.799; random forests, Jaccard Index: 0.859; and logistic regression, Jaccard Index: 0.861. For medical communications, the most predictive variables were NLP concepts (e.g., Temporal_Concept, which maps to 'morning', 'evening' and Idea_or_Concept which maps to 'appointment' and 'refill'). For logistical communications, the most predictive variables contained similar numbers of NLP variables and words (e.g., Telephone mapping to 'phone', 'insurance'). For social and informational communications, the most predictive variables were words (e.g., social: 'thanks', 'much', informational: 'question', 'mean'). CONCLUSIONS This study applies automated classification methods to the content of patient portal messages and evaluates the application of NLP techniques on consumer communications in patient portal messages. We demonstrated that random forest and logistic regression approaches accurately classified the content of portal messages, although the best approach to classification varied by communication type. Words were the most predictive variables for classification of most communication types, although NLP variables were most predictive for medical communication types. As adoption of patient portals increases, automated techniques could assist in understanding and managing growing volumes of messages. Further work is needed to improve classification performance to potentially support message triage and answering.
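For readers unfamiliar with the Jaccard Index as a multi-label metric, a minimal sketch follows. It assumes each message carries a set of communication-type labels and that per-message scores are averaged; the function names and the averaging choice are illustrative assumptions, not the authors' implementation.

```python
def jaccard_index(predicted: set[str], actual: set[str]) -> float:
    """Size of the label intersection divided by the size of the label union."""
    if not predicted and not actual:
        return 1.0  # assumed convention: two empty label sets agree fully
    return len(predicted & actual) / len(predicted | actual)


def mean_jaccard(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Average the per-message Jaccard Index over (predicted, actual) pairs."""
    return sum(jaccard_index(p, a) for p, a in pairs) / len(pairs)


# One message predicted as medical + logistical but labeled medical only,
# and one message predicted and labeled social.
example = [
    ({"medical", "logistical"}, {"medical"}),
    ({"social"}, {"social"}),
]
print(mean_jaccard(example))
```

A score of 1.0 means the predicted label set matches the gold-standard set exactly; partial overlap is penalized in proportion to the labels that are missing or spurious.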
Affiliation(s)
- Robert M Cronin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
- Daniel Fabbri
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- S Trent Rosenbloom
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
- Gretchen Purcell Jackson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Pediatric Surgery, Vanderbilt University Medical Center, Nashville, TN, USA
8
Kuo TT, Rao P, Maehara C, Doan S, Chaparro JD, Day ME, Farcas C, Ohno-Machado L, Hsu CN. Ensembles of NLP Tools for Data Element Extraction from Clinical Notes. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:1880-1889. [PMID: 28269947 PMCID: PMC5333200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Natural Language Processing (NLP) is essential for concept extraction from narrative text in electronic health records (EHR). To extract numerous and diverse concepts, such as data elements (i.e., important concepts related to a certain medical condition), a plausible solution is to combine various NLP tools into an ensemble to improve extraction performance. However, it is unclear to what extent ensembles of popular NLP tools improve the extraction of numerous and diverse concepts. Therefore, we built an NLP ensemble pipeline to synergize the strength of popular NLP tools using seven ensemble methods, and to quantify the improvement in performance achieved by ensembles in the extraction of data elements for three very different cohorts. Evaluation results show that the pipeline can improve the performance of NLP tools, but there is high variability depending on the cohort.
Affiliation(s)
- Son Doan
  - University of California San Diego, La Jolla, CA
- Chun-Nan Hsu
  - University of California San Diego, La Jolla, CA
9
Teixeira PL, Wei WQ, Cronin RM, Mo H, VanHouten JP, Carroll RJ, LaRose E, Bastarache LA, Rosenbloom ST, Edwards TL, Roden DM, Lasko TA, Dart RA, Nikolai AM, Peissig PL, Denny JC. Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals. J Am Med Inform Assoc 2016; 24:162-171. [PMID: 27497800 DOI: 10.1093/jamia/ocw071] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 11/01/2015] [Revised: 04/03/2016] [Accepted: 04/07/2016] [Indexed: 12/11/2022] Open
Abstract
OBJECTIVE Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites. MATERIALS AND METHODS We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic. RESULTS Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operator characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar. CONCLUSION This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.
Affiliation(s)
- Pedro L Teixeira
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Robert M Cronin
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Huan Mo
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Jacob P VanHouten
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Robert J Carroll
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Eric LaRose
  - Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, 1000 N Oak Ave - ML8, Marshfield, WI 54449, USA
- Lisa A Bastarache
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- S Trent Rosenbloom
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
- Todd L Edwards
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Dan M Roden
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Department of Pharmacology, Vanderbilt University School of Medicine, Nashville, TN, USA
- Thomas A Lasko
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Richard A Dart
  - Center for Human Genetics, Marshfield Clinic Research Foundation, 1000 N Oak Ave-MLR, Marshfield, WI 54449, USA
- Anne M Nikolai
  - Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, 1000 N Oak Ave - ML8, Marshfield, WI 54449, USA
- Peggy L Peissig
  - Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, 1000 N Oak Ave - ML8, Marshfield, WI 54449, USA
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
10
|
Luo Y, Szolovits P. Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records. Biomedical Informatics Insights 2016; 8:29-38. [PMID: 27478379 PMCID: PMC4954589 DOI: 10.4137/bii.s38916] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 05/04/2016] [Revised: 06/13/2016] [Accepted: 06/22/2016] [Indexed: 11/07/2022]
Abstract
In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing to narrative clinical notes in electronic medical records (EMRs) and efficiently retrieving such annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem as the interval query problem, for which the optimal query/update time is in general logarithmic. We next perform a tight time complexity analysis of the basic interval tree query algorithm and show its nonoptimality when applied to a collection of 13 query types from Allen’s interval algebra. We then study two closely related state-of-the-art interval query algorithms, proposed query reformulations, and augmentations to the second algorithm. Our proposed algorithm achieves logarithmic stabbing-max query time complexity and solves the stabbing-interval query tasks on all of Allen’s relations in logarithmic time, attaining the theoretical lower bound. Update time is kept logarithmic and the space requirement is kept linear at the same time. We also discuss interval management in external memory models and higher dimensions.
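To make the stand-off data model concrete, the sketch below stores annotations as character offsets kept apart from the note text and answers two position queries by linear scan. It is a naive baseline, not the logarithmic-time algorithm the abstract describes; the `Annotation` class, the labels, and the sample sentence are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    start: int   # inclusive character offset into the note text
    end: int     # exclusive character offset
    label: str   # annotation content, stored apart from the text


def stab(annotations: List[Annotation], position: int) -> List[Annotation]:
    """Stabbing query: every annotation covering one character position."""
    return [a for a in annotations if a.start <= position < a.end]


def during(annotations: List[Annotation], start: int, end: int) -> List[Annotation]:
    """Allen's 'during' relation: annotations strictly inside (start, end)."""
    return [a for a in annotations if start < a.start and a.end < end]


text = "Patient denies chest pain but reports dyspnea."
annotations = [
    Annotation(15, 25, "SYMPTOM"),   # "chest pain"
    Annotation(38, 45, "SYMPTOM"),   # "dyspnea"
    Annotation(8, 25, "NEGATED"),    # "denies chest pain"
]

for a in stab(annotations, 17):
    print(a.label, repr(text[a.start:a.end]))
```

Each query here costs O(n) in the number of annotations; an interval tree or the augmented structure studied in the paper is what brings this down to logarithmic time.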
Affiliation(s)
- Yuan Luo
  - Assistant Professor, Department of Preventive Medicine, Northwestern University, Chicago, IL, USA
- Peter Szolovits
  - Professor, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
|
11
|
Sauer BC, Jones BE, Globe G, Leng J, Lu CC, He T, Teng CC, Sullivan P, Zeng Q. Performance of a Natural Language Processing (NLP) Tool to Extract Pulmonary Function Test (PFT) Reports from Structured and Semistructured Veteran Affairs (VA) Data. EGEMS 2016; 4:1217. [PMID: 27376095 PMCID: PMC4909376 DOI: 10.13063/2327-9214.1217] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
Abstract
Introduction/Objective: Pulmonary function tests (PFTs) are objective estimates of lung function, but are not reliably stored within the Veteran Health Affairs data systems as structured data. The aim of this study was to validate the natural language processing (NLP) tool we developed—which extracts spirometric values and responses to bronchodilator administration—against expert review, and to estimate the number of additional spirometric tests identified beyond the structured data. Methods: All patients at seven Veteran Affairs Medical Centers with a diagnostic code for asthma Jan 1, 2006–Dec 31, 2012 were included. Evidence of spirometry with a bronchodilator challenge (BDC) was extracted from structured data as well as clinical documents. NLP’s performance was compared against a human reference standard using a random sample of 1,001 documents. Results: In the validation set NLP demonstrated a precision of 98.9 percent (95 percent confidence intervals (CI): 93.9 percent, 99.7 percent), recall of 97.8 percent (95 percent CI: 92.2 percent, 99.7 percent), and an F-measure of 98.3 percent for the forced vital capacity pre- and post pairs and precision of 100 percent (95 percent CI: 96.6 percent, 100 percent), recall of 100 percent (95 percent CI: 96.6 percent, 100 percent), and an F-measure of 100 percent for the forced expiratory volume in one second pre- and post pairs for bronchodilator administration. Application of the NLP increased the proportion identified with complete bronchodilator challenge by 25 percent. Discussion/Conclusion: This technology can improve identification of PFTs for epidemiologic research. Caution must be taken in assuming that a single domain of clinical data can completely capture the scope of a disease, treatment, or clinical test.
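The F-measure values reported above follow from precision and recall through the usual harmonic mean, assuming the balanced F1 form (the abstract does not name the weighting). A minimal sketch, with hypothetical function names:

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of extracted items that are correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of reference-standard items that were extracted."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Forced vital capacity pairs: precision 98.9%, recall 97.8% (from the abstract).
fvc_f = f_measure(0.989, 0.978)
print(f"{fvc_f:.1%}")
```

Plugging in the forced vital capacity figures reproduces the reported 98.3 percent, which supports the assumption that the balanced form was used.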
Affiliation(s)
- Brian C Sauer
  - Salt Lake IDEAS Center, Veteran Affairs; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Barbara E Jones
  - Salt Lake IDEAS Center, Veteran Affairs; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Jianwei Leng
  - Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Chao-Chin Lu
  - Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Tao He
  - Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Chia-Chen Teng
  - Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah
- Patrick Sullivan
  - Department of Pharmacy Practice, School of Pharmacy, Regis University
- Qing Zeng
  - Salt Lake IDEAS Center, Veteran Affairs; Department of Biomedical Informatics, School of Medicine, University of Utah
|
12
|
Denny JC, Spickard A, Speltz PJ, Porier R, Rosenstiel DE, Powers JS. Using natural language processing to provide personalized learning opportunities from trainee clinical notes. J Biomed Inform 2015; 56:292-9. [PMID: 26070431 DOI: 10.1016/j.jbi.2015.06.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/13/2014] [Revised: 06/01/2015] [Accepted: 06/03/2015] [Indexed: 12/20/2022]
Abstract
OBJECTIVE Assessment of medical trainee learning through pre-defined competencies is now commonplace in schools of medicine. We describe a novel electronic advisor system using natural language processing (NLP) to identify two geriatric medicine competencies from medical student clinical notes in the electronic medical record: advance directives (AD) and altered mental status (AMS). MATERIALS AND METHODS Clinical notes from third year medical students were processed using a general-purpose NLP system to identify biomedical concepts and their section context. The system analyzed these notes for relevance to AD or AMS and generated custom email alerts to students with embedded supplemental learning material customized to their notes. Recall and precision of the two advisors were evaluated by physician review. Students were given pre and post multiple choice question tests broadly covering geriatrics. RESULTS Of 102 students approached, 66 students consented and enrolled. The system sent 393 email alerts to 54 students (82%), including 270 for AD and 123 for AMS. Precision was 100% for AD and 93% for AMS. Recall was 69% for AD and 100% for AMS. Students mentioned ADs for 43 patients, with all mentions occurring after first having received an AD reminder. Students accessed educational links 34 times from the 393 email alerts. There was no difference in pre (mean 62%) and post (mean 60%) test scores. CONCLUSIONS The system effectively identified two educational opportunities using NLP applied to clinical notes and demonstrated a small change in student behavior. Use of electronic advisors such as these may provide a scalable model to assess specific competency elements and deliver educational opportunities.
Affiliation(s)
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States; Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States
- Anderson Spickard
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States; Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States
- Peter J Speltz
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
- Renee Porier
  - The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States; Office of Health Sciences Education, Vanderbilt University School of Medicine, Nashville, TN, United States
- Donna E Rosenstiel
  - The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States
- James S Powers
  - Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, United States; The Center for Quality Aging, Vanderbilt University School of Medicine, Nashville, TN, United States; The Meharry Consortium Geriatric Education Center, Meharry Medical Center, Nashville, TN, United States; The Tennessee Valley Geriatric Research Education and Clinical Center, Tennessee Valley Healthcare System, Nashville, TN, United States
|
13
|
Liu M, Hu Y, Tang B. Role of text mining in early identification of potential drug safety issues. Methods Mol Biol 2015; 1159:227-51. [PMID: 24788270 DOI: 10.1007/978-1-4939-0709-0_13] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Abstract
Drugs are an important part of today's medicine, designed to treat, control, and prevent diseases; however, besides their therapeutic effects, drugs may also cause adverse effects that range from cosmetic to severe morbidity and mortality. To identify these potential drug safety issues early, surveillance must be conducted for each drug throughout its life cycle, from drug development to different phases of clinical trials, and continued after market approval. A major aim of pharmacovigilance is to identify the potential drug-event associations that may be novel in nature, severity, and/or frequency. Currently, the state-of-the-art approach for signal detection is through automated procedures by analyzing vast quantities of data for clinical knowledge. There exists a variety of resources for the task, and many of them are textual data that require text analytics and natural language processing to derive high-quality information. This chapter focuses on the utilization of text mining techniques in identifying potential safety issues of drugs from textual sources such as biomedical literature, consumer posts in social media, and narrative electronic medical records.
Affiliation(s)
- Mei Liu
  - Department of Computer Science, New Jersey Institute of Technology, University Heights, Newark, NJ, 07102, USA
|
14
|
Karnes JH, Cronin RM, Rollin J, Teumer A, Pouplard C, Shaffer CM, Blanquicett C, Bowton EA, Cowan JD, Mosley JD, Van Driest SL, Weeke PE, Wells QS, Bakchoul T, Denny JC, Greinacher A, Gruel Y, Roden DM. A genome-wide association study of heparin-induced thrombocytopenia using an electronic medical record. Thromb Haemost 2014; 113:772-81. [PMID: 25503805 DOI: 10.1160/th14-08-0670] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Received: 08/13/2014] [Accepted: 10/27/2014] [Indexed: 12/20/2022]
Abstract
Heparin-induced thrombocytopenia (HIT) is an unpredictable, potentially catastrophic adverse effect of heparin treatment resulting from an immune response to platelet factor 4 (PF4)/heparin complexes. No genome-wide evaluations have been performed to identify potential genetic influences on HIT. Here, we performed a genome-wide association study (GWAS) and candidate gene study using HIT cases and controls identified using electronic medical records (EMRs) coupled to a DNA biobank and attempted to replicate GWAS associations in an independent cohort. We subsequently investigated influences of GWAS-associated single nucleotide polymorphisms (SNPs) on PF4/heparin antibodies in non-heparin treated individuals. In a recessive model, we observed significant SNP associations (odds ratio [OR] 18.52; 95% confidence interval [CI] 6.33-54.23; p=3.18×10⁻⁹) with HIT near the T-Cell Death-Associated Gene 8 (TDAG8). These SNPs are in linkage disequilibrium with a missense TDAG8 SNP. TDAG8 SNPs trended toward an association with HIT in replication analysis (OR 5.71; 0.47-69.22; p=0.17), and the missense SNP was associated with PF4/heparin antibody levels and positive PF4/heparin antibodies in non-heparin treated patients (OR 3.09; 1.14-8.13; p=0.02). In the candidate gene study, SNPs at HLA-DRA were nominally associated with HIT (OR 0.25; 0.15-0.44; p=2.06×10⁻⁶). Further study of TDAG8 and HLA-DRA SNPs is warranted to assess their influence on the risk of developing HIT.
Affiliation(s)
- Dan M Roden
  - Dan M. Roden, MD, MRB4 1285B, 2215B Garland Avenue, Nashville, TN, 37232-0575, USA, Tel.: +1 615 322 0067, Fax: +1 615 343 4522, E-mail:
|
15
|
Rosenbloom ST, Harris P, Pulley J, Basford M, Grant J, DuBuisson A, Rothman RL. The Mid-South Clinical Data Research Network. J Am Med Inform Assoc 2014; 21:627-32. [PMID: 24821742 PMCID: PMC4078290 DOI: 10.1136/amiajnl-2014-002745] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/04/2022]
Abstract
The Mid-South Clinical Data Research Network (CDRN) encompasses three large health systems: (1) Vanderbilt Health System (VU) with electronic medical records for over 2 million patients, (2) the Vanderbilt Healthcare Affiliated Network (VHAN) which currently includes over 40 hospitals, hundreds of ambulatory practices, and over 3 million patients in the Mid-South, and (3) Greenway Medical Technologies, with access to 24 million patients nationally. Initial goals of the Mid-South CDRN include: (1) expansion of our VU data network to include the VHAN and Greenway systems, (2) developing data integration/interoperability across the three systems, (3) improving our current tools for extracting clinical data, (4) optimization of tools for collection of patient-reported data, and (5) expansion of clinical decision support. By 18 months, we anticipate our CDRN will robustly support projects in comparative effectiveness research, pragmatic clinical trials, and other key research areas and have the capacity to share data and health information technology tools nationally.
Affiliation(s)
- S Trent Rosenbloom
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Paul Harris
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jill Pulley
  - Office of Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Office of Personalized Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Melissa Basford
  - Office of Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jason Grant
  - Vanderbilt Health Affiliated Network, Nashville, Tennessee, USA
- Russell L Rothman
  - Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA; Center for Health Services Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
|
16
|
Bui DDA, Zeng-Treitler Q. Learning regular expressions for clinical text classification. J Am Med Inform Assoc 2014; 21:850-7. [PMID: 24578357 DOI: 10.1136/amiajnl-2013-002411] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Indexed: 11/04/2022]
Abstract
OBJECTIVES Natural language processing (NLP) applications typically use regular expressions that have been developed manually by human experts. Our goal is to automate both the creation and utilization of regular expressions in text classification. METHODS We designed a novel regular expression discovery (RED) algorithm and implemented two text classifiers based on RED. The RED+ALIGN classifier combines RED with an alignment algorithm, and RED+SVM combines RED with a support vector machine (SVM) classifier. Two clinical datasets were used for testing and evaluation: the SMOKE dataset, containing 1091 text snippets describing smoking status; and the PAIN dataset, containing 702 snippets describing pain status. We performed 10-fold cross-validation to calculate accuracy, precision, recall, and F-measure metrics. In the evaluation, an SVM classifier was trained as the control. RESULTS The two RED classifiers achieved 80.9-83.0% in overall accuracy on the two datasets, which is 1.3-3% higher than SVM's accuracy (p<0.001). Similarly, small but consistent improvements have been observed in precision, recall, and F-measure when RED classifiers are compared with SVM alone. More significantly, RED+ALIGN correctly classified many instances that were misclassified by the SVM classifier (8.1-10.3% of the total instances and 43.8-53.0% of SVM's misclassifications). CONCLUSIONS Machine-generated regular expressions can be effectively used in clinical text classification. The regular expression-based classifier can be combined with other classifiers, like SVM, to improve classification performance.
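For orientation, this is what the manual baseline looks like: a rule-based classifier whose regular expressions are written by hand, which is the step the RED algorithm aims to automate. The patterns, label names, and sample snippets below are hypothetical and are not taken from the SMOKE dataset.

```python
import re
from typing import List, Pattern, Tuple

# Hand-written rules, checked in order; the first matching pattern wins.
PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("NON_SMOKER", re.compile(
        r"\b(never smoked|non-?smoker|denies (smoking|tobacco))\b", re.I)),
    ("PAST_SMOKER", re.compile(
        r"\b(quit smoking|former smoker|ex-?smoker)\b", re.I)),
    ("CURRENT_SMOKER", re.compile(
        r"\b(current smoker|smokes \d+|\d+ packs? per day)\b", re.I)),
]


def classify(snippet: str, default: str = "UNKNOWN") -> str:
    """Return the label of the first rule that matches the snippet."""
    for label, pattern in PATTERNS:
        if pattern.search(snippet):
            return label
    return default


for snippet in [
    "Patient is a former smoker, quit in 2005.",
    "She denies tobacco use.",
    "Smokes 1 pack per day.",
    "No complaints today.",
]:
    print(classify(snippet), "<-", snippet)
```

Rule order matters in a first-match design like this one, which is part of why hand-maintained rule sets become brittle and why learning or combining them with a statistical classifier is appealing.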
Affiliation(s)
- Duy Duc An Bui
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA; VA Salt Lake City Health Care System, Salt Lake City, Utah, USA
- Qing Zeng-Treitler
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA; VA Salt Lake City Health Care System, Salt Lake City, Utah, USA
|
17
|
Smith J, Denny J, Chen Q, Nian H, Spickard III A, Rosenbloom ST, Miller RA. Lessons learned from developing a drug evidence base to support pharmacovigilance. Appl Clin Inform 2013; 4:596-617. [PMID: 24454585 PMCID: PMC3885918 DOI: 10.4338/aci-2013-08-ra-0062] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 08/23/2013] [Accepted: 11/06/2013] [Indexed: 12/14/2022]
Abstract
OBJECTIVE This work identified challenges associated with extraction and representation of medication-related information from publicly available electronic sources. METHODS We gained direct observational experience through creating and evaluating the Drug Evidence Base (DEB), a repository of drug indications and adverse effects (ADEs), and supplemented this through literature review. We extracted DEB content from the National Drug File Reference Terminology, from aggregated MEDLINE co-occurrence data, and from the National Library of Medicine's DailyMed. To understand better the similarities, differences and problems with the content of DEB and the SIDER Side Effect Resource, and Vanderbilt's MEDI Indication Resource, we carried out statistical evaluations and human expert reviews. RESULTS While DEB, SIDER, and MEDI often agreed on medication indications and side effects, cross-system shortcomings limit their current utility. The drug information resources we evaluated frequently employed multiple, disparate vaguely related UMLS concepts to represent a single specific clinical drug indication or adverse effect. Thus, evaluations comparing drug-indication and drug-ADE coverage for such resources will encounter substantial numbers of false negative and false positive matches. Furthermore, our review found that many indication and ADE relationships are too complex - logically and temporally - to represent within existing systems. CONCLUSION To enhance applicability and utility, future drug information systems deriving indications and ADEs from public resources must represent clinical concepts uniformly and as precisely as possible. Future systems must also better represent the inherent complexity of indications and ADEs.
Affiliation(s)
- J.C. Smith
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- H. Nian
  - Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA; School of Nursing, Vanderbilt University, Nashville, Tennessee, USA
|
18
|
Suh KS, Sarojini S, Youssif M, Nalley K, Milinovikj N, Elloumi F, Russell S, Pecora A, Schecter E, Goy A. Tissue banking, bioinformatics, and electronic medical records: the front-end requirements for personalized medicine. Journal of Oncology 2013; 2013:368751. [PMID: 23818899 PMCID: PMC3683471 DOI: 10.1155/2013/368751] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Received: 02/07/2013] [Revised: 05/03/2013] [Accepted: 05/07/2013] [Indexed: 11/26/2022]
Abstract
Personalized medicine promises patient-tailored treatments that enhance patient care and decrease overall treatment costs by focusing on genetics and "-omics" data obtained from patient biospecimens and records to guide therapy choices that generate good clinical outcomes. The approach relies on diagnostic and prognostic use of novel biomarkers discovered through combinations of tissue banking, bioinformatics, and electronic medical records (EMRs). The analytical power of bioinformatic platforms combined with patient clinical data from EMRs can reveal potential biomarkers and clinical phenotypes that allow researchers to develop experimental strategies using selected patient biospecimens stored in tissue banks. For cancer, high-quality biospecimens collected at diagnosis, first relapse, and various treatment stages provide crucial resources for study designs. To enlarge biospecimen collections, patient education regarding the value of specimen donation is vital. One approach for increasing consent is to offer publicly available illustrations and game-like engagements demonstrating how wider sample availability facilitates development of novel therapies. The critical value of tissue bank samples, bioinformatics, and EMRs in the early stages of the biomarker discovery process for personalized medicine is often overlooked. The data obtained also require cross-disciplinary collaborations to translate experimental results into clinical practice and diagnostic and prognostic use in personalized medicine.
Affiliation(s)
- K. Stephen Suh
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
- Sreeja Sarojini
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
- Maher Youssif
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
- Kip Nalley
  - Sophic Systems Alliance Inc., 20271 Goldenrod Lane, Germantown, MD 20876, USA
- Natasha Milinovikj
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
- Fathi Elloumi
  - Sophic Systems Alliance Inc., 20271 Goldenrod Lane, Germantown, MD 20876, USA
- Steven Russell
  - Siemens Corporate Research, IT Platforms, Princeton, NJ 08540, USA
- Andrew Pecora
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
- Andre Goy
  - The Genomics and Biomarkers Program, The John Theurer Cancer Center at Hackensack, University Medical Center, D. Jurist Research Building, 40 Prospect Avenue, Hackensack, NJ 07601, USA
|
19
|
Theobald CN, Stover DG, Choma NN, Hathaway J, Green JK, Peterson NB, Sponsler KC, Vasilevskis EE, Kripalani S, Sergent J, Brown NJ, Denny JC. The effect of reducing maximum shift lengths to 16 hours on internal medicine interns' educational opportunities. Academic Medicine: Journal of the Association of American Medical Colleges 2013; 88:512-518. [PMID: 23425987 PMCID: PMC3638874 DOI: 10.1097/acm.0b013e318285800f] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 06/01/2023]
Abstract
PURPOSE To evaluate educational experiences of internal medicine interns before and after maximum shift lengths were decreased from 30 hours to 16 hours. METHOD The authors compared educational experiences of internal medicine interns at Vanderbilt University Medical Center before (2010; 47 interns) and after (2011; 50 interns) duty hours restrictions were implemented in July 2011. The authors compared number of inpatient encounters, breadth of concepts in notes, exposure to five common presenting problems, procedural experience, and attendance at teaching conferences. RESULTS Following the duty hours restrictions, interns cared for more unique patients (mean 118 versus 140 patients per intern, P = .005) and wrote more history and physicals (mean 73 versus 88, P = .005). Documentation included more total concepts after the 16-hour maximum shift implementation, with a 14% increase for history and physicals (338 versus 387, P < .001) and a 10% increase for progress notes (316 versus 349, P < .001). There was no difference in the median number of selected procedures performed (6 versus 6, P = 0.94). Attendance was higher at the weekly chief resident conference (60% versus 68% of expected attendees, P < .001) but unchanged at morning report conferences (79% versus 78%, P = .49). CONCLUSIONS Intern clinical exposure did not decrease after implementation of the 16-hour shift length restriction. In fact, interns saw more patients, produced more detailed notes, and attended more conferences following duty hours restrictions.
Affiliation(s)
- Cecelia N Theobald
  - Department of Medicine, School of Medicine, Vanderbilt University, Nashville, Tennessee 37212, USA
|
20
|
Ritchie MD, Denny JC, Zuvich RL, Crawford DC, Schildcrout JS, Bastarache L, Ramirez AH, Mosley JD, Pulley JM, Basford MA, Bradford Y, Rasmussen LV, Pathak J, Chute CG, Kullo IJ, McCarty CA, Chisholm RL, Kho AN, Carlson CS, Larson EB, Jarvik GP, Sotoodehnia N, Manolio TA, Li R, Masys DR, Haines JL, Roden DM. Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk. Circulation 2013; 127:1377-85. [PMID: 23463857 DOI: 10.1161/circulationaha.112.000604] [Citation(s) in RCA: 148] [Impact Index Per Article: 13.5] [Indexed: 12/16/2022]
Abstract
BACKGROUND ECG QRS duration, a measure of cardiac intraventricular conduction, varies ≈2-fold in individuals without cardiac disease. Slow conduction may promote re-entrant arrhythmias. METHODS AND RESULTS We performed a genome-wide association study to identify genomic markers of QRS duration in 5272 individuals without cardiac disease selected from electronic medical record algorithms at 5 sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium QRS genome-wide association study meta-analysis. Twenty-three single-nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 single-nucleotide polymorphisms were in the chromosome 3 SCN5A and SCN10A loci, where the most significant single-nucleotide polymorphisms were rs1805126 in SCN5A with P=1.2×10⁻⁸ (eMERGE) and P=2.5×10⁻²⁰ (CHARGE) and rs6795970 in SCN10A with P=6×10⁻⁶ (eMERGE) and P=5×10⁻²⁷ (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies on variants in these 5 loci in 13859 European Americans to search for diagnoses associated with these markers. Phenome-wide association study identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5272 "heart-healthy" study population. CONCLUSIONS We conclude that DNA biobanks coupled to electronic medical records not only provide a platform for genome-wide association study but also may allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The phenome-wide association study approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
Affiliation(s)
- Marylyn D Ritchie
  - Department of Biochemistry and Molecular Biology, Pennsylvania State University, University Park, PA, USA
|
21
|
Abstract
The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
Affiliation(s)
- Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, United States of America.
22
Liu M, McPeek Hinz ER, Matheny ME, Denny JC, Schildcrout JS, Miller RA, Xu H. Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. J Am Med Inform Assoc 2012; 20:420-6. [PMID: 23161894 DOI: 10.1136/amiajnl-2012-001119] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Indexed: 11/04/2022]
Abstract
OBJECTIVE Medication safety requires that each drug be monitored throughout its market life as early detection of adverse drug reactions (ADRs) can lead to alerts that prevent patient harm. Recently, electronic medical records (EMRs) have emerged as a valuable resource for pharmacovigilance. This study examines the use of retrospective medication orders and inpatient laboratory results documented in the EMR to identify ADRs. METHODS Using 12 years of EMR data from Vanderbilt University Medical Center (VUMC), we designed a study to correlate abnormal laboratory results with specific drug administrations by comparing the outcomes of a drug-exposed group and a matched unexposed group. We assessed the relative merits of six pharmacovigilance measures used in spontaneous reporting systems (SRSs): proportional reporting ratio (PRR), reporting OR (ROR), Yule's Q (YULE), the χ² test (CHI), Bayesian confidence propagation neural networks (BCPNN), and a gamma Poisson shrinker (GPS). RESULTS We systematically evaluated the methods on two independently constructed reference standard datasets of drug-event pairs. The dataset of Yoon et al contained 470 drug-event pairs (10 drugs and 47 laboratory abnormalities). Using VUMC's EMR, we created another dataset of 378 drug-event pairs (nine drugs and 42 laboratory abnormalities). Evaluation on our reference standard showed that CHI, ROR, PRR, and YULE all had the same F score (62%). When the reference standard of Yoon et al was used, ROR had the best F score of 68%, with 77% precision and 61% recall. CONCLUSIONS Results suggest that EMR-derived laboratory measurements and medication orders can help to validate previously reported ADRs, and detect new ADRs.
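Four of the six measures named in this abstract (PRR, ROR, Yule's Q, and the χ² test) can be computed from a single 2×2 table of exposed/unexposed by event/no-event counts. The sketch below uses the standard textbook definitions; the abstract does not give the authors' exact formulation, so the function names, the argument layout, and the example counts are all hypothetical. BCPNN and GPS are Bayesian shrinkage methods and are not covered here. The functions do not guard against zero cells.

```python
# Textbook disproportionality measures for one drug-event pair.
# Layout of the 2x2 table (an assumption for illustration, not from the paper):
#   a = exposed,   abnormal lab result
#   b = exposed,   no abnormal result
#   c = unexposed, abnormal lab result
#   d = unexposed, no abnormal result


def prr(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio: event rate in exposed / event rate in unexposed."""
    return (a / (a + b)) / (c / (c + d))


def ror(a: int, b: int, c: int, d: int) -> float:
    """Reporting odds ratio: odds of the event in exposed / odds in unexposed."""
    return (a * d) / (b * c)


def yule_q(a: int, b: int, c: int, d: int) -> float:
    """Yule's Q, a rescaling of the odds ratio onto the interval [-1, 1]."""
    return (a * d - b * c) / (a * d + b * c)


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 table, without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    # Hypothetical counts: 100 exposed and 100 unexposed patients.
    counts = (30, 70, 10, 90)
    for measure in (prr, ror, yule_q, chi_square):
        print(f"{measure.__name__}: {measure(*counts):.3f}")
```

Yule's Q is a monotone transform of the ROR, Q = (ROR - 1) / (ROR + 1), so the two measures always rank drug-event pairs identically. That offers one possible reading of why several measures tie on F score in the reported results.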
Affiliation(s)
- Mei Liu
- Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
23
Denny JC, Choma NN, Peterson JF, Miller RA, Bastarache L, Li M, Peterson NB. Natural language processing improves identification of colorectal cancer testing in the electronic medical record. Med Decis Making 2012; 32:188-197. [PMID: 21393557 PMCID: PMC9616628 DOI: 10.1177/0272989x11400418] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Indexed: 09/16/2023]
Abstract
BACKGROUND Difficulty identifying patients in need of colorectal cancer (CRC) screening contributes to low screening rates. OBJECTIVE To use Electronic Health Record (EHR) data to identify patients with prior CRC testing. DESIGN A clinical natural language processing (NLP) system was modified to identify 4 CRC tests (colonoscopy, flexible sigmoidoscopy, fecal occult blood testing, and double contrast barium enema) within electronic clinical documentation. Text phrases in clinical notes referencing CRC tests were interpreted by the system to determine whether testing was planned or completed and to estimate the date of completed tests. SETTING Large academic medical center. PATIENTS 200 patients ≥ 50 years old who had completed ≥ 2 non-acute primary care visits within a 1-year period. MEASURES Recall and precision of the NLP system, billing records, and human chart review were compared to a reference standard of human review of all available information sources. RESULTS For identification of all CRC tests, recall and precision were as follows: NLP system (recall 93%, precision 94%), chart review (74%, 98%), and billing records review (44%, 83%). Recall and precision for identification of patients in need of screening were: NLP system (recall 95%, precision 88%), chart review (99%, 82%), and billing records (99%, 67%). LIMITATIONS Small sample size and requirement for a robust EHR. CONCLUSIONS Applying NLP to EHR records detected more CRC tests than either manual chart review or billing records review alone. NLP had better precision but marginally lower recall to identify patients who were due for CRC screening than billing record review.
Affiliation(s)
- Joshua C. Denny
- Division of General Internal Medicine and Public Health, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Neesha N. Choma
- Division of General Internal Medicine and Public Health, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
- Veterans Administration, Tennessee Valley Healthcare System, Tennessee Valley Geriatric Research Education Clinical Center (GRECC), Nashville, Tennessee
- Josh F. Peterson
- Division of General Internal Medicine and Public Health, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Veterans Administration, Tennessee Valley Healthcare System, Tennessee Valley Geriatric Research Education Clinical Center (GRECC), Nashville, Tennessee
- Randolph A. Miller
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Lisa Bastarache
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee
- Ming Li
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee
- Neeraja B. Peterson
- Division of General Internal Medicine and Public Health, Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee
24
Stenner SP, Johnson KB, Denny JC. PASTE: patient-centered SMS text tagging in a medication management system. J Am Med Inform Assoc 2011; 19:368-74. [PMID: 21984605 DOI: 10.1136/amiajnl-2011-000484] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/03/2022]
Abstract
OBJECTIVE To evaluate the performance of a system that extracts medication information and administration-related actions from patient short message service (SMS) messages. DESIGN Mobile technologies provide a platform for electronic patient-centered medication management. MyMediHealth (MMH) is a medication management system that includes a medication scheduler, a medication administration record, and a reminder engine that sends text messages to cell phones. The object of this work was to extend MMH to allow two-way interaction using mobile phone-based SMS technology. Unprompted text-message communication with patients using natural language could engage patients in their healthcare, but presents unique natural language processing challenges. The authors developed a new functional component of MMH, the Patient-centered Automated SMS Tagging Engine (PASTE). The PASTE web service uses natural language processing methods, custom lexicons, and existing knowledge sources to extract and tag medication information from patient text messages. MEASUREMENTS A pilot evaluation of PASTE was completed using 130 medication messages anonymously submitted by 16 volunteers via a website. System output was compared with manually tagged messages. RESULTS Verified medication names, medication terms, and action terms reached high F-measures of 91.3%, 94.7%, and 90.4%, respectively. The overall medication name F-measure was 79.8%, and the medication action term F-measure was 90%. CONCLUSION Other studies have demonstrated systems that successfully extract medication information from clinical documents using semantic tagging, regular expression-based approaches, or a combination of both approaches. This evaluation demonstrates the feasibility of extracting medication information from patient-generated medication messages.
Affiliation(s)
- Shane P Stenner
- Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, Tennessee, USA.
25
Xu H, AbdelRahman S, Lu Y, Denny JC, Doan S. Applying semantic-based probabilistic context-free grammar to medical language processing--a preliminary study on parsing medication sentences. J Biomed Inform 2011; 44:1068-75. [PMID: 21856440 DOI: 10.1016/j.jbi.2011.08.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/29/2011] [Revised: 07/26/2011] [Accepted: 08/07/2011] [Indexed: 11/20/2022]
Abstract
Semantic-based sublanguage grammars have been shown to be an efficient method for medical language processing. However, given the complexity of the medical domain, parsers using such grammars inevitably encounter ambiguous sentences, which could be interpreted by different groups of production rules and consequently result in two or more parse trees. One possible solution, which has not been extensively explored previously, is to augment productions in medical sublanguage grammars with probabilities to resolve the ambiguity. In this study, we associated probabilities with production rules in a semantic-based grammar for medication findings and evaluated its performance on reducing parsing ambiguity. Using the existing data set from 2009 i2b2 NLP (Natural Language Processing) challenge for medication extraction, we developed a semantic-based CFG (Context Free Grammar) for parsing medication sentences and manually created a Treebank of 4564 medication sentences from discharge summaries. Using the Treebank, we derived a semantic-based PCFG (Probabilistic Context Free Grammar) for parsing medication sentences. Our evaluation using a 10-fold cross validation showed that the PCFG parser dramatically improved parsing performance when compared to the CFG parser.
Affiliation(s)
- Hua Xu
- Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, TN 37232, USA.
26
Kho AN, Pacheco JA, Peissig PL, Rasmussen L, Newton KM, Weston N, Crane PK, Pathak J, Chute CG, Bielinski SJ, Kullo IJ, Li R, Manolio TA, Chisholm RL, Denny JC. Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med 2011; 3:79re1. [PMID: 21508311 PMCID: PMC3690272 DOI: 10.1126/scitranslmed.3001807] [Citation(s) in RCA: 246] [Impact Index Per Article: 18.9] [Indexed: 11/02/2022]
Abstract
Clinical data in electronic medical records (EMRs) are a potential source of longitudinal clinical data for research. The Electronic Medical Records and Genomics Network (eMERGE) investigates whether data captured through routine clinical care using EMRs can identify disease phenotypes with sufficient positive and negative predictive values for use in genome-wide association studies (GWAS). Using data from five different sets of EMRs, we have identified five disease phenotypes with positive predictive values of 73 to 98% and negative predictive values of 98 to 100%. Most EMRs captured key information (diagnoses, medications, laboratory tests) used to define phenotypes in a structured format. We identified natural language processing as an important tool to improve case identification rates. Efforts and incentives to increase the implementation of interoperable EMRs will markedly improve the availability of clinical data for genomics research.
Affiliation(s)
- Abel N Kho
- Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.
27
Wilke RA, Xu H, Denny JC, Roden DM, Krauss RM, McCarty CA, Davis RL, Skaar T, Lamba J, Savova G. The emerging role of electronic medical records in pharmacogenomics. Clin Pharmacol Ther 2011; 89:379-86. [PMID: 21248726 DOI: 10.1038/clpt.2010.260] [Citation(s) in RCA: 130] [Impact Index Per Article: 10.0] [Indexed: 01/24/2023]
Abstract
Health-care information technology and genotyping technology are both advancing rapidly, creating new opportunities for medical and scientific discovery. The convergence of these two technologies is now facilitating genetic association studies of unprecedented size within the context of routine clinical care. As a result, the medical community will soon be presented with a number of novel opportunities to bring functional genomics to the bedside in the area of pharmacotherapy. By linking biological material to comprehensive medical records, large multi-institutional biobanks are now poised to advance the field of pharmacogenomics through three distinct mechanisms: (i) retrospective assessment of previously known findings in a clinical practice-based setting, (ii) discovery of new associations in huge observational cohorts, and (iii) prospective application in a setting capable of providing real-time decision support. This review explores each of these translational mechanisms within a historical framework.
Affiliation(s)
- R A Wilke
- Department of Medicine, Division of Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, Tennessee, USA.
28
Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, Pulley JM, Basford MA, Masys DR, Haines JL, Roden DM. Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation 2010; 122:2016-21. [PMID: 21041692 PMCID: PMC2991609 DOI: 10.1161/circulationaha.110.948828] [Citation(s) in RCA: 107] [Impact Index Per Article: 7.6] [Indexed: 01/01/2023]
Abstract
BACKGROUND Recent genome-wide association studies in which selected community populations are used have identified genomic signals in SCN10A influencing PR duration. The extent to which this can be demonstrated in cohorts derived from electronic medical records is unknown. METHODS AND RESULTS We performed a genome-wide association study on 2334 European American patients with normal ECGs without evidence of prior heart disease from the Vanderbilt DNA databank, BioVU, which accrues subjects from routine patient care. Subjects were identified by combinations of natural language processing, laboratory queries, and billing code queries of deidentified medical record data. Subjects were 58% female, of mean (± SD) age 54 ± 15 years, and had mean PR intervals of 158 ± 18 ms. Genotyping was performed with the use of the Illumina Human660W-Quad platform. Our results identify 4 single nucleotide polymorphisms (rs6800541, rs6795970, rs6798015, rs7430477) linked to SCN10A associated with PR interval (P=5.73 × 10⁻⁷ to 1.78 × 10⁻⁶). CONCLUSIONS This genome-wide association study confirms a gene heretofore not implicated in cardiac pathophysiology as a modulator of PR interval in humans. This study is one of the first replication genome-wide association studies performed with the use of an electronic medical records-derived cohort, supporting their further use for genotype-phenotype analyses.
Affiliation(s)
- Joshua C Denny
- Office of Personalized Medicine, Vanderbilt University School of Medicine, Nashville, TN 37232-0575, USA
29
Xu H, Lu Y, Jiang M, Liu M, Denny JC, Dai Q, Peterson NB. Mining Biomedical Literature for Terms related to Epidemiologic Exposures. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2010; 2010:897-901. [PMID: 21347108 PMCID: PMC3041399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
Epidemiologic studies contribute greatly to evidence-based medicine by identifying risk factors for diseases and determining optimal treatments for clinical practice. However, there has been very limited effort on automatic extraction of knowledge from epidemiologic articles, such as exposures, outcomes, and their relations. In this initial study, we developed a system that consists of a natural language processing (NLP) engine and a rule-based classifier, to automatically extract exposure-related terms from titles of epidemiologic articles. The evaluation using 450 titles annotated by an epidemiologist showed the highest F-measure of 0.646 (Precision 0.610 and Recall 0.688) using inexact matching, which indicated the feasibility of automated methods on mining epidemiologic literature. Further analysis of terms related to epidemiologic exposures suggested that although UMLS would have reasonable coverage, more appropriate semantic classifications of epidemiologic exposures would be required.
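The F-measure quoted here, and in several other abstracts in this list, combines precision and recall. The abstract does not say which variant was used; the sketch below assumes the balanced F1 (the harmonic mean of the two), which is the usual default. The function name and the `beta` parameter are illustrative choices, not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the balanced F1 (harmonic mean).

    Returns 0.0 when both precision and recall are zero, to avoid a
    division by zero.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as published in the abstract above.
    print(f"F1 = {f_measure(0.610, 0.688):.4f}")
```

Applied to the rounded precision and recall in this abstract, the balanced F1 comes out at about 0.647, which agrees with the reported 0.646 to within the rounding of the published inputs.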
Affiliation(s)
- Hua Xu
- Department of Biomedical Informatics
30
Ramirez AH, Schildcrout JS, Blakemore DL, Masys DR, Pulley JM, Basford MA, Roden DM, Denny JC. Modulators of normal electrocardiographic intervals identified in a large electronic medical record. Heart Rhythm 2010; 8:271-7. [PMID: 21044898 DOI: 10.1016/j.hrthm.2010.10.034] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Received: 08/06/2010] [Accepted: 10/26/2010] [Indexed: 10/18/2022]
Abstract
BACKGROUND Traditional electrocardiographic (ECG) reference ranges were derived from studies in communities or clinical trial populations. The distribution of ECG parameters in a large population presenting to a healthcare system has not been studied. OBJECTIVE The purpose of this study was to define the contribution of age, race, gender, height, body mass index, and type 2 diabetes mellitus to normal ECG parameters in a population presenting to a healthcare system. METHODS Study subjects were obtained from the Vanderbilt Synthetic Derivative, a de-identified image of the electronic medical record (EMR), containing more than 20 years of records on 1.7 million subjects. We identified 63,177 unique subjects with an ECG that was read as "normal" by the reviewing cardiologist. Using combinations of natural language processing and laboratory and billing code queries, we identified a subset of 32,949 subjects without cardiovascular disease, interfering medications, or abnormal electrolytes. The ethnic makeup was 77% Caucasian, 13% African American, 1% Hispanic, 1% Asian, and 8% unknown. RESULTS The range that included 95% of normal PR intervals was 125-196 ms, QRS 69-103 ms, QT interval corrected with Bazett formula 365-458 ms, and heart rate 54-96 bpm. Linear regression modeling of patient characteristic effects reproduced known age and gender effects and identified novel associations with race, body mass index, and type 2 diabetes mellitus. A web-based application for patient-specific normal ranges is available online at http://biostat.mc.vanderbilt.edu/ECGPredictionInterval. CONCLUSION Analysis of a large set of EMR-derived normal ECGs reproduced known associations, found new relationships, and established patient-specific normal ranges. Such knowledge informs clinical and genetic research and may improve understanding of normal cardiac physiology.
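The abstract reports a heart-rate-corrected QT range obtained with the Bazett formula, QTc = QT / √RR, where RR is the interval between beats in seconds (60 divided by heart rate in bpm). The sketch below shows that correction and a check against the population-wide 95% range quoted in the abstract (365-458 ms). The function and constant names are hypothetical. It is an illustration only: it does not reproduce the patient-specific ranges from the authors' regression model and is not a clinical tool.

```python
import math

# Population-wide 95% range for Bazett-corrected QT, as quoted in the abstract.
QTC_REPORTED_RANGE_MS = (365.0, 458.0)


def qtc_bazett(qt_ms: float, heart_rate_bpm: float) -> float:
    """Bazett correction: QTc = QT / sqrt(RR), with RR in seconds."""
    if qt_ms <= 0 or heart_rate_bpm <= 0:
        raise ValueError("QT interval and heart rate must be positive")
    rr_seconds = 60.0 / heart_rate_bpm
    return qt_ms / math.sqrt(rr_seconds)


def within_reported_range(qtc_ms: float) -> bool:
    """True if the corrected QT falls inside the range quoted in the abstract."""
    low, high = QTC_REPORTED_RANGE_MS
    return low <= qtc_ms <= high


if __name__ == "__main__":
    # Hypothetical measurement: QT of 400 ms at 75 bpm.
    qtc = qtc_bazett(400.0, 75.0)
    print(f"QTc = {qtc:.1f} ms, within reported range: {within_reported_range(qtc)}")
```

At 60 bpm the RR interval is exactly one second, so the correction leaves the measured QT unchanged; faster rates shorten RR and raise QTc.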
Affiliation(s)
- Andrea H Ramirez
- Department of Medicine, Vanderbilt University, Nashville, Tennessee 37232, USA
31
Denny JC, Peterson JF, Choma NN, Xu H, Miller RA, Bastarache L, Peterson NB. Extracting timing and status descriptors for colonoscopy testing from electronic medical records. J Am Med Inform Assoc 2010; 17:383-8. [PMID: 20595304 DOI: 10.1136/jamia.2010.004804] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Indexed: 11/04/2022]
Abstract
Colorectal cancer (CRC) screening rates are low despite confirmed benefits. The authors investigated the use of natural language processing (NLP) to identify previous colonoscopy screening in electronic records from a random sample of 200 patients at least 50 years old. The authors developed algorithms to recognize temporal expressions and 'status indicators', such as 'patient refused', or 'test scheduled'. The new methods were added to the existing KnowledgeMap concept identifier system, and the resulting system was used to parse electronic medical records (EMR) to detect completed colonoscopies. Using as the 'gold standard' expert physicians' manual review of EMR notes, the system identified timing references with a recall of 0.91 and precision of 0.95, colonoscopy status indicators with a recall of 0.82 and precision of 0.95, and references to actually completed colonoscopies with recall of 0.93 and precision of 0.95. The system was superior to using colonoscopy billing codes alone. Health services researchers and clinicians may find NLP a useful adjunct to traditional methods to detect CRC screening status. Further investigations must validate extension of NLP approaches for other types of CRC screening applications.
Affiliation(s)
- Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA.
32
Liu M, Denny JC, Mani S, Chen Y, Hu Y, Xu H. Identifying potential drugs that induce QT prolongation using electronic medical records. BMC Bioinformatics 2010. [PMCID: PMC3290073 DOI: 10.1186/1471-2105-11-s4-p2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
33
Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, Pulley JM, Basford MA, Brown-Gentry K, Balser JR, Masys DR, Haines JL, Roden DM. Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 2010; 86:560-72. [PMID: 20362271 DOI: 10.1016/j.ajhg.2010.03.003] [Citation(s) in RCA: 255] [Impact Index Per Article: 18.2] [Received: 12/11/2009] [Revised: 02/18/2010] [Accepted: 03/01/2010] [Indexed: 11/20/2022]
Abstract
Large-scale DNA databanks linked to electronic medical record (EMR) systems have been proposed as an approach for rapidly generating large, diverse cohorts for discovery and replication of genotype-phenotype associations. However, the extent to which such resources are capable of delivering on this promise is unknown. We studied whether an EMR-linked DNA biorepository can be used to detect known genotype-phenotype associations for five diseases. Twenty-one SNPs previously implicated as common variants predisposing to atrial fibrillation, Crohn disease, multiple sclerosis, rheumatoid arthritis, or type 2 diabetes were successfully genotyped in 9483 samples accrued over 4 mo into BioVU, the Vanderbilt University Medical Center DNA biobank. Previously reported odds ratios (OR(PR)) ranged from 1.14 to 2.36. For each phenotype, natural language processing techniques and billing-code queries were used to identify cases (n = 70-698) and controls (n = 808-3818) from deidentified health records. Each of the 21 tests of association yielded point estimates in the expected direction. Previous genotype-phenotype associations were replicated (p < 0.05) in 8/14 cases when the OR(PR) was > 1.25, and in 0/7 with lower OR(PR). Statistically significant associations were detected in all analyses that were adequately powered. In each of the five diseases studied, at least one previously reported association was replicated. These data demonstrate that phenotypes representing clinical diagnoses can be extracted from EMR systems, and they support the use of DNA resources coupled to EMR systems as tools for rapid generation of large data sets required for replication of associations found in research cohorts and for discovery in genome science.
Affiliation(s)
- Marylyn D Ritchie
- Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
34
Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. J Am Med Inform Assoc 2010; 17:19-24. [PMID: 20064797 DOI: 10.1197/jamia.m3378] [Citation(s) in RCA: 303] [Impact Index Per Article: 21.6] [Indexed: 11/10/2022]
Abstract
Medication information is one of the most important types of clinical data in electronic medical records. It is critical for healthcare safety and quality, as well as for clinical research that uses electronic medical record data. However, medication data are often recorded in clinical notes as free-text. As such, they are not accessible to other computerized applications that rely on coded data. We describe a new natural language processing system (MedEx), which extracts medication information from clinical notes. MedEx was initially developed using discharge summaries. An evaluation using a data set of 50 discharge summaries showed it performed well on identifying not only drug names (F-measure 93.2%), but also signature information, such as strength, route, and frequency, with F-measures of 94.5%, 93.9%, and 96.0% respectively. We then applied MedEx unchanged to outpatient clinic visit notes. It performed similarly with F-measures over 90% on a set of 25 clinic visit notes.
Affiliation(s)
- Hua Xu
- Department of Biomedical Informatics, Vanderbilt University, School of Medicine, Nashville, Tennessee 37232, USA.
35
Denny JC, Bastarache L, Sastre EA, Spickard A. Tracking medical students' clinical experiences using natural language processing. J Biomed Inform 2009; 42:781-9. [PMID: 19236956 PMCID: PMC5490452 DOI: 10.1016/j.jbi.2009.02.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 06/04/2008] [Revised: 02/10/2009] [Accepted: 02/13/2009] [Indexed: 10/21/2022]
Abstract
Graduate medical students must demonstrate competency in clinical skills. Current tracking methods rely either on manual efforts or on simple electronic entry to record clinical experience. We evaluated automated methods to locate 10 institution-defined core clinical problems from three medical students' clinical notes (n=290). Each note was processed with section header identification algorithms and the KnowledgeMap concept identifier to locate Unified Medical Language System (UMLS) concepts. The best performing automated search strategies accurately classified documents containing primary discussions to the core clinical problems with area under receiver operator characteristic curve of 0.90-0.94. Recall and precision for UMLS concept identification was 0.91 and 0.92, respectively. Of the individual note section, concepts found within the chief complaint, history of present illness, and assessment and plan were the strongest predictors of relevance. This automated method of tracking can provide detailed, pertinent reports of clinical experience that does not require additional work from medical trainees. The coupling of section header identification and concept identification holds promise for other natural language processing tasks, such as clinical research or phenotype identification.
Affiliation(s)
- Joshua C Denny
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Eskind Biomedical Library, Room 442, 2209 Garland Ave., Nashville, TN 37232, USA.
36
Discerning tumor status from unstructured MRI reports--completeness of information in existing reports and utility of automated natural language processing. J Digit Imaging 2009; 23:119-32. [PMID: 19484309 PMCID: PMC2837158 DOI: 10.1007/s10278-009-9215-7] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Received: 01/06/2009] [Revised: 04/07/2009] [Accepted: 05/02/2009] [Indexed: 11/24/2022]
Abstract
Information in electronic medical records is often in an unstructured free-text format. This format presents challenges for expedient data retrieval and may fail to convey important findings. Natural language processing (NLP) is an emerging technique for rapid and efficient clinical data retrieval. While proven in disease detection, the utility of NLP in discerning disease progression from free-text reports is untested. We aimed to (1) assess whether unstructured radiology reports contained sufficient information for tumor status classification; (2) develop an NLP-based data extraction tool to determine tumor status from unstructured reports; and (3) compare NLP and human tumor status classification outcomes. Consecutive follow-up brain tumor magnetic resonance imaging reports (2000–2007) from a tertiary center were manually annotated using consensus guidelines on tumor status. Reports were randomized to NLP training (70%) or testing (30%) groups. The NLP tool utilized a support vector machines model with statistical and rule-based outcomes. Most reports had sufficient information for tumor status classification, although 0.8% did not describe status despite reference to prior examinations. Tumor size was unreported in 68.7% of documents, while 50.3% lacked data on change magnitude when there was detectable progression or regression. Using retrospective human classification as the gold standard, NLP achieved 80.6% sensitivity and 91.6% specificity for tumor status determination (mean positive predictive value, 82.4%; negative predictive value, 92.0%). In conclusion, most reports contained sufficient information for tumor status determination, though variable features were used to describe status. NLP demonstrated good accuracy for tumor status classification and may have novel application for automated disease status classification from electronic databases.