1
Hosseini M, Rasekh AH, Keshavarzi A. Improving clinical abbreviation sense disambiguation using attention-based Bi-LSTM and hybrid balancing techniques in imbalanced datasets. J Eval Clin Pract 2024; 30:1327-1336. [PMID: 39031903 DOI: 10.1111/jep.14041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Revised: 04/29/2024] [Accepted: 05/21/2024] [Indexed: 07/22/2024]
Abstract
RATIONALE Clinical abbreviations pose a challenge for clinical decision support systems due to their ambiguity. Additionally, clinical datasets often suffer from class imbalance, hindering the classification of such data. This imbalance leads to classifiers with low accuracy and high error rates. Traditional feature-engineered models struggle with this task, and class imbalance is a known factor that reduces the performance of neural network techniques. AIMS AND OBJECTIVES This study proposes an attention-based bidirectional long short-term memory (Bi-LSTM) model to improve clinical abbreviation disambiguation in clinical documents. We aim to address the challenges of limited training data and class imbalance by employing data generation techniques like reverse substitution and data augmentation with synonym substitution. METHOD We utilise a Bi-LSTM classification model with an attention mechanism to disambiguate each abbreviation. The model's performance is evaluated based on accuracy for each abbreviation. To address the limitations of imbalanced data, we employ data generation techniques to create a more balanced dataset. RESULTS The evaluation results demonstrate that our data balancing technique significantly improves the model's accuracy by 2.08%. Furthermore, the proposed attention-based Bi-LSTM model achieves an accuracy of 96.09% on the UMN dataset, outperforming state-of-the-art results. CONCLUSION Deep neural network methods, particularly Bi-LSTM, offer promising alternatives to traditional feature-engineered models for clinical abbreviation disambiguation. By employing data generation techniques, we can address the challenges posed by limited-resource and imbalanced clinical datasets. This approach leads to a significant improvement in model accuracy for clinical abbreviation disambiguation tasks.
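The abstract names reverse substitution as a data generation technique but does not describe how it was implemented. The sketch below only illustrates the general idea under that assumption: sentences containing a known long form are rewritten to use the abbreviation, and the long form is kept as the sense label. The function name `reverse_substitute`, the sample sentences, and the sense list are hypothetical.

```python
import re

def reverse_substitute(sentences, abbreviation, long_forms):
    """Build (text, sense) pairs by replacing each long form with the abbreviation.

    Illustrative sketch only; not the authors' implementation.
    """
    examples = []
    for sentence in sentences:
        for long_form in long_forms:
            # Match the long form as a whole phrase, ignoring case.
            pattern = re.compile(r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE)
            if pattern.search(sentence):
                examples.append((pattern.sub(abbreviation, sentence), long_form))
    return examples

# Hypothetical sentences and senses, for illustration only.
sentences = [
    "Patient has a history of multiple sclerosis.",
    "Altered mental status noted on admission.",
    "No acute distress.",
]
examples = reverse_substitute(sentences, "MS", ["multiple sclerosis", "mental status"])
for text, sense in examples:
    print(f"{sense}: {text}")
```

Sentences without any long form produce no example, so the number of generated instances per sense depends on how often each long form occurs in the source text.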
Affiliation(s)
- Manda Hosseini
  - Department of Computer Engineering, Zand Institute of Higher Education, Shiraz, Iran
- Amir Hossein Rasekh
  - Department of Computer Engineering, Zand Institute of Higher Education, Shiraz, Iran
- Amin Keshavarzi
  - Department of Computer Engineering, Marvdasht Branch, Islamic Azad University, Marvdasht, Iran

2
Ramoeletsi S, Tlou B. Challenges of clinical accompaniment amongst undergraduate nursing students: University of KwaZulu-Natal. Health SA 2024; 29:2535. [PMID: 39114334 PMCID: PMC11304205 DOI: 10.4102/hsag.v29i0.2535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Accepted: 02/19/2024] [Indexed: 08/10/2024] Open
Abstract
Background Clinical accompaniment is an activity predominantly supervised by the clinical facilitator to develop the skills of the students. In South Africa, clinical accompaniment aims to develop the skills of the students to equip them in delivering efficient health services to the patients. Previous studies revealed that students experienced challenges and were negatively affected due to inadequate clinical accompaniment in the learning practice. Aim The aim was to determine the challenges faced by University of KwaZulu-Natal (UKZN) undergraduate nursing students during their clinical accompaniment. Methods An observational cross-sectional study design with an analytic component was implemented. Questionnaires were used to collect data. Of the 400 registered nursing students, 245 were undergraduates; of these, 241 consented to participate in this study. Data were captured in SPSS Statistics version 28. ANOVA was used to compare challenges amongst participants. A p-value less than 0.05 was considered significant. Results A total of 241 participants responded to the questionnaires, which yielded a response rate of 98.4%. This study comprised first-year (32.4%), second-year (32.8%) and third-year (34.9%) students. There was no significant difference in challenges amongst first-, second- and third-year participants (p=0.592). Conclusion This study revealed the challenges faced by undergraduate nursing students during their clinical accompaniment. Contribution Study results might assist in developing effective guidelines to resolve the challenges encountered by students.
Affiliation(s)
- Seaka Ramoeletsi
  - Department of Public Health, Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban, South Africa
- Boikhutso Tlou
  - Department of Public Health, Discipline of Public Health Medicine, University of KwaZulu-Natal, Durban, South Africa

3
Chen F, Zhang G, Chen S, Callahan T, Weng C. Clinical Note Structural Knowledge Improves Word Sense Disambiguation. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2024; 2024:515-524. [PMID: 38827062 PMCID: PMC11141859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Clinical notes are full of ambiguous medical abbreviations. Contextual knowledge has been leveraged by recent learning-based approaches for sense disambiguation. Previous findings indicated that structural elements of clinical notes entail useful characteristics for informing different interpretations of abbreviations, yet they have remained underutilized and have not been fully investigated. To the best of our knowledge, the only study exploring note structures simply enumerated the headers in the notes, and such representations are not semantically meaningful. This paper describes a learning-based approach using the note structure represented by the semantic types predefined in the Unified Medical Language System (UMLS). We evaluated the representation in addition to the widely used N-gram with three learning models on two different datasets. Experiments indicate that our feature augmentation consistently improved model performance for abbreviation disambiguation, with an optimal F1 score of 0.93.
Affiliation(s)
- Fangyi Chen
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Gongbo Zhang
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Si Chen
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Tiffany Callahan
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA

4
Schneider CV, Li T, Zhang D, Mezina AI, Rattan P, Huang H, Creasy KT, Scorletti E, Zandvakili I, Vujkovic M, Hehl L, Fiksel J, Park J, Wangensteen K, Risman M, Chang KM, Serper M, Carr RM, Schneider KM, Chen J, Rader DJ. Large-scale identification of undiagnosed hepatic steatosis using natural language processing. EClinicalMedicine 2023; 62:102149. [PMID: 37599905 PMCID: PMC10432816 DOI: 10.1016/j.eclinm.2023.102149] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/14/2023] [Revised: 07/24/2023] [Accepted: 07/25/2023] [Indexed: 08/22/2023] Open
Abstract
Background Nonalcoholic fatty liver disease (NAFLD) is a major cause of liver-related morbidity in people with and without diabetes, but it is underdiagnosed, posing challenges for research and clinical management. Here, we determine if natural language processing (NLP) of data in the electronic health record (EHR) could identify undiagnosed patients with hepatic steatosis based on pathology and radiology reports. Methods A rule-based NLP algorithm was built using a Linguamatics literature text mining tool to search 2.15 million pathology reports and 2.7 million imaging reports in the Penn Medicine EHR from November 2014 through December 2020 for evidence of hepatic steatosis. For quality control, two independent physicians manually reviewed randomly chosen biopsy and imaging reports (n = 353, PPV 99.7%). Findings After exclusion of individuals with other causes of hepatic steatosis, 3007 patients with biopsy-proven NAFLD and 42,083 patients with imaging-proven NAFLD were identified. Interestingly, elevated ALT was not a sensitive predictor of the presence of steatosis, and only half of the biopsied patients with steatosis ever received an ICD diagnosis code for the presence of NAFLD/NASH. There was a robust association for PNPLA3 and TM6SF2 risk alleles and steatosis identified by NLP. We identified 234 disorders that were significantly over- or underrepresented in all subjects with steatosis and identified changes in serum markers (e.g., GGT) associated with presence of steatosis. Interpretation This study demonstrates clear feasibility of NLP-based approaches to identify patients whose steatosis was indicated in imaging and pathology reports within a large healthcare system and uncovers undercoding of NAFLD in the general population. Identification of patients at risk could link them to improved care and outcomes.
Funding The study was funded by US and German funding sources that provided financial support only and had no influence or control over the research process.
Affiliation(s)
- Carolin V. Schneider
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Medicine III, RWTH Aachen University, Aachen, Germany
- Tang Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- David Zhang
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Anya I. Mezina
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Puru Rattan
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Helen Huang
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kate Townsend Creasy
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Eleonora Scorletti
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Inuk Zandvakili
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Division of Digestive Diseases, Department of Internal Medicine, College of Medicine, University of Cincinnati, Cincinnati, OH 45267, USA
- Marijana Vujkovic
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA 19104, USA
- Leonida Hehl
  - Department of Medicine III, RWTH Aachen University, Aachen, Germany
- Jacob Fiksel
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Joseph Park
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kirk Wangensteen
  - Department of Medicine, Division of Gastroenterology and Hepatology, Mayo Clinic, Rochester, MN 55902, USA
- Marjorie Risman
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kyong-Mi Chang
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA 19104, USA
- Marina Serper
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA 19104, USA
- Rotonya M. Carr
  - Department of Medicine, Division of Gastroenterology, University of Washington, Seattle, WA 98195, USA
- Jinbo Chen
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniel J. Rader
  - Division of Translational Medicine and Human Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

5
Rajkomar A, Loreaux E, Liu Y, Kemp J, Li B, Chen MJ, Zhang Y, Mohiuddin A, Gottweis J. Deciphering clinical abbreviations with a privacy protecting machine learning system. Nat Commun 2022; 13:7456. [PMID: 36460656 PMCID: PMC9718734 DOI: 10.1038/s41467-022-35007-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/07/2022] [Accepted: 11/15/2022] [Indexed: 12/05/2022] Open
Abstract
Physicians write clinical notes with abbreviations and shorthand that are difficult to decipher. Abbreviations can be clinical jargon (writing "HIT" for "heparin induced thrombocytopenia"), ambiguous terms that require expertise to disambiguate (using "MS" for "multiple sclerosis" or "mental status"), or domain-specific vernacular ("cb" for "complicated by"). Here we train machine learning models on public web data to decode such text by replacing abbreviations with their meanings. We report a single translation model that simultaneously detects and expands thousands of abbreviations in real clinical notes with accuracies ranging from 92.1%-97.1% on multiple external test datasets. The model equals or exceeds the performance of board-certified physicians (97.6% vs 88.7% total accuracy). Our results demonstrate a general method to contextually decipher abbreviations and shorthand that is built without any privacy-compromising data.
Affiliation(s)
- Alvin Rajkomar
  - Google, Mountain View, CA, USA
- Eric Loreaux
  - Google, Mountain View, CA, USA
- Yuchen Liu
  - Google, Mountain View, CA, USA
- Jonas Kemp
  - Google, Mountain View, CA, USA
- Benny Li
  - Google, Mountain View, CA, USA
- Ming-Jun Chen
  - Google, Mountain View, CA, USA
- Yi Zhang
  - Google, Mountain View, CA, USA
- Afroz Mohiuddin
  - Google, Mountain View, CA, USA
- Juraj Gottweis
  - Google, Mountain View, CA, USA

6
Wu DW, Bernstein JA, Bejerano G. Discovering monogenic patients with a confirmed molecular diagnosis in millions of clinical notes with MonoMiner. Genet Med 2022; 24:2091-2102. [PMID: 35976265 DOI: 10.1016/j.gim.2022.07.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/31/2022] [Revised: 07/04/2022] [Accepted: 07/05/2022] [Indexed: 11/19/2022] Open
Abstract
PURPOSE Cohort building is a powerful foundation for improving clinical care, performing biomedical research, recruiting for clinical trials, and many other applications. We set out to build a cohort of all monogenic patients with a definitive causal gene diagnosis in a 3-million patient hospital system. METHODS We define a subset (4461) of OMIM diseases that have at least 1 known monogenic causal gene. We then introduce MonoMiner, a natural language processing framework to identify molecularly confirmed monogenic patients from free-text clinical notes. RESULTS We show that ICD-10-CM codes cover only a fraction of monogenic diseases and that even where available, ICD-10-CM code‒based patient retrieval offers 0.14 precision. Searching by causal gene symbol offers great recall but has an even worse 0.07 precision. MonoMiner achieves 6 to 11 times higher precision (0.80), with 0.87 precision on disease diagnosis alone, tagging 4259 patients with 560 monogenic diseases and 534 causal genes, at 0.48 recall. CONCLUSION MonoMiner enables the discovery of a large, high-precision cohort of patients with monogenic diseases with an established molecular diagnosis, empowering numerous downstream uses. Because it relies solely on clinical notes, MonoMiner is highly portable, and its approach is adaptable to other domains and languages.
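The "6 to 11 times higher precision" figure in the results can be checked against the precision values quoted in the same abstract. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Precision values as reported in the abstract.
baseline_precision = {"ICD-10-CM codes": 0.14, "causal gene symbol": 0.07}
monominer_precision = 0.80

# Ratio of MonoMiner precision to each baseline retrieval strategy.
ratios = {name: monominer_precision / p for name, p in baseline_precision.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

The ratios come out at roughly 5.7 and 11.4, consistent with the rounded "6 to 11 times" stated in the abstract.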
Affiliation(s)
- David Wei Wu
  - Department of Computer Science, Stanford University School of Engineering, Stanford, CA; Medical Scientist Training Program, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA
- Gill Bejerano
  - Department of Computer Science, Stanford University School of Engineering, Stanford, CA; Department of Pediatrics, Stanford University School of Medicine, Stanford, CA; Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA; Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA.

7
Link NB, Huang S, Cai T, Sun J, Dahal K, Costa L, Cho K, Liao K, Cai T, Hong C. Binary acronym disambiguation in clinical notes from electronic health records with an application in computational phenotyping. Int J Med Inform 2022; 162:104753. [PMID: 35405530 DOI: 10.1016/j.ijmedinf.2022.104753] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/18/2022] [Revised: 03/11/2022] [Accepted: 03/27/2022] [Indexed: 01/05/2023]
Abstract
OBJECTIVE The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings) and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce a semi-supervised method for binary acronym disambiguation, the task of classifying a target sense for acronyms in the clinical EHR notes. METHODS We developed a semi-supervised ensemble machine learning (CASEml) algorithm to automatically identify when an acronym means a target sense by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline metric selecting the most frequent acronym sense. Along with evaluating the performance of these methods for specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis. RESULTS CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. As well, we demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis. CONCLUSION CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
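The baseline "selecting the most frequent acronym sense" mentioned in the methods can be sketched as follows. This is a generic majority-class baseline written for illustration, not the authors' evaluation code; the function name and the toy labels (including "right atrium" as an alternative sense of RA) are assumptions.

```python
from collections import Counter

def most_frequent_sense_accuracy(train_labels, test_labels):
    """Predict the most common training sense for every test instance."""
    majority_sense, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_sense)
    return majority_sense, correct / len(test_labels)

# Hypothetical labels for the acronym "RA", for illustration only.
train = ["rheumatoid arthritis"] * 7 + ["right atrium"] * 3
test = ["rheumatoid arthritis"] * 4 + ["right atrium"]

sense, accuracy = most_frequent_sense_accuracy(train, test)
print(sense, accuracy)
```

Because the baseline ignores context entirely, its accuracy equals the share of test instances that carry the majority sense, which is why it is a useful floor for any context-aware method.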
Affiliation(s)
- Nicholas B Link
  - VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
- Sicong Huang
  - VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States
- Tianrun Cai
  - VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States
- Jiehuan Sun
  - VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Kumar Dahal
  - VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States
- Lauren Costa
  - VA Boston Healthcare System, Boston, MA, United States
- Kelly Cho
  - VA Boston Healthcare System, Boston, MA, United States
- Katherine Liao
  - VA Boston Healthcare System, Boston, MA, United States; Division of Rheumatology, Immunology, and Allergy, Brigham and Women's Hospital, Boston, MA, United States
- Tianxi Cai
  - VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Chuan Hong
  - VA Boston Healthcare System, Boston, MA, United States; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States

8
Kelly J, Wang C, Zhang J, Das S, Ren A, Warnekar P. Automated Mapping of Real-world Oncology Laboratory Data to LOINC. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2021:611-620. [PMID: 35308998 PMCID: PMC8861721] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
In this study we seek to determine the efficacy of using automated mapping methods to reduce the manual mapping burden of laboratory data to LOINC® on a nationwide, electronic health record-derived, oncology-specific dataset. We developed novel encoding methodologies to vectorize free-text lab data, and evaluated logistic regression, random forest, and k-nearest neighbors (kNN) machine learning classifiers. All machine learning models did significantly better than deterministic baseline algorithms. The best classifiers were random forests, which predicted the correct LOINC code 94.5% of the time. Ensemble classifiers further increased accuracy, with the best ensemble classifier predicting the same code 80.5% of the time with an accuracy of 99%. We conclude that by using an automated laboratory mapping model we can both reduce manual mapping time and increase the quality of mappings, suggesting automated mapping is a viable tool in a real-world oncology dataset.
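One plausible reading of the ensemble result, assumed here rather than stated in the abstract, is that the ensemble accepts a prediction only when its member classifiers agree, so it reports both how often they agree and how accurate the agreed predictions are. The sketch below computes those two quantities for two classifiers; the function name and the placeholder codes are hypothetical and are not real LOINC codes.

```python
def agreement_ensemble(predictions_a, predictions_b, gold):
    """Return (agreement rate, accuracy on the records where both classifiers agree)."""
    agreed = [(a, g) for a, b, g in zip(predictions_a, predictions_b, gold) if a == b]
    coverage = len(agreed) / len(gold)
    accuracy = sum(1 for p, g in agreed if p == g) / len(agreed) if agreed else 0.0
    return coverage, accuracy

# Placeholder codes, for illustration only.
pred_a = ["X1", "X2", "X3", "X4", "X5"]
pred_b = ["X1", "X2", "X9", "X4", "X5"]
gold = ["X1", "X2", "X3", "X4", "X7"]

coverage, accuracy = agreement_ensemble(pred_a, pred_b, gold)
print(coverage, accuracy)
```

Under this reading, records where the classifiers disagree would be routed to manual mapping, trading coverage for higher accuracy on the automatically mapped portion.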
Affiliation(s)
- Chen Wang
  - Georgetown University, Washington, D.C.
- Anna Ren
  - Flatiron Health Inc, New York, New York

9
Grossman Liu L, Grossman RH, Mitchell EG, Weng C, Natarajan K, Hripcsak G, Vawdrey DK. A deep database of medical abbreviations and acronyms for natural language processing. Sci Data 2021; 8:149. [PMID: 34078918 PMCID: PMC8172575 DOI: 10.1038/s41597-021-00929-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 11/16/2020] [Accepted: 04/27/2021] [Indexed: 12/05/2022] Open
Abstract
The recognition, disambiguation, and expansion of medical abbreviations and acronyms is of utmost importance to prevent medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness or coverage of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6–14% increase in abbreviation coverage; 28–52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.
Measurement(s): Controlled Vocabulary • Linguistic Form
Technology Type(s): digital curation • data combination
Sample Characteristic - Location: United States of America
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.14068949
Affiliation(s)
- Lisa Grossman Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA.
- Elliot G Mitchell
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- David K Vawdrey
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA; Steele Institute for Health Innovation, Geisinger, Danville, PA, USA

10
Van Vleck TT, Chan L, Coca SG, Craven CK, Do R, Ellis SB, Kannry JL, Loos RJF, Bonis PA, Cho J, Nadkarni GN. Augmented intelligence with natural language processing applied to electronic health records for identifying patients with non-alcoholic fatty liver disease at risk for disease progression. Int J Med Inform 2019; 129:334-341. [PMID: 31445275 PMCID: PMC6717556 DOI: 10.1016/j.ijmedinf.2019.06.028] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 03/12/2019] [Revised: 05/20/2019] [Accepted: 06/28/2019] [Indexed: 02/08/2023]
Abstract
OBJECTIVE Electronic health record (EHR) systems contain structured data (such as diagnostic codes) and unstructured data (clinical documentation). Clinical insights can be derived from analyzing both. The use of natural language processing (NLP) algorithms to effectively analyze unstructured data has been well demonstrated. Here we examine the utility of NLP for the identification of patients with non-alcoholic fatty liver disease, assess patterns of disease progression, and identify gaps in care related to breakdown in communication among providers. MATERIALS AND METHODS All clinical notes available on the 38,575 patients enrolled in the Mount Sinai BioMe cohort were loaded into the NLP system. We compared analysis of structured and unstructured EHR data using NLP, free-text search, and diagnostic codes with validation against expert adjudication. We then used the NLP findings to measure physician impression of progression from early-stage NAFLD to NASH or cirrhosis. Similarly, we used the same NLP findings to identify mentions of NAFLD in radiology reports that did not persist into clinical notes. RESULTS Out of 38,575 patients, we identified 2,281 patients with NAFLD. From the remainder, 10,653 patients with similar data density were selected as a control group. NLP outperformed ICD and text search in both sensitivity (NLP: 0.93, ICD: 0.28, text search: 0.81) and F2 score (NLP: 0.92, ICD: 0.34, text search: 0.81). Of 2281 NAFLD patients, 673 (29.5%) were believed to have progressed to NASH or cirrhosis. Among 176 where NAFLD was noted prior to NASH, the average progression time was 410 days. 619 (27.1%) NAFLD patients had it documented only in radiology notes and not acknowledged in other forms of clinical documentation. Of these, 170 (28.4%) were later identified as having likely developed NASH or cirrhosis after a median 1057.3 days. DISCUSSION NLP-based approaches were more accurate at identifying NAFLD within the EHR than ICD/text search-based approaches. 
Suspected NAFLD on imaging is often not acknowledged in subsequent clinical documentation. Many such patients are later found to have more advanced liver disease. Analysis of information flows demonstrated loss of key information that could have been used to help prevent the progression of early NAFLD (NAFL) to NASH or cirrhosis. CONCLUSION For identification of NAFLD, NLP performed better than alternative selection modalities. It then facilitated analysis of knowledge flow between physicians and enabled the identification of breakdowns where key information was lost that could have slowed or prevented later disease progression.
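The F2 score reported in the results is the F-beta measure with beta = 2, which weights recall (sensitivity) more heavily than precision. The sketch below implements the standard F-beta formula; the function name and the example inputs are illustrative and are not taken from the study.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical inputs showing how beta = 2 rewards recall over precision.
high_recall = f_beta(precision=0.5, recall=1.0, beta=2)
high_precision = f_beta(precision=1.0, recall=0.5, beta=2)
print(round(high_recall, 3), round(high_precision, 3))
```

With beta = 1 both example systems would score the same, which is why F2 is a natural choice when missing a true case is considered costlier than a false alarm.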
Affiliation(s)
- Tielman T Van Vleck
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA.
| | - Lili Chan
- Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Steven G Coca
- Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Catherine K Craven
- Institute for Healthcare Delivery Science, Dept. of Pop. Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, USA; Clinical Informatics Group, IT Department, Mount Sinai Health System, New York, USA
| | - Ron Do
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Stephen B Ellis
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Joseph L Kannry
- Information Technology, Mount Sinai Medical Center, New York, USA
| | - Ruth J F Loos
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Peter A Bonis
- Division of Gastroenterology, Tufts Medical Center, Boston, USA
| | - Judy Cho
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA; Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, New York, USA; Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York, USA
| | - Girish N Nadkarni
- The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, USA; Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, USA.
|
11
|
Grossman LV, Mitchell EG, Hripcsak G, Weng C, Vawdrey DK. A method for harmonization of clinical abbreviation and acronym sense inventories. J Biomed Inform 2018; 88:62-69. [PMID: 30414475 DOI: 10.1016/j.jbi.2018.11.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/04/2018] [Revised: 10/24/2018] [Accepted: 11/05/2018] [Indexed: 11/15/2022]
Abstract
BACKGROUND Previous research has developed methods to construct acronym sense inventories from a single institutional corpus. Although beneficial, a sense inventory constructed from a single institutional corpus is not generalizable, because acronyms from different geographic regions and medical specialties vary greatly. OBJECTIVE Develop an automated method to harmonize sense inventories from different regions and specialties towards the development of a comprehensive inventory. METHODS The method involves integrating multiple source sense inventories into one centralized inventory and cross-mapping redundant entries to establish synonymy. To evaluate our method, we integrated 8 well-known source inventories into one comprehensive inventory (or metathesaurus). For both the metathesaurus and its sources, we evaluated the coverage of acronyms and their senses on a corpus of 1 million clinical notes. The corpus came from a different institution, region, and specialty than the source inventories. RESULTS In the evaluation using clinical notes, the metathesaurus demonstrated an acronym (short form) micro-coverage of 94.3%, representing a substantial increase over the two next largest source inventories, the UMLS LRABR (74.8%) and ADAM (68.0%). The metathesaurus demonstrated a sense (long form) micro-coverage of 99.6%, again a substantial increase compared to the UMLS LRABR (82.5%) and ADAM (55.4%). CONCLUSIONS Given the high coverage, harmonizing acronym sense inventories is a promising methodology to improve their comprehensiveness. Our method is automated, leverages the extensive resources already devoted to developing institution-specific inventories in the United States, and may help generalize sense inventories to institutions who lack the resources to develop them. Future work should address quality issues in source inventories and explore additional approaches to establishing synonymy.
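The merge-then-measure idea in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: inventories are modeled as plain dictionaries from short form to a set of long forms, synonymy is approximated by simple case normalization (the published method cross-maps entries in a more elaborate way), and micro-coverage is taken to mean the share of acronym occurrences in a corpus whose short form appears in the inventory. The names `harmonize` and `micro_coverage` and the toy data are hypothetical.

```python
def harmonize(*inventories):
    """Merge several sense inventories into one.

    Assumption: redundant entries are detected only by case-insensitive
    matching, a stand-in for the cross-mapping step in the abstract.
    """
    merged = {}
    for inventory in inventories:
        for short_form, long_forms in inventory.items():
            senses = merged.setdefault(short_form.upper(), set())
            senses.update(long_form.lower() for long_form in long_forms)
    return merged


def micro_coverage(occurrences, inventory):
    """Fraction of acronym occurrences whose short form is in the inventory.

    Assumption: 'micro' means every occurrence counts, so frequent
    acronyms weigh more than rare ones.
    """
    if not occurrences:
        return 0.0
    covered = sum(1 for short_form in occurrences if short_form in inventory)
    return covered / len(occurrences)


# Toy source inventories (illustrative, not taken from the paper).
inventory_a = {"MI": {"myocardial infarction"}, "PT": {"physical therapy"}}
inventory_b = {
    "mi": {"Myocardial Infarction", "mitral insufficiency"},
    "RA": {"rheumatoid arthritis"},
}
merged = harmonize(inventory_a, inventory_b)

# Acronym occurrences as they might be extracted from clinical notes.
occurrences = ["MI", "MI", "PT", "RA", "CABG"]
coverage_a = micro_coverage(occurrences, inventory_a)
coverage_merged = micro_coverage(occurrences, merged)
```

On this toy data the merged inventory covers more occurrences than either source alone, which mirrors the direction of the reported result without reproducing its numbers.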
Affiliation(s)
- Lisa V Grossman
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; College of Physicians and Surgeons, Columbia University, New York, NY, USA.
| | - Elliot G Mitchell
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - David K Vawdrey
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; Value Institute, NewYork-Presbyterian Hospital, New York, NY, USA
|
12
|
Lee J, Song HJ, Yoon E, Park SB, Park SH, Seo JW, Park P, Choi J. Automated extraction of Biomarker information from pathology reports. BMC Med Inform Decis Mak 2018; 18:29. [PMID: 29783980 PMCID: PMC5963015 DOI: 10.1186/s12911-018-0609-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/12/2017] [Accepted: 04/27/2018] [Indexed: 02/06/2023] Open
Abstract
Background Pathology reports are written in free-text form, which precludes efficient data gathering. We aimed to overcome this limitation and design an automated system for extracting biomarker profiles from accumulated pathology reports. Methods We designed a new data model for representing biomarker knowledge. The automated system parses immunohistochemistry reports based on a “slide paragraph” unit defined as a set of immunohistochemistry findings obtained for the same tissue slide. Pathology reports are parsed using context-free grammar for immunohistochemistry, and using a tree-like structure for surgical pathology. The performance of the approach was validated on manually annotated pathology reports of 100 randomly selected patients managed at Seoul National University Hospital. Results High F-scores were obtained for parsing biomarker name and corresponding test results (0.999 and 0.998, respectively) from the immunohistochemistry reports, compared to relatively poor performance for parsing surgical pathology findings. However, applying the proposed approach to our single-center dataset revealed information on 221 unique biomarkers, which represents a richer result than biomarker profiles obtained based on the published literature. Owing to the data representation model, the proposed approach can associate biomarker profiles extracted from an immunohistochemistry report with corresponding pathology findings listed in one or more surgical pathology reports. Term variations are resolved by normalization to corresponding preferred terms determined by expanded dictionary look-up and text similarity-based search. Conclusions Our proposed approach for biomarker data extraction addresses key limitations regarding data representation and can handle reports prepared in the clinical setting, which often contain incomplete sentences, typographical errors, and inconsistent formatting. 
Electronic supplementary material: The online version of this article (10.1186/s12911-018-0609-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jeongeun Lee
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul, Republic of Korea
| | - Hyun-Je Song
- School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
| | - Eunsil Yoon
- PAS1 team, TmaxSoft, Gyeonggi-do, Republic of Korea
| | - Seong-Bae Park
- School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea
| | - Sung-Hye Park
- Department of Pathology, College of Medicine, Seoul National University, Seoul, Republic of Korea
| | - Jeong-Wook Seo
- Department of Pathology, College of Medicine, Seoul National University, Seoul, Republic of Korea
| | - Peom Park
- Department of Industrial Engineering, Ajou University, Suwon, Republic of Korea
| | - Jinwook Choi
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul, Republic of Korea; Department of Biomedical Engineering, College of Medicine, Seoul National University, Seoul, Republic of Korea.
|
13
|
Finley GP, Pakhomov SVS, McEwan R, Melton GB. Towards Comprehensive Clinical Abbreviation Disambiguation Using Machine-Labeled Training Data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:560-569. [PMID: 28269852 PMCID: PMC5333249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Abbreviation disambiguation in clinical texts is a problem handled well by fully supervised machine learning methods. Acquiring training data, however, is expensive and would be impractical for large numbers of abbreviations in specialized corpora. An alternative is a semi-supervised approach, in which training data are automatically generated by substituting long forms in natural text with their corresponding abbreviations. Most prior implementations of this method either focus on very few abbreviations or do not test on real-world data. We present a realistic use case by testing several semi-supervised classification algorithms on a large hand-annotated medical record of occurrences of 74 ambiguous abbreviations. Despite notable differences between training and test corpora, classifiers achieve up to 90% accuracy. Our tests demonstrate that semi-supervised abbreviation disambiguation is a viable and extensible option for medical NLP systems.
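The substitution step described in this abstract, replacing long forms in natural text with their abbreviations so that the replaced long form serves as the sense label, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `reverse_substitute`, the long-form-to-abbreviation mapping, and the example sentence are all hypothetical, and matching is done with a simple case-insensitive whole-phrase regular expression.

```python
import re


def reverse_substitute(sentence, long_to_abbr):
    """Generate machine-labeled examples from one sentence.

    For every occurrence of a known long form, emit a copy of the
    sentence with that single occurrence replaced by its abbreviation,
    together with the abbreviation and the long form as the sense label.
    """
    examples = []
    for long_form, abbr in long_to_abbr.items():
        # Assumption: whole-phrase, case-insensitive match is good enough.
        pattern = re.compile(r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(sentence):
            text = sentence[:match.start()] + abbr + sentence[match.end():]
            examples.append((text, abbr, long_form))
    return examples


# Hypothetical mapping: one ambiguous abbreviation with two senses.
long_to_abbr = {
    "myocardial infarction": "MI",
    "mitral insufficiency": "MI",
}
sentence = "Patient has a history of myocardial infarction and mitral insufficiency."
examples = reverse_substitute(sentence, long_to_abbr)
```

Each generated tuple can then be fed to a classifier as a labeled instance. As the abstract notes, text generated this way may differ from text in which clinicians actually wrote the abbreviation, so evaluation on hand-annotated data remains necessary.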
Affiliation(s)
| | - Serguei V S Pakhomov
- Institute for Health Informatics; College of Pharmacy University of Minnesota, Minneapolis, MN
|
14
|
Workman TE, Weir C, Rindflesch TC. Differentiating Sense through Semantic Interaction Data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:1238-1247. [PMID: 28269921 PMCID: PMC5333208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Words which have different representations but are semantically related, such as dementia and delirium, can pose difficult issues in understanding text. We explore the use of interaction frequency data between semantic elements as a means to differentiate concept pairs, using semantic predications extracted from the biomedical literature. We applied datasets of features drawn from semantic predications for semantically related pairs to two Expectation Maximization clustering processes (without, and with concept labels), then used all data to train and evaluate several concept classifying algorithms. For the unlabeled datasets, 80% displayed expected cluster count and similar or matching proportions; all labeled data exhibited similar or matching proportions when restricting cluster count to unique labels. The highest performing classifier achieved 89% accuracy, with F1 scores for individual concept classification ranging from 0.69 to 1. We conclude with a discussion on how these findings may be applied to natural language processing of clinical text.
Affiliation(s)
- T Elizabeth Workman
- VA Salt Lake City Health Care, Salt Lake City, Utah; Division of Epidemiology, University of Utah, Salt Lake City, UT
| | - Charlene Weir
- VA Salt Lake City Health Care, Salt Lake City, Utah; Department of Biomedical Informatics, University of Utah, Salt Lake City, UT
| | - Thomas C Rindflesch
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, National Institutes of Health, Bethesda, MD
|
15
|
Névéol A, Zweigenbaum P. Clinical Natural Language Processing in 2015: Leveraging the Variety of Texts of Clinical Interest. Yearb Med Inform 2016; 25:234-239. [PMID: 27830256 PMCID: PMC5171575 DOI: 10.15265/iy-2016-049] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 01/29/2023] Open
Abstract
OBJECTIVE To summarize recent research and present a selection of the best papers published in 2015 in the field of clinical Natural Language Processing (NLP). METHOD A systematic review of the literature was performed by the two section editors of the IMIA Yearbook NLP section by searching bibliographic databases with a focus on NLP efforts applied to clinical texts or aimed at a clinical outcome. Section editors first selected a shortlist of candidate best papers that were then peer-reviewed by independent external reviewers. RESULTS The clinical NLP best paper selection shows that clinical NLP is making use of a variety of texts of clinical interest to contribute to the analysis of clinical information and the building of a body of clinical knowledge. The full review process highlighted five papers analyzing patient-authored texts or seeking to connect and aggregate multiple sources of information. They provide a contribution to the development of methods, resources, applications, and sometimes a combination of these aspects. CONCLUSIONS The field of clinical NLP continues to thrive through the contributions of both NLP researchers and healthcare professionals interested in applying NLP techniques to impact clinical practice. Foundational progress in the field makes it possible to leverage a larger variety of texts of clinical interest for healthcare purposes.
Affiliation(s)
- A Névéol
- Aurélie Névéol, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France
| | - P Zweigenbaum
- Pierre Zweigenbaum, LIMSI CNRS UPR 3251, Université Paris Saclay, Rue John von Neumann, 91400 Orsay, France
|
16
|
Cheng TO. Acrimonious acronymania. Int J Cardiol 2015; 201:663-7. [DOI: 10.1016/j.ijcard.2015.08.137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|