1
McMurry AJ, Zipursky AR, Geva A, Olson KL, Jones JR, Ignatov V, Miller TA, Mandl KD. Moving Biosurveillance Beyond Coded Data Using AI for Symptom Detection From Physician Notes: Retrospective Cohort Study. J Med Internet Res 2024; 26:e53367. [PMID: 38573752 PMCID: PMC11027052 DOI: 10.2196/53367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/30/2023] [Accepted: 02/27/2024] [Indexed: 04/05/2024]
Abstract
BACKGROUND Real-time surveillance of emerging infectious diseases necessitates a dynamically evolving, computable case definition, which frequently incorporates symptom-related criteria. For symptom detection, both population health monitoring platforms and research initiatives primarily depend on structured data extracted from electronic health records. OBJECTIVE This study sought to validate and test an artificial intelligence (AI)-based natural language processing (NLP) pipeline for detecting COVID-19 symptoms from physician notes in pediatric patients. We specifically study patients presenting to the emergency department (ED) who can be sentinel cases in an outbreak. METHODS Subjects in this retrospective cohort study are patients who are 21 years of age and younger, who presented to a pediatric ED at a large academic children's hospital between March 1, 2020, and May 31, 2022. The ED notes for all patients were processed with an NLP pipeline tuned to detect the mention of 11 COVID-19 symptoms based on Centers for Disease Control and Prevention (CDC) criteria. For a gold standard, 3 subject matter experts labeled 226 ED notes and had strong agreement (F1-score=0.986; positive predictive value [PPV]=0.972; and sensitivity=1.0). F1-score, PPV, and sensitivity were used to compare the performance of both NLP and the International Classification of Diseases, 10th Revision (ICD-10) coding to the gold standard chart review. As a formative use case, variations in symptom patterns were measured across SARS-CoV-2 variant eras. RESULTS There were 85,678 ED encounters during the study period, including 4% (n=3420) with patients with COVID-19. NLP was more accurate at identifying encounters with patients that had any of the COVID-19 symptoms (F1-score=0.796) than ICD-10 codes (F1-score=0.451). NLP accuracy was higher for positive symptoms (sensitivity=0.930) than ICD-10 (sensitivity=0.300). However, ICD-10 accuracy was higher for negative symptoms (specificity=0.994) than NLP (specificity=0.917). Congestion or runny nose showed the highest accuracy difference (NLP: F1-score=0.828 and ICD-10: F1-score=0.042). For encounters with patients with COVID-19, prevalence estimates of each NLP symptom differed across variant eras. Patients with COVID-19 were more likely to have each NLP symptom detected than patients without this disease. Effect sizes (odds ratios) varied across pandemic eras. CONCLUSIONS This study establishes the value of AI-based NLP as a highly effective tool for real-time COVID-19 symptom detection in pediatric patients, outperforming traditional ICD-10 methods. It also reveals the evolving nature of symptom prevalence across different virus variants, underscoring the need for dynamic, technology-driven approaches in infectious disease surveillance.
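The agreement figures quoted in this abstract can be cross-checked, because the F1-score is the harmonic mean of PPV and sensitivity. The following is a minimal Python sketch, not code from the study; the function name `f1_score` is our own. It recomputes the annotator-agreement F1 from the reported PPV and sensitivity.

```python
def f1_score(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of positive predictive value (precision) and sensitivity (recall)."""
    if ppv + sensitivity == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Inter-annotator agreement values as reported in the abstract.
agreement_f1 = f1_score(ppv=0.972, sensitivity=1.0)
print(round(agreement_f1, 3))
```

Rounded to three decimals, the result matches the F1-score of 0.986 reported for the gold standard labeling.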
Affiliation(s)
- Andrew J McMurry
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
- Amy R Zipursky
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Division of Pediatric Emergency Medicine, Department of Pediatrics, The Hospital for Sick Children, Toronto, ON, Canada
- Alon Geva
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Division of Critical Care Medicine, Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, MA, United States
  - Department of Anaesthesia, Harvard Medical School, Boston, MA, United States
- Karen L Olson
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
- James R Jones
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
- Vladimir Ignatov
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
- Timothy A Miller
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
- Kenneth D Mandl
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, United States
  - Department of Pediatrics, Harvard Medical School, Boston, MA, United States
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
2
Xie F, Chang J, Luong T, Wu B, Lustigova E, Shrader E, Chen W. Identifying Symptoms Prior to Pancreatic Ductal Adenocarcinoma Diagnosis in Real-World Care Settings: Natural Language Processing Approach. JMIR AI 2024; 3:e51240. [PMID: 38875566 PMCID: PMC11041417 DOI: 10.2196/51240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 12/08/2023] [Accepted: 12/16/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND Pancreatic cancer is the third leading cause of cancer deaths in the United States. Pancreatic ductal adenocarcinoma (PDAC) is the most common form of pancreatic cancer, accounting for up to 90% of all cases. Patient-reported symptoms are often the triggers of cancer diagnosis and therefore, understanding the PDAC-associated symptoms and the timing of symptom onset could facilitate early detection of PDAC. OBJECTIVE This paper aims to develop a natural language processing (NLP) algorithm to capture symptoms associated with PDAC from clinical notes within a large integrated health care system. METHODS We used unstructured data within 2 years prior to PDAC diagnosis between 2010 and 2019 and among matched patients without PDAC to identify 17 PDAC-related symptoms. Related terms and phrases were first compiled from publicly available resources and then recursively reviewed and enriched with input from clinicians and chart review. A computerized NLP algorithm was iteratively developed and fine-trained via multiple rounds of chart review followed by adjudication. Finally, the developed algorithm was applied to the validation data set to assess performance and to the study implementation notes. RESULTS A total of 408,147 and 709,789 notes were retrieved from 2611 patients with PDAC and 10,085 matched patients without PDAC, respectively. In descending order, the symptom distribution of the study implementation notes ranged from 4.98% for abdominal or epigastric pain to 0.05% for upper extremity deep vein thrombosis in the PDAC group, and from 1.75% for back pain to 0.01% for pale stool in the non-PDAC group. Validation of the NLP algorithm against adjudicated chart review results of 1000 notes showed that precision ranged from 98.9% (jaundice) to 84% (upper extremity deep vein thrombosis), recall ranged from 98.1% (weight loss) to 82.8% (epigastric bloating), and F1-scores ranged from 0.97 (jaundice) to 0.86 (depression). 
CONCLUSIONS The developed and validated NLP algorithm could be used for the early detection of PDAC.
Affiliation(s)
- Fagen Xie
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Jenny Chang
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Tiffany Luong
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Bechien Wu
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Eva Lustigova
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Eva Shrader
  - Pancreatic Cancer Action Network, Manhattan Beach, CA, United States
- Wansu Chen
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
3
Hanson RF, Zhu V, Are F, Espeleta H, Wallis E, Heider P, Kautz M, Lenert L. Initial development of tools to identify child abuse and neglect in pediatric primary care. BMC Med Inform Decis Mak 2023; 23:266. [PMID: 37978498 PMCID: PMC10656827 DOI: 10.1186/s12911-023-02361-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2022] [Accepted: 11/02/2023] [Indexed: 11/19/2023]
Abstract
BACKGROUND Child abuse and neglect (CAN) is prevalent, associated with long-term adversities, and often undetected. Primary care settings offer a unique opportunity to identify CAN and facilitate referrals, when warranted. Electronic health records (EHR) contain extensive information to support healthcare decisions, yet time constraints preclude most providers from thorough EHR reviews that could indicate CAN. Strategies that summarize EHR data to identify CAN and convey this to providers have the potential to mitigate CAN-related sequelae. This study used expert review/consensus and natural language processing (NLP) to develop and test a lexicon to characterize children who have experienced or are at risk for CAN and compared machine learning methods to the lexicon + NLP approach to determine the algorithm's performance for identifying CAN. METHODS Study investigators identified 90 CAN terms and invited an interdisciplinary group of child abuse experts for review and validation. We then used NLP to develop pipelines to finalize the CAN lexicon. Data for pipeline development and refinement were drawn from a randomly selected sample of EHR from patients seen at pediatric primary care clinics within a U.S. academic health center. To explore a machine learning approach for CAN identification, we used Support Vector Machine (SVM) algorithms. RESULTS The investigator-generated list of 90 CAN terms was reviewed and validated by 25 invited experts, resulting in a final pool of 133 terms. NLP utilized a randomly selected sample of 14,393 clinical notes from 153 patients to test the lexicon, and 0.03% of notes were identified as CAN positive. CAN identification varied by clinical note type, with few differences found by provider type (physicians versus nurses, social workers, etc.). An evaluation of the final NLP pipelines indicated 93.8% positive CAN rate for the training set and 71.4% for the test set, with decreased precision attributed primarily to false positives. For the machine learning approach, SVM pipeline performance was 92% for CAN+ and 100% for non-CAN, indicating higher sensitivity than specificity. CONCLUSIONS The NLP algorithm's development and refinement suggest that innovative tools can identify youth at risk for CAN. The next key step is to refine the NLP algorithm to eventually funnel this information to care providers to guide clinical decision making.
Affiliation(s)
- Vivienne Zhu
  - Medical University of South Carolina, Charleston, SC, USA
- Paul Heider
  - Medical University of South Carolina, Charleston, SC, USA
- Marin Kautz
  - Medical University of South Carolina, Charleston, SC, USA
- Leslie Lenert
  - Medical University of South Carolina, Charleston, SC, USA
4
Michalski AA, Lis K, Stankiewicz J, Kloska SM, Sycz A, Dudziński M, Muras-Szwedziak K, Nowicki M, Bazan-Socha S, Dabrowski MJ, Basak GW. Supporting the Diagnosis of Fabry Disease Using a Natural Language Processing-Based Approach. J Clin Med 2023; 12:jcm12103599. [PMID: 37240705 DOI: 10.3390/jcm12103599] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/10/2023] [Revised: 05/01/2023] [Accepted: 05/15/2023] [Indexed: 05/28/2023]
Abstract
In clinical practice, the consideration of non-specific symptoms of rare diseases in order to make a correct and timely diagnosis is often challenging. To support physicians, we developed a decision-support scoring system on the basis of retrospective research. Based on the literature and expert knowledge, we identified clinical features typical for Fabry disease (FD). Natural language processing (NLP) was used to evaluate patients' electronic health records (EHRs) to obtain detailed information about FD-specific patient characteristics. The NLP-determined elements, laboratory test results, and ICD-10 codes were transformed and grouped into pre-defined FD-specific clinical features that were scored according to their significance among the signs of FD. The sum of clinical feature scores constituted the FD risk score. Then, medical records of patients with the highest FD risk scores were reviewed by physicians, who decided whether to refer a patient for additional tests. One patient who obtained a high FD risk score was referred for a DBS assay and confirmed to have FD. The presented NLP-based decision-support scoring system achieved an AUC of 0.998, which demonstrates that the applied approach enables accurate identification of FD-suspected patients with high discrimination power.
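The scoring step described in this abstract (grouped clinical features, each carrying a score, summed into a risk score and ranked for physician review) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the feature names, weights, and the helpers `fd_risk_score` and `rank_patients` are hypothetical, since the abstract does not list the actual features or their scores.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical feature weights for illustration only; the abstract does not
# state which features were used or how they were scored.
FEATURE_SCORES: Dict[str, float] = {
    "angiokeratoma": 3.0,
    "acroparesthesia": 2.0,
    "proteinuria": 1.0,
}


def fd_risk_score(features_present: Iterable[str]) -> float:
    """Sum the scores of the distinct clinical features found for one patient."""
    return sum(FEATURE_SCORES.get(name, 0.0) for name in set(features_present))


def rank_patients(patients: Dict[str, Iterable[str]]) -> List[Tuple[str, float]]:
    """Return (patient_id, score) pairs sorted with the highest risk score first."""
    scored = [(pid, fd_risk_score(features)) for pid, features in patients.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


cohort = {
    "patient_a": ["proteinuria"],
    "patient_b": ["angiokeratoma", "acroparesthesia"],
}
print(rank_patients(cohort))
```

In this sketch a feature counts once per patient however many times it is mentioned, and unknown features contribute nothing; both are design assumptions rather than details taken from the paper.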
Affiliation(s)
- Adrian A Michalski
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Analytical Chemistry, Nicolaus Copernicus University Ludwik Rydygier Collegium Medicum, 85-089 Bydgoszcz, Poland
- Karol Lis
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Hematology, Transplantation and Internal Medicine, Medical University of Warsaw, 02-097 Warsaw, Poland
- Joanna Stankiewicz
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Pediatrics, Hematology and Oncology, Nicolaus Copernicus University Ludwik Rydygier Collegium Medicum, 85-094 Bydgoszcz, Poland
- Sylwester M Kloska
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Forensic Medicine, Nicolaus Copernicus University Ludwik Rydygier Collegium Medicum, 85-067 Bydgoszcz, Poland
- Arkadiusz Sycz
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Faculty of Mathematics and Information Science, Warsaw University of Technology, 00-662 Warsaw, Poland
- Marek Dudziński
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Hematology, Institute of Medical Sciences, College of Medical Sciences, University of Rzeszow, 35-959 Rzeszow, Poland
- Katarzyna Muras-Szwedziak
  - Saventic Foundation, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Nephrology, Hypertension and Kidney Transplantation, Medical University of Lodz, 90-419 Lodz, Poland
- Michał Nowicki
  - Saventic Foundation, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Nephrology, Hypertension and Kidney Transplantation, Medical University of Lodz, 90-419 Lodz, Poland
- Stanisława Bazan-Socha
  - Saventic Foundation, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Internal Medicine, Faculty of Medicine, Jagiellonian University Medical College, 31-008 Krakow, Poland
- Michal J Dabrowski
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Computational Biology Group, Institute of Computer Science of the Polish Academy of Sciences, 01-248 Warsaw, Poland
- Grzegorz W Basak
  - Saventic Health, Polna 66/12 Street, 87-100 Torun, Poland
  - Department of Hematology, Transplantation and Internal Medicine, Medical University of Warsaw, 02-097 Warsaw, Poland
5
Alshahrani SM, Khan NA. COVID-19 advising application development for Apple devices (iOS). PeerJ Comput Sci 2023; 9:e1274. [PMID: 37346730 PMCID: PMC10280587 DOI: 10.7717/peerj-cs.1274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Accepted: 02/13/2023] [Indexed: 06/23/2023]
Abstract
COVID-19 was one of humanity's most devastating health crises. Billions of people suffered during this pandemic. Compared with previous global pandemics, societies were better equipped with technical support systems during this crisis. The intersection of data from healthcare units and the analysis of these data in various sophisticated systems were critical factors. Different healthcare units have given special consideration to advancing technical inputs to fight such situations. The field of natural language processing (NLP) has dramatically supported this. Despite the primitive methods for monitoring the bio-metric factors of a person, the use of cognitive science has emerged as one of the most critical features during this pandemic era. One of the essential features is the potential to understand the data based on various texts and user inputs. The deployment of various NLP systems is one of the most challenging factors in handling the bulk amount of data flowing from multiple sources. This study focused on developing a powerful application to advise patients suffering from ailments related to COVID-19. NLP is used to help a user identify their current situation and make necessary decisions when infected. This article also summarises the challenges associated with NLP and its usage for future NLP-based applications focusing on healthcare units. A couple of applications exist for Android-based systems as well as web-based chatbot systems. In terms of security and safety, application development for iOS is more advanced. This study also explains the block meant of an application for advising COVID-19 infection. An NLP-powered application for the iOS operating system is one of a kind and will help people who need proper guidance. The article also describes NLP-based application development for healthcare problems associated with personal reporting systems.
Affiliation(s)
- Saeed M. Alshahrani
  - Department of Computer Science, College of Computing and Information Technology, Shaqra University, Shaqra, Riyadh, Saudi Arabia
- Nayyar Ahmed Khan
  - Department of Computer Science, College of Computing and Information Technology, Shaqra University, Shaqra, Riyadh, Saudi Arabia
6
Wang L, Foer D, Zhang Y, Karlson EW, Bates DW, Zhou L. Post-Acute COVID-19 Respiratory Symptoms in Patients With Asthma: An Electronic Health Records-Based Study. The Journal of Allergy and Clinical Immunology: In Practice 2023; 11:825-835.e3. [PMID: 36566779 PMCID: PMC9773736 DOI: 10.1016/j.jaip.2022.12.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/23/2022] [Revised: 11/27/2022] [Accepted: 12/01/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND Post-viral respiratory symptoms are common among patients with asthma. Respiratory symptoms after acute COVID-19 are widely reported in the general population, but large-scale studies identifying symptom risk for patients with asthma are lacking. OBJECTIVE To identify and compare risk for post-acute COVID-19 respiratory symptoms in patients with and without asthma. METHODS This retrospective, observational cohort study included COVID-19-positive patients between March 4, 2020, and January 20, 2021, with up to 180 days of health care follow-up in a health care system in the Northeastern United States. Respiratory symptoms recorded in clinical notes from days 28 to 180 after COVID-19 diagnosis were extracted using natural language processing. Cohorts were stratified by hospitalization status during the acute COVID-19 period. Univariable and multivariable analyses were used to compare symptoms among patients with and without asthma adjusting for demographic and clinical confounders. RESULTS Among 31,084 eligible patients with COVID-19, 2863 (9.2%) had hospitalization during the acute COVID-19 period; 4049 (13.0%) had a history of asthma, accounting for 13.8% of hospitalized and 12.9% of nonhospitalized patients. In the post-acute COVID-19 period, patients with asthma had significantly higher risk of shortness of breath, cough, bronchospasm, and wheezing than patients without an asthma history. Incident respiratory symptoms of bronchospasm and wheezing were also higher in patients with asthma. Patients with asthma who had not been hospitalized during acute COVID-19 had additionally higher risk of cough, abnormal breathing, sputum changes, and a wider range of incident respiratory symptoms. CONCLUSION Patients with asthma may have an under-recognized burden of respiratory symptoms after COVID-19 warranting increased awareness and monitoring in this population.
Affiliation(s)
- Liqin Wang
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Mass
- Dinah Foer
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Mass
  - Division of Allergy and Clinical Immunology, Department of Medicine, Brigham and Women's Hospital, Boston, Mass
- Yuqing Zhang
  - Division of Rheumatology, Allergy, and Immunology, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, Mass
- Elizabeth W Karlson
  - Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Mass
- David W Bates
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Mass
- Li Zhou
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, Mass
7
Jeon E, Kim A, Lee J, Heo H, Lee H, Woo K. Developing a Classification Algorithm for Prediabetes Risk Detection From Home Care Nursing Notes: Using Natural Language Processing. Comput Inform Nurs 2023:00024665-990000000-00087. [PMID: 37165830 DOI: 10.1097/cin.0000000000001000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/12/2023]
Abstract
This study developed and validated a rule-based classification algorithm for prediabetes risk detection using natural language processing from home care nursing notes. First, we developed prediabetes-related symptomatic terms in English and Korean. Second, we used natural language processing to preprocess the notes. Third, we created a rule-based classification algorithm with 31 484 notes, excluding 315 instances of missing data. The final algorithm was validated by measuring accuracy, precision, recall, and the F1 score against a gold standard testing set (400 notes). The developed terms comprised 11 categories and 1639 words in Korean and 1181 words in English. Using the rule-based classification algorithm, 42.2% of the notes comprised one or more prediabetic symptoms. The algorithm achieved high performance when applied to the gold standard testing set. We proposed a rule-based natural language processing algorithm to optimize the classification of the prediabetes risk group, depending on whether the home care nursing notes contain prediabetes-related symptomatic terms. Tokenization based on white space and the rule-based algorithm were brought into effect to detect the prediabetes symptomatic terms. Applying this algorithm to electronic health records systems will increase the possibility of preventing diabetes onset through early detection of risk groups and provision of tailored intervention.
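The approach summarized above (white-space tokenization followed by a rule that flags a note when it contains a symptomatic term) can be illustrated with a short sketch. It is not the study's algorithm: the two-category `LEXICON` and the function names are hypothetical stand-ins for the 11 categories and bilingual term lists described in the abstract, and only single-word English terms are handled.

```python
from typing import Dict, List, Set

# Hypothetical mini-lexicon; the real one has 11 categories with Korean and
# English terms that are not listed in the abstract.
LEXICON: Dict[str, Set[str]] = {
    "thirst": {"thirsty", "polydipsia"},
    "fatigue": {"fatigue", "tired"},
}


def tokenize(note: str) -> List[str]:
    """Split on white space, then lowercase and strip surrounding punctuation."""
    return [token.strip(".,;:!?()\"'").lower() for token in note.split()]


def matched_categories(note: str) -> Set[str]:
    """Return the lexicon categories with at least one term present in the note."""
    tokens = set(tokenize(note))
    return {category for category, terms in LEXICON.items() if tokens & terms}


def is_at_risk(note: str) -> bool:
    """Rule: flag the note if it contains one or more symptomatic terms."""
    return bool(matched_categories(note))


print(is_at_risk("Patient reports feeling tired, and thirsty."))
print(is_at_risk("No complaints today."))
```

A rule this simple ignores negation ("denies fatigue") and multi-word terms, so a production rule set would need additional handling for both.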
Affiliation(s)
- Eunjoo Jeon
- Author Affiliations: Technology Research, SamsungSDS (Dr Jeon); College of Nursing, Seoul National University (Mss Kim, J. Lee, and H. Lee and Dr Woo); and Seoul National University Hospital (Ms Heo), Seoul, South Korea
8
Kumar A, Sharaff A. PubExN: An Automated PubMed Bulk Article Extractor with Affiliation Normalization Package. SN Computer Science 2023; 4:353. [PMID: 37128512 PMCID: PMC10132428 DOI: 10.1007/s42979-023-01687-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2022] [Accepted: 01/11/2023] [Indexed: 05/03/2023]
Abstract
Biomedical article extraction is the preliminary step for every biomedical application. These applications help find gene, disease, chemical, drug, and protein entities. PubExN can also support applications that find relations between entities, such as gene-gene relations, drug-disease interactions, and chemical-protein relations. In most cases, domain experts perform this extraction manually. Manual extraction is time-consuming, and there is a high probability that documents will be missed. To avoid this, a Python package is introduced to automate bulk extraction from the PubMed database. The extraction covers all citation information along with the associated abstract, and a batch approach is used for bulk extraction. The motivation for developing PubExN was to provide flexibility in extracting the text data of biomedical articles from NCBI's PubMed database. An article record in NCBI's PubMed database contains the article ID, known as the PubMed ID (PMID), the title, the abstract, author information, and so on. This package will benefit biomedical text mining research, including biomedical named entity recognition, biomedical relation extraction, literature discovery, knowledge base creation, and various biomedical natural language processing (NLP) tasks. In addition, it could be used for author name disambiguation and new drug discovery. The package will save time and effort in extracting and normalizing PubMed articles.
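The batch approach mentioned in this abstract can be pictured as splitting a long list of PMIDs into fixed-size groups, each of which would be fetched in one request. The sketch below shows only that chunking step; it is not PubExN's API, the helper name `batched` and the batch size are our assumptions, and no network call is made. The PMIDs are taken from entries on this page.

```python
from typing import Iterator, List, Sequence


def batched(pmids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size PMIDs, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(pmids), batch_size):
        yield list(pmids[start:start + batch_size])


# PMIDs of the first five entries listed on this page.
pmids = ["38573752", "38875566", "37978498", "37240705", "37346730"]
for batch in batched(pmids, batch_size=2):
    # A real extractor would request the records for this batch here.
    print(batch)
```

Keeping batches small bounds the size of each request and means a failed request only requires retrying one group rather than the whole list.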
Affiliation(s)
- Ashutosh Kumar
  - Department of Computer Science and Engineering, National Institute of Technology Raipur, G. E. Road, Raipur, 492001 Chhattisgarh, India
- Aakanksha Sharaff
  - Department of Computer Science and Engineering, National Institute of Technology Raipur, G. E. Road, Raipur, 492001 Chhattisgarh, India
9
Mavragani A, Sanchez T, Ackerson BK, Hong V, Skarbinski J, Yau V, Qian L, Fischer H, Shaw SF, Caparosa S, Xie F. Natural Language Processing for Improved Characterization of COVID-19 Symptoms: Observational Study of 350,000 Patients in a Large Integrated Health Care System. JMIR Public Health Surveill 2022; 8:e41529. [PMID: 36446133 PMCID: PMC9822566 DOI: 10.2196/41529] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 07/29/2022] [Revised: 11/07/2022] [Accepted: 11/29/2022] [Indexed: 12/05/2022]
Abstract
BACKGROUND Natural language processing (NLP) of unstructured text from electronic medical records (EMR) can improve the characterization of COVID-19 signs and symptoms, but large-scale studies demonstrating the real-world application and validation of NLP for this purpose are limited. OBJECTIVE The aim of this paper is to assess the contribution of NLP when identifying COVID-19 signs and symptoms from EMR. METHODS This study was conducted in Kaiser Permanente Southern California, a large integrated health care system using data from all patients with positive SARS-CoV-2 laboratory tests from March 2020 to May 2021. An NLP algorithm was developed to extract free text from EMR on 12 established signs and symptoms of COVID-19, including fever, cough, headache, fatigue, dyspnea, chills, sore throat, myalgia, anosmia, diarrhea, vomiting or nausea, and abdominal pain. The proportion of patients reporting each symptom and the corresponding onset dates were described before and after supplementing structured EMR data with NLP-extracted signs and symptoms. A random sample of 100 chart-reviewed and adjudicated SARS-CoV-2-positive cases were used to validate the algorithm performance. RESULTS A total of 359,938 patients (mean age 40.4 [SD 19.2] years; 191,630/359,938, 53% female) with confirmed SARS-CoV-2 infection were identified over the study period. The most common signs and symptoms identified through NLP-supplemented analyses were cough (220,631/359,938, 61%), fever (185,618/359,938, 52%), myalgia (153,042/359,938, 43%), and headache (144,705/359,938, 40%). The NLP algorithm identified an additional 55,568 (15%) symptomatic cases that were previously defined as asymptomatic using structured data alone. The proportion of additional cases with each selected symptom identified in NLP-supplemented analysis varied across the selected symptoms, from 29% (63,742/220,631) of all records for cough to 64% (38,884/60,865) of all records with nausea or vomiting. 
Of the 295,305 symptomatic patients, the median time from symptom onset to testing was 3 days using structured data alone, whereas the NLP algorithm identified signs or symptoms approximately 1 day earlier. When validated against chart-reviewed cases, the NLP algorithm successfully identified signs and symptoms with consistently high sensitivity (ranging from 87% to 100%) and specificity (94% to 100%). CONCLUSIONS These findings demonstrate that NLP can identify and characterize a broad set of COVID-19 signs and symptoms from unstructured EMR data with enhanced detail and timeliness compared with structured data alone.
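The percentages in this abstract use different denominators: some are shares of all 359,938 patients, while the "additional cases" figures are shares of the patients recorded with that particular symptom. A small Python sketch (the helper `percent` is ours) recomputes a few of the reported values from the counts given above to make the denominators explicit.

```python
def percent(numerator: int, denominator: int) -> int:
    """Return the share numerator/denominator as a whole-number percentage."""
    return round(100 * numerator / denominator)


total_patients = 359_938

# Share of all patients with cough after NLP supplementation.
print(percent(220_631, total_patients))

# Share of all patients reclassified from asymptomatic to symptomatic by NLP.
print(percent(55_568, total_patients))

# Share of cough records that were contributed by NLP alone.
print(percent(63_742, 220_631))

# Share of nausea-or-vomiting records that were contributed by NLP alone.
print(percent(38_884, 60_865))
```

The four values round to 61, 15, 29, and 64, which agree with the percentages reported in the abstract.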
Affiliation(s)
- Bradley K Ackerson
  - Southern California Permanente Medical Group, Harbor City, CA, United States
- Vennis Hong
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Jacek Skarbinski
  - The Permanente Medical Group, Kaiser Permanente Northern California, Oakland, CA, United States
  - Division of Research, Kaiser Permanente Northern California, Oakland, CA, United States
- Vincent Yau
  - Genentech, a Member of the Roche Group, San Francisco, CA, United States
- Lei Qian
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Heidi Fischer
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Sally F Shaw
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Susan Caparosa
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
- Fagen Xie
  - Department of Research & Evaluation, Kaiser Permanente Southern California, Pasadena, CA, United States
10
Al-Garadi MA, Yang YC, Sarker A. The Role of Natural Language Processing during the COVID-19 Pandemic: Health Applications, Opportunities, and Challenges. Healthcare (Basel) 2022; 10:2270. [PMID: 36421593 PMCID: PMC9690240 DOI: 10.3390/healthcare10112270] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/14/2022] [Revised: 11/03/2022] [Accepted: 11/06/2022] [Indexed: 07/30/2023]
Abstract
The COVID-19 pandemic is the most devastating public health crisis in at least a century and has affected the lives of billions of people worldwide in unprecedented ways. Compared to pandemics of this scale in the past, societies are now equipped with advanced technologies that can mitigate the impacts of pandemics if utilized appropriately. However, opportunities are currently not fully utilized, particularly at the intersection of data science and health. Health-related big data and technological advances have the potential to significantly aid the fight against such pandemics, including the current pandemic's ongoing and long-term impacts. Specifically, the field of natural language processing (NLP) has enormous potential at a time when vast amounts of text-based data are continuously generated from a multitude of sources, such as health/hospital systems, published medical literature, and social media. Effectively mitigating the impacts of the pandemic requires tackling challenges associated with the application and deployment of NLP systems. In this paper, we review the applications of NLP to address diverse aspects of the COVID-19 pandemic. We outline key NLP-related advances on a chosen set of topics reported in the literature and discuss the opportunities and challenges associated with applying NLP during the current pandemic and future ones. These opportunities and challenges can guide future research aimed at improving the current health and social response systems and pandemic preparedness.
Affiliation(s)
- Mohammed Ali Al-Garadi
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37240, USA
- Yuan-Chi Yang
  - Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA 30322, USA
- Abeed Sarker
  - Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA 30322, USA

11
A Natural Language Processing (NLP) Evaluation on COVID-19 Rumour Dataset Using Deep Learning Techniques. Computational Intelligence and Neuroscience 2022; 2022:6561622. [PMID: 36156967 PMCID: PMC9492356 DOI: 10.1155/2022/6561622] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2022] [Revised: 06/18/2022] [Accepted: 07/22/2022] [Indexed: 11/17/2022]
Abstract
Context and Background. Since December 2019, the coronavirus (COVID-19) epidemic has sparked considerable alarm in the general community and significantly affected societal attitudes and perceptions. Apart from the disease itself, many people suffer from anxiety and depression caused by the disease and the ongoing threat of an outbreak. Because of the fast propagation of the virus and of misleading or fake information, the topics of public discourse shift, resulting in significant confusion in certain places. Rumours are unproven claims or stories that propagate and promote sentiments of prejudice, hatred, and fear. Objective. The study's objective is to propose a novel solution for detecting fake news using state-of-the-art machine learning (ML) and deep learning (DL) models, and to analyse which models perform best at detecting fake news. Method. We adapted a COVID-19 rumours dataset, which incorporates rumours from news websites and tweets together with information about the rumours. The data are analysed using natural language processing (NLP) and DL approaches, and the effectiveness of the ML and DL algorithms is assessed by accuracy, precision, recall, and F1 score. Results. The source dataset (cited in the paper) comprises 9200 comments collected from Google and 34,779 Twitter postings filtered for phrases connected with COVID-19-related fake news. Experiment 1. The dataset was assessed on three criteria: veracity, stance, and sentiment. Each criterion has its own labels, and we applied the DL algorithms to each criterion separately. We compared (i) LSTM and (ii) Temporal Convolution Network (TCN) models. The TCN model performed better on every evaluation metric, so we used it for the practical application. Experiment 2. In the second experiment, we used further state-of-the-art deep learning models: (i) simple RNN; (ii) LSTM + word embedding; (iii) bidirectional + word embedding; (iv) LSTM + CNN-1D; and (v) BERT. We evaluated these models on all three datasets (veracity, stance, and sentiment). In this second evaluation, BERT outperformed the other models compared.

12
Shapiro M, Landau R, Shay S, Kaminski M, Verhovsky G. Early Detection of COVID-19 outbreaks using Textual Analysis of Electronic Medical Records. J Clin Virol 2022; 155:105251. [PMID: 35973330 PMCID: PMC9347140 DOI: 10.1016/j.jcv.2022.105251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Revised: 07/10/2022] [Accepted: 08/02/2022] [Indexed: 11/26/2022]
Abstract
Purpose Our objective was to develop a tool promoting early detection of COVID-19 cases by focusing epidemiological investigations and PCR examinations during a period of limited testing capabilities. Methods We developed an algorithm for analyzing medical records recorded by healthcare providers in the Israeli Defense Forces (IDF). The algorithm utilized textual analysis to detect patients presenting with suspicious symptoms and was tested among 92 randomly selected units. Detection of a potential cluster of patients in a unit prompted a focused epidemiological investigation aided by data provided by the algorithm. Results During a month of follow-up, the algorithm flagged 17 of the units for investigation. The subsequent epidemiological investigations led to the testing of 78 persons and the detection of eight cases in four clusters that had previously gone unnoticed. The resulting positive test rate of 10.25% was five times higher than the IDF average at the time of the study. No cases of COVID-19 in the examined units were missed by the algorithm. Conclusions This study depicts the successful development and large-scale deployment of a textual analysis-based algorithm for early detection of COVID-19 cases, demonstrating the potential of natural language processing of medical text as a tool for promoting public health.
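The positive test rate in this abstract can be checked from the two counts it reports. The sketch below is only a worked calculation. The baseline it derives is an inference from the abstract's "five" comparison (read here as five times the average), not a figure the abstract states.

```python
cases_detected = 8    # cases found by the follow-up investigations (from the abstract)
persons_tested = 78   # persons tested as a result (from the abstract)

positive_rate = cases_detected / persons_tested
print(f"positive test rate: {positive_rate:.2%}")

# Assumption: the rate is five times the IDF average, so the implied average is
# one fifth of the observed rate. This value is derived, not reported.
implied_baseline = positive_rate / 5
print(f"implied IDF average: {implied_baseline:.2%}")
```

Rounded to two decimals the ratio is 10.26%. The abstract's 10.25% is consistent with truncating rather than rounding the same quotient.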

13
Faris H, Faris M, Habib M, Alomari A. Automatic symptoms identification from a massive volume of unstructured medical consultations using deep neural and BERT models. Heliyon 2022; 8:e09683. [PMID: 35761935 PMCID: PMC9233221 DOI: 10.1016/j.heliyon.2022.e09683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Revised: 04/10/2022] [Accepted: 06/01/2022] [Indexed: 11/25/2022] Open
Abstract
Automatic symptom identification plays a crucial role in assisting doctors during the diagnosis process in telemedicine. In general, physicians spend considerable time on clinical documentation and symptom identification, which is unfeasible given their full schedules. With text-based consultation services in telemedicine, identifying symptoms from a user's consultation is a sophisticated and time-consuming process. Moreover, at Altibbi, an Arabic telemedicine platform and the context of this work, users consult doctors and describe their conditions in different Arabic dialects, which makes the problem more complex and challenging. Therefore, in this work, an advanced deep learning approach is developed for consultations written in multiple dialects. The approach is formulated as a multi-label multi-class classification using features extracted based on AraBERT and fine-tuned on the bidirectional long short-term memory (BiLSTM) network. The fine-tuning of BiLSTM relies on features engineered based on different variants of the bidirectional encoder representations from transformers (BERT). Evaluating the models based on precision, recall, and a customized hit rate showed a successful identification of symptoms from Arabic texts with promising accuracy. Hence, this paves the way toward deploying an automated symptom identification model in production at Altibbi, which can help general practitioners in telemedicine provide more efficient and accurate consultations.
Affiliation(s)
- Hossam Faris
  - King Abdullah II School for Information Technology, The University of Jordan, 11942, Jordan
  - Research Centre for Information and Communications Technologies of the University of Granada (CITIC-UGR), University of Granada, Granada, Spain
  - Altibbi (https://altibbi.com), Amman, Jordan
- Alaa Alomari
  - Altibbi (https://altibbi.com), Amman, Jordan
  - School of Informatics and Telecommunications Engineering, University of Granada, Granada, Spain

14
Knosp BM, Craven CK, Dorr DA, Bernstam EV, Campion TR. Understanding enterprise data warehouses to support clinical and translational research: enterprise information technology relationships, data governance, workforce, and cloud computing. J Am Med Inform Assoc 2022; 29:671-676. [PMID: 35289370 PMCID: PMC8922193 DOI: 10.1093/jamia/ocab256] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/15/2021] [Accepted: 11/05/2021] [Indexed: 01/22/2023] Open
Abstract
OBJECTIVE Among National Institutes of Health Clinical and Translational Science Award (CTSA) hubs, effective approaches for enterprise data warehouses for research (EDW4R) development, maintenance, and sustainability remain unclear. The goal of this qualitative study was to understand CTSA EDW4R operations within the broader contexts of academic medical centers and technology. MATERIALS AND METHODS We performed a directed content analysis of transcripts generated from semistructured interviews with informatics leaders from 20 CTSA hubs. RESULTS Respondents referred to services provided by health system, university, and medical school information technology (IT) organizations as "enterprise information technology (IT)." Seventy-five percent of respondents stated that the team providing EDW4R service at their hub was separate from enterprise IT; strong relationships between EDW4R teams and enterprise IT were critical for success. Managing challenges of EDW4R staffing was made easier by executive leadership support. Data governance appeared to be a work in progress, as most hubs reported complex and incomplete processes, especially for commercial data sharing. Although nearly all hubs (n = 16) described use of cloud computing for specific projects, only 2 hubs reported using a cloud-based EDW4R. Respondents described EDW4R cloud migration facilitators, barriers, and opportunities. DISCUSSION Descriptions of approaches to how EDW4R teams at CTSA hubs work with enterprise IT organizations, manage workforces, make decisions about data, and approach cloud computing provide insights for institutions seeking to leverage patient data for research. CONCLUSION Identification of EDW4R best practices is challenging, and this study helps identify a breadth of viable options for CTSA hubs to consider when implementing EDW4R services.
Affiliation(s)
- Boyd M Knosp
  - Roy J. and Lucille A. Carver College of Medicine and the Institute for Clinical & Translational Science, University of Iowa, Iowa City, Iowa, USA
- Catherine K Craven
  - Division of Clinical Research Informatics, Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, Texas, USA
- David A Dorr
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, Oregon, USA
  - Department of Medicine, Oregon Health & Science University, Portland, Oregon, USA
- Elmer V Bernstam
  - Center for Clinical and Translational Sciences, University of Texas Health Science Center, Houston, Texas, USA
- Thomas R Campion
  - Clinical & Translational Science Center, Weill Cornell Medicine, New York, New York, USA
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA

15
Hasan A, Levene M, Weston D, Fromson R, Koslover N, Levene T. Monitoring Covid-19 on social media using a novel triage and diagnosis approach. J Med Internet Res 2022; 24:e30397. [PMID: 35142636 PMCID: PMC8887561 DOI: 10.2196/30397] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/13/2021] [Revised: 07/09/2021] [Accepted: 02/05/2022] [Indexed: 12/23/2022] Open
Abstract
Background The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. Here, we adopt a triage and diagnosis approach to analyzing social media posts using machine learning techniques for the purpose of disease detection and surveillance. We thus obtain useful prevalence and incidence statistics to identify disease symptoms and their severities, motivated by public health concerns. Objective This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts in order to provide researchers and public health practitioners with additional information on the symptoms, severity, and prevalence of the disease rather than to provide an actionable decision at the individual level. Methods The text processing pipeline first extracted COVID-19 symptoms and related concepts, such as severity, duration, negations, and body parts, from patients’ posts using conditional random fields. An unsupervised rule-based algorithm was then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations were subsequently used to construct 2 different vector representations of each post. These vectors were separately applied to build support vector machine learning models to triage patients into 3 categories and diagnose them for COVID-19. 
Results We reported macro- and microaveraged F1 scores in the range of 71%-96% and 61%-87%, respectively, for the triage and diagnosis of COVID-19 when the models were trained on human-labeled data. Our experimental results indicated that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. In addition, we highlighted important features uncovered by our diagnostic machine learning models and compared them with the most frequent symptoms revealed in another COVID-19 data set. In particular, we found that the most important features are not always the most frequent ones. Conclusions Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from social media natural language narratives, using a machine learning pipeline in order to provide information on the severity and prevalence of the disease for use within health surveillance systems.
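The Results above distinguish macro-averaged from micro-averaged F1. A minimal sketch of the difference, assuming per-class counts of true positives, false positives, and false negatives (the category names and counts are hypothetical, not the study's triage classes or data):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float]:
    """counts maps class name -> (tp, fp, fn). Returns (macro F1, micro F1)."""
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent classes dominate the score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical three-category triage results.
counts = {
    "category_a": (90, 10, 10),
    "category_b": (40, 20, 20),
    "category_c": (5, 5, 15),
}
macro, micro = macro_micro_f1(counts)
print(f"macro F1={macro:.3f}, micro F1={micro:.3f}")
```

In this made-up example the rare, poorly classified category pulls the macro average below the micro average, which is why studies often report both.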
Affiliation(s)
- Abul Hasan
  - Birkbeck, University of London, Malet Street, Bloomsbury, London, GB
- Mark Levene
  - Birkbeck, University of London, Malet Street, Bloomsbury, London, GB
- David Weston
  - Birkbeck, University of London, Malet Street, Bloomsbury, London, GB
- Renate Fromson
  - Barnet General Hospital, Wellhouse Lane, London EN5 3DJ, United Kingdom, London, GB
- Nicolas Koslover
  - Barnet General Hospital, Wellhouse Lane, London EN5 3DJ, United Kingdom, London, GB
- Tamara Levene
  - Barnet General Hospital, Wellhouse Lane, London EN5 3DJ, United Kingdom, London, GB

16
Gupta T, Debele TA, Wei YF, Gupta A, Murtaza M, Su WP. Synergistic Action of Immunotherapy and Nanotherapy against Cancer Patients Infected with SARS-CoV-2 and the Use of Artificial Intelligence. Cancers (Basel) 2022; 14:213. [PMID: 35008377 PMCID: PMC8750412 DOI: 10.3390/cancers14010213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Revised: 12/28/2021] [Accepted: 12/30/2021] [Indexed: 01/08/2023] Open
Abstract
Since 2019, the SARS-CoV-2 pandemic has caused huge chaos throughout the world, and the major threat has been posed to immunocompromised individuals, including cancer patients, whose weakened immune response makes them vulnerable and susceptible to the virus. Oncologists and their patients face many problems with treatment sessions, as surgery, chemotherapy, or radiotherapy often must be postponed. An approach that could be adopted, especially for cancer patients, is the amalgamation of immunotherapy and nanotherapy, which can reduce the burden on healthcare at this peak time of the infection. There is also a need to predict or analyze the data of cancer patients who are at severe risk of being exposed to an infection in order to reduce the mortality rate. Artificial intelligence (AI) could be incorporated so that real-time data are available to physicians according to each patient's clinical characteristics and past treatments. With these data, it will become easier for them to modify or replace the treatment to increase its efficacy against the infection. The combination of immunotherapy and nanotherapy will be targeted to treat cancer patients diagnosed with SARS-CoV-2, and AI will act as icing on the cake to monitor, predict, and analyze patient data to improve the treatment regime for the most vulnerable patients.
Affiliation(s)
- Tanvi Gupta
  - Institute of Clinical Medicine, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
- Tilahun Ayane Debele
  - Department of Biomedical, Chemical & Environmental Engineering, College of Engineering and Applied Science (CEAS), University of Cincinnati, Cincinnati, OH 45221, USA
- Yu-Feng Wei
  - Department of Internal Medicine, School of Medicine for International Students, College of Medicine, E-Da Cancer Hospital, I-Shou University, Kaohsiung 824, Taiwan
- Anish Gupta
  - Devscope IT, First Floor, 40A/B Gandhi Nagar, Jammu 180001, India
- Mohd Murtaza
  - Microbial Biotechnology Division, CSIR-Indian Institute of Integrative Medicine, Jammu 180012, India
- Wen-Pin Su
  - Institute of Clinical Medicine, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
  - Departments of Oncology and Internal Medicine, National Cheng Kung University Hospital, College of Medicine, National Cheng Kung University, Tainan 704, Taiwan
  - Center of Applied Nanomedicine, National Cheng Kung University, Tainan 704, Taiwan

17
Yin AL, Guo WL, Sholle ET, Rajan M, Alshak MN, Choi JJ, Goyal P, Jabri A, Li HA, Pinheiro LC, Wehmeyer GT, Weiner M, Safford MM, Campion TR, Cole CL. Comparing automated vs. manual data collection for COVID-specific medications from electronic health records. Int J Med Inform 2022; 157:104622. [PMID: 34741892 PMCID: PMC8529289 DOI: 10.1016/j.ijmedinf.2021.104622] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/10/2021] [Revised: 09/19/2021] [Accepted: 10/15/2021] [Indexed: 12/29/2022]
Abstract
INTRODUCTION Data extraction from electronic health record (EHR) systems occurs through manual abstraction, automated extraction, or a combination of both. While each method has its strengths and weaknesses, both are necessary for retrospective observational research as well as sudden clinical events, like the COVID-19 pandemic. Assessing the strengths, weaknesses, and potentials of these methods is important to continue to understand optimal approaches to extracting clinical data. We set out to assess automated and manual techniques for collecting medication use data in patients with COVID-19 to inform future observational studies that extract data from the electronic health record (EHR). MATERIALS AND METHODS For 4,123 COVID-positive patients hospitalized and/or seen in the emergency department at an academic medical center between 03/03/2020 and 05/15/2020, we compared medication use data of 25 medications or drug classes collected through manual abstraction and automated extraction from the EHR. Quantitatively, we assessed concordance using Cohen's kappa to measure interrater reliability, and qualitatively, we audited observed discrepancies to determine causes of inconsistencies. RESULTS For the 16 inpatient medications, 11 (69%) demonstrated moderate or better agreement; 7 of those demonstrated strong or almost perfect agreement. For 9 outpatient medications, 3 (33%) demonstrated moderate agreement, but none achieved strong or almost perfect agreement. We audited 12% of all discrepancies (716/5,790) and, in those audited, observed three principal categories of error: human error in manual abstraction (26%), errors in the extract-transform-load (ETL) or mapping of the automated extraction (41%), and abstraction-query mismatch (33%). CONCLUSION Our findings suggest many inpatient medications can be collected reliably through automated extraction, especially when abstraction instructions are designed with data architecture in mind. 
We discuss quality issues, concerns, and improvements for institutions to consider when crafting an approach. During crises, institutions must decide how to allocate limited resources. We show that automated extraction of medications is feasible and make recommendations on how to improve future iterations.
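This abstract measures concordance between manual abstraction and automated extraction with Cohen's kappa. A minimal sketch of that statistic for two sets of labels, assuming one yes/no flag per patient for a single medication (the label vectors are invented for illustration):

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    labels = set(rater_a) | set(rater_b)
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical flags (1 = medication recorded) for ten patients.
manual = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
automated = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(manual, automated)
print(f"kappa={kappa:.3f}")
```

Here the raw agreement is 80%, but kappa is lower because about half of that agreement would be expected by chance given how often each source records the medication.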
Affiliation(s)
- Andrew L. Yin
  - Weill Cornell Medical College, Weill Cornell Medicine, New York, NY, United States
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
  - Corresponding author at: 1300 York Avenue, New York, NY 10021, United States
- Winston L. Guo
  - Weill Cornell Medical College, Weill Cornell Medicine, New York, NY, United States
- Evan T. Sholle
  - Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, United States
- Mangala Rajan
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Mark N. Alshak
  - Weill Cornell Medical College, Weill Cornell Medicine, New York, NY, United States
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Justin J. Choi
  - Division of General Internal Medicine, Weill Cornell Medicine, New York, NY, United States
- Parag Goyal
  - Division of General Internal Medicine, Weill Cornell Medicine, New York, NY, United States
- Assem Jabri
  - Division of General Internal Medicine, Weill Cornell Medicine, New York, NY, United States
- Han A. Li
  - Weill Cornell Medical College, Weill Cornell Medicine, New York, NY, United States
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Laura C. Pinheiro
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Graham T. Wehmeyer
  - Weill Cornell Medical College, Weill Cornell Medicine, New York, NY, United States
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Mark Weiner
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
  - Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, United States
- Monika M. Safford
  - Division of General Internal Medicine, Weill Cornell Medicine, New York, NY, United States
- Thomas R. Campion
  - Information Technologies & Services Department, Weill Cornell Medicine, New York, NY, United States
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, United States
  - Clinical and Translational Science Center, Weill Cornell Medicine, New York, NY, United States
- Curtis L. Cole
  - Department of Medicine, Weill Cornell Medicine, New York, NY, United States
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, United States

18
Wang L, Foer D, MacPhaul E, Lo YC, Bates DW, Zhou L. PASCLex: A comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon derived from electronic health record clinical notes. J Biomed Inform 2022; 125:103951. [PMID: 34785382 PMCID: PMC8590503 DOI: 10.1016/j.jbi.2021.103951] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 08/04/2021] [Revised: 11/06/2021] [Accepted: 11/06/2021] [Indexed: 01/04/2023]
Abstract
OBJECTIVE To develop a comprehensive post-acute sequelae of COVID-19 (PASC) symptom lexicon (PASCLex) from clinical notes to support PASC symptom identification and research. METHODS We identified 26,117 COVID-19 positive patients from the Mass General Brigham's electronic health records (EHR) and extracted 328,879 clinical notes from their post-acute infection period (day 51-110 from first positive COVID-19 test). PASCLex incorporated Unified Medical Language System® (UMLS) Metathesaurus concepts and synonyms based on selected semantic types. The MTERMS natural language processing (NLP) tool was used to automatically extract symptoms from a development dataset. The lexicon was iteratively revised with manual chart review, keyword search, concept consolidation, and evaluation of NLP output. We assessed the comprehensiveness of PASCLex and the NLP performance using a validation dataset and reported the symptom prevalence across the entire corpus. RESULTS PASCLex included 355 symptoms consolidated from 1520 UMLS concepts of 16,466 synonyms. NLP achieved an averaged precision of 0.94 and an estimated recall of 0.84. Symptoms with the highest frequency included pain (43.1%), anxiety (25.8%), depression (24.0%), fatigue (23.4%), joint pain (21.0%), shortness of breath (20.8%), headache (20.0%), nausea and/or vomiting (19.9%), myalgia (19.0%), and gastroesophageal reflux (18.6%). DISCUSSION AND CONCLUSION PASC symptoms are diverse. A comprehensive lexicon of PASC symptoms can be derived using an ontology-driven, EHR-guided and NLP-assisted approach. By using unstructured data, this approach may improve identification and analysis of patient symptoms in the EHR, and inform prospective study design, preventative care strategies, and therapeutic interventions for patient care.
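The lexicon described above consolidates many synonyms into fewer symptom concepts and then reports how often each symptom appears. The sketch below illustrates that general idea with plain regular expressions. It is not the MTERMS tool or the PASCLex lexicon: the three-entry lexicon, the function names, and the sample notes are all hypothetical, negation is not handled, and prevalence is counted per note as a simplification.

```python
import re

# Hypothetical mini-lexicon: consolidated symptom -> synonyms found in notes.
LEXICON = {
    "shortness of breath": ["shortness of breath", "dyspnea", "sob"],
    "fatigue": ["fatigue", "tiredness", "exhaustion"],
    "headache": ["headache", "cephalalgia"],
}

# One case-insensitive, whole-word pattern per consolidated symptom.
PATTERNS = {
    symptom: re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in synonyms) + r")\b",
        re.IGNORECASE,
    )
    for symptom, synonyms in LEXICON.items()
}


def extract_symptoms(note: str) -> set[str]:
    """Return the consolidated symptoms mentioned anywhere in one note."""
    return {symptom for symptom, pattern in PATTERNS.items() if pattern.search(note)}


def prevalence(notes: list[str]) -> dict[str, float]:
    """Fraction of notes mentioning each symptom (per-note simplification)."""
    if not notes:
        raise ValueError("need at least one note")
    found = [extract_symptoms(note) for note in notes]
    return {s: sum(s in f for f in found) / len(notes) for s in LEXICON}


# Invented example notes.
notes = [
    "Patient reports persistent dyspnea and tiredness.",
    "Ongoing headache; no other complaints.",
    "SOB on exertion, otherwise well.",
    "Follow-up visit, feeling well.",
]
print(prevalence(notes))
```

A production pipeline would also need negation and context handling ("denies headache"), which is one reason the abstract pairs the lexicon with manual chart review and an evaluated NLP tool.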
Affiliation(s)
- Liqin Wang
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
  - Harvard Medical School, Boston, MA, USA
  - Corresponding author at: Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, 399 Revolution Dr, Suite 1315, Somerville, MA 02145, USA
- Dinah Foer
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
  - Harvard Medical School, Boston, MA, USA
  - Division of Allergy and Clinical Immunology, Department of Medicine, Brigham and Women’s Hospital, USA
- Erin MacPhaul
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
- Ying-Chih Lo
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
  - Harvard Medical School, Boston, MA, USA
- David W. Bates
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
  - Harvard Medical School, Boston, MA, USA
- Li Zhou
  - Division of General Internal Medicine and Primary Care, Department of Medicine, Brigham and Women's Hospital, USA
  - Harvard Medical School, Boston, MA, USA

19
Hripcsak G, Schuemie MJ, Madigan D, Ryan PB, Suchard MA. Drawing Reproducible Conclusions from Observational Clinical Data with OHDSI. Yearb Med Inform 2021; 30:283-289. [PMID: 33882595 PMCID: PMC8416226 DOI: 10.1055/s-0041-1726481] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023] Open
Abstract
OBJECTIVE The current observational research literature shows extensive publication bias and contradiction. The Observational Health Data Sciences and Informatics (OHDSI) initiative seeks to improve research reproducibility through open science. METHODS OHDSI has created an international federated data source of electronic health records and administrative claims that covers nearly 10% of the world's population. Using a common data model with a practical schema and extensive vocabulary mappings, data from around the world follow the identical format. OHDSI's research methods emphasize reproducibility, with a large-scale approach to addressing confounding using propensity score adjustment with extensive diagnostics; negative and positive control hypotheses to test for residual systematic error; a variety of data sources to assess consistency and generalizability; a completely open approach including protocol, software, models, parameters, and raw results so that studies can be externally verified; and the study of many hypotheses in parallel so that the operating characteristics of the methods can be assessed. RESULTS OHDSI has already produced findings in areas like hypertension treatment that are being incorporated into practice, and it has produced rigorous studies of COVID-19 that have aided government agencies in their treatment decisions, that have characterized the disease extensively, that have estimated the comparative effects of treatments, and that predict the likelihood of advancing to serious complications. CONCLUSIONS OHDSI practices open science and incorporates a series of methods to address reproducibility. It has produced important results in several areas, including hypertension therapy and COVID-19 research.
Affiliation(s)
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
  - Observational Health Data Sciences and Informatics, New York, New York, USA
- Martijn J. Schuemie
  - Observational Health Data Sciences and Informatics, New York, New York, USA
  - Epidemiology Analytics, Janssen Research and Development, Titusville, New Jersey, USA
- David Madigan
  - Observational Health Data Sciences and Informatics, New York, New York, USA
  - Northeastern University, Boston, Massachusetts, USA
- Patrick B. Ryan
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
  - Observational Health Data Sciences and Informatics, New York, New York, USA
  - Epidemiology Analytics, Janssen Research and Development, Titusville, New Jersey, USA
- Marc A. Suchard
  - Observational Health Data Sciences and Informatics, New York, New York, USA
  - Fielding School of Public Health, Department of Biostatistics, University of California, Los Angeles, Los Angeles, USA
  - David Geffen School of Medicine, Department of Biomathematics, University of California, Los Angeles, Los Angeles, USA

20
Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021; 4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Indexed: 11/09/2022]
Abstract
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
Affiliation(s)
- Qingyu Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Alexis Allot
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Ling Luo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Shankai Yan
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
21
|
Guo Y, Zhang Y, Lyu T, Prosperi M, Wang F, Xu H, Bian J. The application of artificial intelligence and data integration in COVID-19 studies: a scoping review. J Am Med Inform Assoc 2021; 28:2050-2067. [PMID: 34151987 PMCID: PMC8344463 DOI: 10.1093/jamia/ocab098] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/07/2021] [Revised: 05/03/2021] [Accepted: 05/06/2021] [Indexed: 12/23/2022] Open
Abstract
Objective To summarize how artificial intelligence (AI) is being applied in COVID-19 research and determine whether these AI applications integrated heterogeneous data from different sources for modeling. Materials and Methods We searched 2 major COVID-19 literature databases, the National Institutes of Health’s LitCovid and the World Health Organization’s COVID-19 database on March 9, 2021. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline, 2 reviewers independently reviewed all the articles in 2 rounds of screening. Results In the 794 studies included in the final qualitative analysis, we identified 7 key COVID-19 research areas in which AI was applied, including disease forecasting, medical imaging-based diagnosis and prognosis, early detection and prognosis (non-imaging), drug repurposing and early drug discovery, social media data analysis, genomic, transcriptomic, and proteomic data analysis, and other COVID-19 research topics. We also found that there was a lack of heterogeneous data integration in these AI applications. Discussion Risk factors relevant to COVID-19 outcomes exist in heterogeneous data sources, including electronic health records, surveillance systems, sociodemographic datasets, and many more. However, most AI applications in COVID-19 research adopted a single-sourced approach that could omit important risk factors and thus lead to biased algorithms. Integrating heterogeneous data for modeling will help realize the full potential of AI algorithms, improve precision, and reduce bias. Conclusion There is a lack of data integration in the AI applications in COVID-19 research and a need for a multilevel AI framework that supports the analysis of heterogeneous data from different sources.
Affiliation(s)
- Yi Guo
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
- Yahan Zhang
  - Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, University of Florida, Gainesville, Florida, USA
- Tianchen Lyu
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
- Mattia Prosperi
  - Department of Epidemiology, College of Public Health and Health Professions & College of Medicine, University of Florida, Gainesville, Florida, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, New York, USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
|
22
|
Fries JA, Steinberg E, Khattar S, Fleming SL, Posada J, Callahan A, Shah NH. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat Commun 2021; 12:2017. [PMID: 33795682 PMCID: PMC8016863 DOI: 10.1038/s41467-021-22328-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/07/2020] [Accepted: 02/26/2021] [Indexed: 02/07/2023] Open
Abstract
In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g., the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time-consuming, and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
Affiliation(s)
- Jason A Fries
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Ethan Steinberg
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Saelig Khattar
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Scott L Fleming
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Jose Posada
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Alison Callahan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Nigam H Shah
  - Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
|
23
|
Leslie D, Mazumder A, Peppin A, Wolters MK, Hagerty A. Does "AI" stand for augmenting inequality in the era of covid-19 healthcare? BMJ 2021; 372:n304. [PMID: 33722847 PMCID: PMC7958301 DOI: 10.1136/bmj.n304] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Indexed: 01/13/2023]
Affiliation(s)
- Alexa Hagerty
  - Centre for the Study of Existential Risk and Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK
|
24
|
Cossio M, Gilardino RE. Would the Use of Artificial Intelligence in COVID-19 Patient Management Add Value to the Healthcare System? Front Med (Lausanne) 2021; 8:619202. [PMID: 33585525 PMCID: PMC7873524 DOI: 10.3389/fmed.2021.619202] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/19/2020] [Accepted: 01/06/2021] [Indexed: 11/13/2022] Open
Affiliation(s)
- Manuel Cossio
  - Artificial Intelligence Master's Program, Faculty of Informatics, Catalonian Polytechnic University, Barcelona, Spain
|