Reference Citation Analysis

For: Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. J Am Med Inform Assoc 2021;28:1892-1899. [PMID: 34157094 PMCID: PMC8363782 DOI: 10.1093/jamia/ocab090] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 01/05/2021] [Revised: 04/05/2021] [Accepted: 05/03/2021] [Indexed: 11/13/2022] Open

Cited by Other Article(s)

Zhang Z, Jiang A. Interactive dual-stream contrastive learning for radiology report generation. J Biomed Inform 2024;157:104718. [PMID: 39209086 DOI: 10.1016/j.jbi.2024.104718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 08/08/2024] [Accepted: 08/25/2024] [Indexed: 09/04/2024]

Jani M, Alfattni G, Belousov M, Laidlaw L, Zhang Y, Cheng M, Webb K, Hamilton R, Kanter AS, Dixon WG, Nenadic G. Development and evaluation of a text analytics algorithm for automated application of national COVID-19 shielding criteria in rheumatology patients. Ann Rheum Dis 2024;83:1082-1091. [PMID: 38575324 PMCID: PMC11287580 DOI: 10.1136/ard-2024-225544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Accepted: 03/26/2024] [Indexed: 04/06/2024]

Abstract

INTRODUCTION

At the beginning of the COVID-19 pandemic, the UK's Scientific Committee issued extreme social distancing measures, termed 'shielding', aimed at a subpopulation deemed extremely clinically vulnerable to infection. National guidance for risk stratification was based on patients' age, comorbidities and immunosuppressive therapies, including biologics that are not captured in primary care records. This process required considerable clinician time to manually review outpatient letters. Our aim was to develop and evaluate an automated shielding algorithm by text-mining outpatient letter diagnoses and medications, reducing the need for future manual review.

METHODS

Rheumatology outpatient letters from a large UK foundation trust were retrieved. Free-text diagnoses were processed using Intelligent Medical Objects software (Concept Tagger), which used interface terminology for each condition mapped to Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) codes. We developed the Medication Concept Recognition tool (Named Entity Recognition) to retrieve each medication's type, dose, duration and status (active/past) at the time of the letter. Age, diagnosis and medication variables were then combined to calculate a shielding score based on the most recent letter. The algorithm's performance was evaluated using clinical review as the gold standard. The time taken to deploy the developed algorithm on a larger patient subset was measured.

RESULTS

In total, 5942 free-text diagnoses were extracted and mapped to SNOMED-CT, with 13 665 free-text medications (n=803 patients). The automated algorithm demonstrated a sensitivity of 80% (95% CI: 75%, 85%) and specificity of 92% (95% CI: 90%, 94%). Positive likelihood ratio was 10 (95% CI: 8, 14), negative likelihood ratio was 0.21 (95% CI: 0.16, 0.28) and F1 score was 0.81. Evaluation of mismatches revealed that the algorithm performed correctly against the gold standard in most cases. The developed algorithm was then deployed on records from an additional 15 865 patients, which took 18 hours for data extraction and 1 hour to deploy.
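The likelihood ratios reported above can be related back to the sensitivity and specificity point estimates. The following is a minimal sketch, assuming the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity; the function name is illustrative and is not taken from the study, and the confidence intervals are not reproduced.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) from sensitivity and specificity given as fractions.

    Illustrative helper (not from the cited study), using the standard
    definitions of the positive and negative likelihood ratios.
    """
    if not (0.0 <= sensitivity <= 1.0):
        raise ValueError("sensitivity must lie in [0, 1]")
    if not (0.0 < specificity < 1.0):
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)   # how much a positive result raises the odds
    lr_neg = (1.0 - sensitivity) / specificity   # how much a negative result lowers the odds
    return lr_pos, lr_neg


# Rounded point estimates from the abstract: sensitivity 80%, specificity 92%.
lr_pos, lr_neg = likelihood_ratios(0.80, 0.92)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

With the rounded inputs this gives an LR+ of about 10 and an LR- of about 0.22, close to the reported 10 and 0.21; the small difference in LR- presumably reflects the authors working from unrounded sensitivity and specificity.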

DISCUSSION

An automated algorithm for risk stratification has several advantages including reducing clinician time for manual review to allow more time for direct care, improving efficiency and increasing transparency in individual patient communication. It has the potential to be adapted for future public health initiatives that require prompt automated review of hospital outpatient letters.

Affiliation(s)

Meghna Jani: Centre for Epidemiology Versus Arthritis, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK; Department of Rheumatology, Northern Care Alliance NHS Foundation Trust Salford Care Organisation, Salford, UK; NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
Ghada Alfattni: Department of Computer Science, The University of Manchester, Manchester, UK; Department of Computer Science, Jamoum University College, Umm Al-Qura University, Makkah, Saudi Arabia
Maksim Belousov: Department of Computer Science, The University of Manchester, Manchester, UK
Lynn Laidlaw: Centre for Epidemiology Versus Arthritis, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK
Yuanyuan Zhang: Centre for Epidemiology Versus Arthritis, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK
Michael Cheng: Department of Business Intelligence, Northern Care Alliance NHS Foundation Trust, Salford Care Organisation, Salford, UK
Karim Webb: Department of Business Intelligence, Northern Care Alliance NHS Foundation Trust, Salford Care Organisation, Salford, UK
Robyn Hamilton: Department of Business Intelligence, Northern Care Alliance NHS Foundation Trust, Salford Care Organisation, Salford, UK
Andrew S Kanter: Department of Biomedical Informatics, Columbia University, New York, New York, USA
William G Dixon: Centre for Epidemiology Versus Arthritis, Centre for Musculoskeletal Research, The University of Manchester, Manchester, UK; Department of Rheumatology, Northern Care Alliance NHS Foundation Trust Salford Care Organisation, Salford, UK; NIHR Manchester Biomedical Research Centre, Manchester University NHS Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK
Goran Nenadic: Department of Computer Science, The University of Manchester, Manchester, UK

Patel MA, Daley M, Van Nynatten LR, Slessarev M, Cepinskas G, Fraser DD. A reduced proteomic signature in critically ill Covid-19 patients determined with plasma antibody micro-array and machine learning. Clin Proteomics 2024;21:33. [PMID: 38760690 PMCID: PMC11100131 DOI: 10.1186/s12014-024-09488-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Accepted: 05/06/2024] [Indexed: 05/19/2024] Open

Abstract

BACKGROUND

COVID-19 is a complex, multi-system disease with varying severity and symptoms. Identifying changes in critically ill COVID-19 patients' proteomes enables a better understanding of markers associated with susceptibility, symptoms, and treatment. We performed plasma antibody microarray and machine learning analyses to identify novel proteins of COVID-19.

METHODS

A case-control study comparing the concentration of 2000 plasma proteins in age- and sex-matched COVID-19 inpatients, non-COVID-19 sepsis controls, and healthy control subjects. Machine learning was used to identify a unique proteome signature in COVID-19 patients. Protein expression was correlated with clinically relevant variables and analyzed for temporal changes over hospitalization days 1, 3, 7, and 10. Expert-curated protein expression information was analyzed with Natural language processing (NLP) to determine organ- and cell-specific expression.

RESULTS

Machine learning identified a 28-protein model that accurately differentiated COVID-19 patients from ICU non-COVID-19 patients (accuracy = 0.89, AUC = 1.00, F1 = 0.89) and healthy controls (accuracy = 0.89, AUC = 1.00, F1 = 0.88). An optimal nine-protein model (PF4V1, NUCB1, CrkL, SerpinD1, Fen1, GATA-4, ProSAAS, PARK7, and NET1) maintained high classification ability. Specific proteins correlated with hemoglobin, coagulation factors, hypertension, and high-flow nasal cannula intervention (P < 0.01). Time-course analysis of the 28 leading proteins demonstrated no significant temporal changes within the COVID-19 cohort. NLP analysis identified multi-system expression of the key proteins, with the digestive and nervous systems being the leading systems.

CONCLUSIONS

The plasma proteome of critically ill COVID-19 patients was distinguishable from that of non-COVID-19 sepsis controls and healthy control subjects. The leading 28 proteins and their subset of 9 proteins yielded accurate classification models and are expressed in multiple organ systems. The identified COVID-19 proteomic signature helps elucidate COVID-19 pathophysiology and may guide future COVID-19 treatment development.

Böhringer D, Angelova P, Fuhrmann L, Zimmermann J, Schargus M, Eter N, Reinhard T. Automatic inference of ICD-10 codes from German ophthalmologic physicians' letters using natural language processing. Sci Rep 2024;14:9035. [PMID: 38641674 PMCID: PMC11031573 DOI: 10.1038/s41598-024-59926-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 04/16/2024] [Indexed: 04/21/2024] Open

Patel MA, Fraser DD, Daley M, Cepinskas G, Veraldi N, Grazioli S. The plasma proteome differentiates the multisystem inflammatory syndrome in children (MIS-C) from children with SARS-CoV-2 negative sepsis. Mol Med 2024;30:51. [PMID: 38632526 PMCID: PMC11022403 DOI: 10.1186/s10020-024-00806-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Accepted: 03/09/2024] [Indexed: 04/19/2024] Open

Abstract

BACKGROUND

The Multi-System Inflammatory Syndrome in Children (MIS-C) can develop several weeks after SARS-CoV-2 infection and requires a distinct treatment protocol. Distinguishing MIS-C from SARS-CoV-2 negative sepsis (SCNS) patients is important to quickly institute the correct therapies. We performed targeted proteomics and machine learning analysis to identify novel plasma proteins of MIS-C for early disease recognition.

METHODS

A case-control study comparing the expression of 2,870 unique blood proteins in MIS-C versus SCNS patients, measured using proximity extension assays. The 2,870 proteins were reduced in number with either feature selection alone or with a prior COMBAT-Seq batch effect adjustment. The leading proteins were correlated with demographic and clinical variables. Organ system and cell type expression patterns were analyzed with Natural Language Processing (NLP).

RESULTS

The cohorts were well-balanced for age and sex. Of the 2,870 unique blood proteins, 58 proteins were identified with feature selection (FDR-adjusted P < 0.005, P < 0.0001; accuracy = 0.96, AUC = 1.00, F1 = 0.95), and 15 proteins were identified with a COMBAT-Seq batch effect adjusted feature selection (FDR-adjusted P < 0.05, P < 0.0001; accuracy = 0.92, AUC = 1.00, F1 = 0.89). All of the latter 15 proteins were present in the former 58-protein model. Several proteins were correlated with illness severity scores, length of stay, and interventions (LTA4H, PTN, PPBP, and EGF; P < 0.001). NLP analysis highlighted the multi-system nature of MIS-C, with the 58-protein set expressed in all organ systems; the highest levels of expression were found in the digestive system. The cell types most involved included leukocytes not yet determined, lymphocytes, macrophages, and platelets.

CONCLUSIONS

The plasma proteome of MIS-C patients was distinct from that of SCNS. The key proteins demonstrated expression in all organ systems and most cell types. The unique proteomic signature identified in MIS-C patients could aid future diagnostic and therapeutic advancements, as well as predict hospital length of stays, interventions, and mortality risks.

Mateu-Sanz M, Fuenteslópez CV, Uribe-Gomez J, Haugen HJ, Pandit A, Ginebra MP, Hakimi O, Krallinger M, Samara A. Redefining biomaterial biocompatibility: challenges for artificial intelligence and text mining. Trends Biotechnol 2024;42:402-417. [PMID: 37858386 DOI: 10.1016/j.tibtech.2023.09.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 09/25/2023] [Accepted: 09/26/2023] [Indexed: 10/21/2023]

Gu S, Lee EW, Zhang W, Simpson RL, Hertzberg VS, Ho JC. Evaluating Natural Language Processing Packages for Predicting Hospital-Acquired Pressure Injuries From Clinical Notes. Comput Inform Nurs 2024;42:184-192. [PMID: 37607706 PMCID: PMC10884344 DOI: 10.1097/cin.0000000000001053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]

Wang R, Jayathunge K, Page R, Li H, Zhang JJ, Yang X. Hybrid architecture based intelligent diagnosis assistant for GP. BMC Med Inform Decis Mak 2024;24:15. [PMID: 38200559 PMCID: PMC10777579 DOI: 10.1186/s12911-023-02398-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2023] [Accepted: 12/07/2023] [Indexed: 01/12/2024] Open

Liao Y, Liu H, Spasić I. Fine-tuning coreference resolution for different styles of clinical narratives. J Biomed Inform 2024;149:104578. [PMID: 38122841 DOI: 10.1016/j.jbi.2023.104578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 11/22/2023] [Accepted: 12/12/2023] [Indexed: 12/23/2023]

Abstract

OBJECTIVE

Coreference resolution (CR) is a natural language processing (NLP) task that is concerned with finding all expressions within a single document that refer to the same entity. This makes it crucial in supporting downstream NLP tasks such as summarization, question answering and information extraction. Despite great progress in CR, our experiments have highlighted a substandard performance of the existing open-source CR tools in the clinical domain. We set out to explore some practical solutions to fine-tune their performance on clinical data.

METHODS

We first explored the possibility of automatically producing silver standards following the success of such an approach in other clinical NLP tasks. We designed an ensemble approach that leverages multiple models to automatically annotate co-referring mentions. Subsequently, we looked into other ways of incorporating human feedback to improve the performance of an existing neural network approach. We proposed a semi-automatic annotation process to facilitate the manual annotation process. We also compared the effectiveness of active learning relative to random sampling in an effort to further reduce the cost of manual annotation.

RESULTS

Our experiments demonstrated that the silver standard approach was ineffective in fine-tuning the CR models. Our results indicated that active learning should also be applied with caution. The semi-automatic annotation approach combined with continued training was found to be well suited for the rapid transfer of CR models under low-resource conditions. The ensemble approach demonstrated a potential to further improve accuracy by leveraging multiple fine-tuned models.

CONCLUSION

Overall, we have effectively transferred a general CR model to a clinical domain. Our findings based on extensive experimentation have been summarized into practical suggestions for rapid transferring of CR models across different styles of clinical narratives.

Ouis MY, Akhloufi MA. Deep learning for report generation on chest X-ray images. Comput Med Imaging Graph 2024;111:102320. [PMID: 38134726 DOI: 10.1016/j.compmedimag.2023.102320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 11/13/2023] [Accepted: 11/29/2023] [Indexed: 12/24/2023]

Fraile Navarro D, Ijaz K, Rezazadegan D, Rahimi-Ardabili H, Dras M, Coiera E, Berkovsky S. Clinical named entity recognition and relation extraction using natural language processing of medical free text: A systematic review. Int J Med Inform 2023;177:105122. [PMID: 37295138 DOI: 10.1016/j.ijmedinf.2023.105122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2022] [Revised: 04/14/2023] [Accepted: 06/03/2023] [Indexed: 06/12/2023]

Abstract

BACKGROUND

Natural Language Processing (NLP) applications have developed over the past years in various fields, including the application to clinical free text for named entity recognition and relation extraction. However, developments in the last few years have been so rapid that no current overview of them exists. Moreover, it is unclear how these models and tools have been translated into clinical practice. We aim to synthesize and review these developments.

METHODS

We reviewed literature from 2010 to date, searching PubMed, Scopus, the Association for Computational Linguistics (ACL), and Association for Computing Machinery (ACM) libraries for studies of NLP systems performing general-purpose (i.e., not disease- or treatment-specific) information extraction and relation extraction tasks in unstructured clinical text (e.g., discharge summaries).

RESULTS

The review included 94 studies, 30 of which were published in the last three years. Machine learning methods were used in 68 studies, rule-based in 5 studies, and both in 22 studies. 63 studies focused on Named Entity Recognition, 13 on Relation Extraction and 18 performed both. The most frequently extracted entities were "problem", "test" and "treatment". 72 studies used public datasets and 22 studies used proprietary datasets alone. Only 14 studies clearly defined a clinical or information task to be addressed by the system and just three studies reported its use outside the experimental setting. Only 7 studies shared a pre-trained model and only 8 an available software tool.

DISCUSSION

Machine learning-based methods have dominated the NLP field on information extraction tasks. More recently, Transformer-based language models are taking the lead and showing the strongest performance. However, these developments are mostly based on a few datasets and generic annotations, with very few real-world use cases. This raises questions about the generalizability of findings and their translation into practice, and highlights the need for robust clinical evaluation.

Raza S, Schwartz B, Lakamana S, Ge Y, Sarker A. A framework for multi-faceted content analysis of social media chatter regarding non-medical use of prescription medications. BMC DIGITAL HEALTH 2023;1:29. [PMID: 37680768 PMCID: PMC10483682 DOI: 10.1186/s44247-023-00029-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Accepted: 07/17/2023] [Indexed: 09/09/2023]

Abstract

Background

Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications.

Methods

We collected Twitter data for four medications (fentanyl and morphine [opioids], alprazolam [benzodiazepine], and Adderall® [stimulant]) and identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication.

Results

The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities.

Conclusions

NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.

Yoon W, Yi S, Jackson R, Kim H, Kim S, Kang J. Biomedical relation extraction with knowledge base-refined weak supervision. Database (Oxford) 2023;2023:baad054. [PMID: 37551911 PMCID: PMC10407973 DOI: 10.1093/database/baad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 05/13/2023] [Accepted: 07/04/2023] [Indexed: 08/09/2023]

Adams G, Nguyen BH, Smith J, Xia Y, Xie S, Ostropolets A, Deb B, Chen YJ, Naumann T, Elhadad N. What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization. PROCEEDINGS OF THE CONFERENCE. ASSOCIATION FOR COMPUTATIONAL LINGUISTICS. MEETING 2023;2023:10520-10542. [PMID: 38689884 PMCID: PMC11059202 DOI: 10.18653/v1/2023.acl-long.587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2024]

Gao Y, Dligach D, Miller T, Churpek MM, Uzuner O, Afshar M. Progress Note Understanding - Assessment and Plan Reasoning: Overview of the 2022 N2C2 Track 3 shared task. J Biomed Inform 2023;142:104346. [PMID: 37061012 PMCID: PMC11178099 DOI: 10.1016/j.jbi.2023.104346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Revised: 03/16/2023] [Accepted: 03/21/2023] [Indexed: 04/17/2023]

Alasmari A, Kudryashov L, Yadav S, Lee H, Demner-Fushman D. CHQ-SocioEmo: Identifying Social and Emotional Support Needs in Consumer-Health Questions. Sci Data 2023;10:329. [PMID: 37244917 DOI: 10.1038/s41597-023-02203-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Accepted: 05/02/2023] [Indexed: 05/29/2023] Open

Launer-Wachs S, Taub-Tabib H, Tokarev Madem J, Bar-Natan O, Goldberg Y, Shamay Y. From Centralized to Ad-Hoc Knowledge Base Construction for Hypotheses Generation. J Biomed Inform 2023;142:104383. [PMID: 37196989 DOI: 10.1016/j.jbi.2023.104383] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2022] [Revised: 04/27/2023] [Accepted: 05/03/2023] [Indexed: 05/19/2023]

Abstract

OBJECTIVE

To demonstrate and develop an approach enabling individual researchers or small teams to create their own ad-hoc, lightweight knowledge bases tailored for specialized scientific interests, using text-mining over scientific literature, and demonstrate the effectiveness of these knowledge bases in hypothesis generation and literature-based discovery (LBD).

METHODS

We propose a lightweight process using an extractive search framework to create ad-hoc knowledge bases, which require minimal training and no background in bio-curation or computer science. These knowledge bases are particularly effective for LBD and hypothesis generation using Swanson's ABC method. The personalized nature of the knowledge bases allows for a somewhat higher level of noise than "public facing" ones, as researchers are expected to have prior domain experience to separate signal from noise. Fact verification is shifted from exhaustive verification of the knowledge base to post-hoc verification of specific entries of interest, allowing researchers to assess the correctness of relevant knowledge base entries by considering the paragraphs in which the facts were introduced.
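The ABC pattern mentioned above can be illustrated with a small sketch. It assumes a knowledge base reduced to undirected term pairs and uses toy data modelled on Swanson's well-known fish oil and Raynaud's syndrome example; the function name, the data structure, and the relations themselves are illustrative assumptions and are not drawn from the authors' platform.

```python
from collections import defaultdict


def abc_hypotheses(relations, a_term):
    """Suggest indirect A-C links via shared intermediate B terms.

    relations: iterable of (term, term) pairs treated as undirected links.
    Returns {c_term: [b_terms]} for every C linked to A only through some B.
    Illustrative sketch of the ABC pattern, not the authors' implementation.
    """
    neighbors = defaultdict(set)
    for x, y in relations:
        neighbors[x].add(y)
        neighbors[y].add(x)

    direct = neighbors[a_term]            # B terms: directly linked to A
    hypotheses = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Keep C only if it is not A itself and has no direct A-C link.
            if c != a_term and c not in direct:
                hypotheses[c].add(b)
    return {c: sorted(bs) for c, bs in hypotheses.items()}


# Toy relations (hypothetical entries, for illustration only).
toy_kb = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("fish oil", "triglycerides"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
]
print(abc_hypotheses(toy_kb, "fish oil"))
```

Each suggested C term comes with the B terms that support it, which mirrors the post-hoc verification described above: a researcher can inspect the specific intermediate facts rather than validating the whole knowledge base.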

RESULTS

We demonstrate the methodology by constructing several knowledge bases of different kinds: three knowledge bases that support lab-internal hypothesis generation: Drug Delivery to Ovarian Tumors (DDOT); Tissue Engineering and Regeneration; Challenges in Cancer Research; and an additional comprehensive, accurate knowledge base designated as a public resource for the wider community on the topic of Cell Specific Drug Delivery (CSDD). In each case, we show the design and construction process, along with relevant visualizations for data exploration, and hypothesis generation. For CSDD and DDOT we also show meta-analysis, human evaluation, and in vitro experimental evaluation.

CONCLUSION

Our approach enables researchers to create personalized, lightweight knowledge bases for specialized scientific interests, effectively facilitating hypothesis generation and literature-based discovery (LBD). By shifting fact verification efforts to post-hoc verification of specific entries, researchers can focus on exploring and generating hypotheses based on their expertise. The constructed knowledge bases demonstrate the versatility and adaptability of our approach to varied research interests. The web-based platform, available at https://spike-kbc.apps.allenai.org, provides researchers with a valuable tool for rapid construction of knowledge bases tailored to their needs.

Luo L, Wei CH, Lai PT, Leaman R, Chen Q, Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023;39:btad310. [PMID: 37171899 PMCID: PMC10212279 DOI: 10.1093/bioinformatics/btad310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 04/12/2023] [Accepted: 05/11/2023] [Indexed: 05/14/2023] Open

Tinn R, Cheng H, Gu Y, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Fine-tuning large neural language models for biomedical natural language processing. PATTERNS (NEW YORK, N.Y.) 2023;4:100729. [PMID: 37123444 PMCID: PMC10140607 DOI: 10.1016/j.patter.2023.100729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 12/12/2022] [Accepted: 03/17/2023] [Indexed: 05/02/2023]

Guo L, Wang W, Wu YJ. What Do Scholars Propose for Future COVID-19 Research in Academic Publications? A Topic Analysis Based on Autoencoder. SAGE OPEN 2023;13:21582440231182060. [PMID: 37362769 PMCID: PMC10280124 DOI: 10.1177/21582440231182060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/28/2023]

Leaman R, Islamaj R, Adams V, Alliheedi MA, Almeida JR, Antunes R, Bevan R, Chang YC, Erdengasileng A, Hodgskiss M, Ida R, Kim H, Li K, Mercer RE, Mertová L, Mobasher G, Shin HC, Sung M, Tsujimura T, Yeh WC, Lu Z. Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII. Database (Oxford) 2023;2023:7071696. [PMID: 36882099 PMCID: PMC9991492 DOI: 10.1093/database/baad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 01/06/2023] [Accepted: 02/15/2023] [Indexed: 03/09/2023]

Abstract

The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in PubMed, and, as highlighted during the coronavirus disease 2019 pandemic, their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We, therefore, organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both span [i.e. named entity recognition (NER)] and normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall).
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
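The F-scores quoted in this abstract can be checked against the stated precision and recall values. The sketch below assumes the balanced F-score (F1), the harmonic mean of precision and recall; the abstract does not state which F-measure variant was used, and the function name and task labels are illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) as quoted in the abstract.
reported = {
    "chemical identification, strict NER": (0.8759, 0.8587, 0.8672),
    "chemical identification, strict normalization": (0.8621, 0.7702, 0.8136),
    "chemical indexing": (0.7417, 0.5141, 0.6073),
}

for task, (p, r, f_reported) in reported.items():
    f_computed = f_score(p, r)
    print(f"{task}: computed {f_computed:.4f}, reported {f_reported:.4f}")
```

Under the F1 assumption, all three computed values agree with the reported figures to four decimal places.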


Affiliation(s)

Robert Leaman
Rezarta Islamaj
Virginia Adams NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
Mohammed A Alliheedi Department of Computer Science, Al Baha University, 4781 King Fahd Rd, Al Aqiq 65779, Saudi Arabia
João Rafael Almeida Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal Department of Information and Communications Technologies, University of A Coruña, Camiño do Lagar de Castro, A Coruña 15008, Spain
Rui Antunes Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
Robert Bevan Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
Yung-Chun Chang Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Da’an District, Taipei City, Taipei 106, Taiwan
Arslan Erdengasileng Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
Matthew Hodgskiss Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
Ryuki Ida Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
Hyunjae Kim Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
Keqiao Li Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
Robert E Mercer Department of Computer Science, The University of Western Ontario, Room 355, Middlesex College, Ontario, London N6A 5B7, Canada
Lukrécia Mertová Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany
Ghadeer Mobasher Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany Institute of Computer Science, Heidelberg University, Im Neuenheimer Feld 205, Heidelberg 69120, Germany
Hoo-Chang Shin NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
Mujeen Sung Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
Tomoki Tsujimura Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
Wen-Chao Yeh Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
Zhiyong Lu *Corresponding author: Tel: +1-301-594-7089; Fax: +1-301-480-2290;


Zhang Y, Grant BMM, Hope AJ, Hung RJ, Warkentin MT, Lam ACL, Aggawal R, Xu M, Shepherd FA, Tsao MS, Xu W, Pakkal M, Liu G, McInnis MC. Using Recurrent Neural Networks to Extract High-Quality Information From Lung Cancer Screening Computerized Tomography Reports for Inter-Radiologist Audit and Feedback Quality Improvement. JCO Clin Cancer Inform 2023;7:e2200153. [PMID: 36930839 DOI: 10.1200/cci.22.00153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/19/2023] Open

Abstract

PURPOSE

Lung cancer screening programs generate a high volume of low-dose computed tomography (LDCT) reports that contain valuable information, typically in a free-text format. High-performance named-entity recognition (NER) models can extract relevant information from these reports automatically for inter-radiologist quality control.

METHODS

Using LDCT report data from a longitudinal lung cancer screening program (8,305 reports; 3,124 participants; 2006-2019), we trained a rule-based model and two bidirectional long short-term memory (Bi-LSTM) NER neural network models to detect clinically relevant information from LDCT reports. Model performance was tested using F1 scores and compared with a published open-source radiology NER model (Stanza) in an independent evaluation set of 150 reports. The top performing model was applied to a data set of 6,948 reports for an inter-radiologist quality control assessment.

RESULTS

The best performing model, a Bi-LSTM NER recurrent neural network model, had an overall F1 score of 0.950, which outperformed Stanza (F1 score = 0.872) and a rule-based NER model (F1 score = 0.809). Recall (sensitivity) for the best Bi-LSTM model ranged from 0.916 to 0.991 for different entity types; precision (positive predictive value) ranged from 0.892 to 0.997. Test performance remained stable across time periods. There was an average of a 2.86-fold difference in the number of identified entities between the most and the least detailed radiologists.

CONCLUSION

We built an open-source Bi-LSTM NER model that outperformed other open-source or rule-based radiology NER models. This model can efficiently extract clinically relevant information from lung cancer screening computerized tomography reports with high accuracy, enabling efficient audit and feedback to improve quality of patient care.


Affiliation(s)

Yucheng Zhang Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
Benjamin M M Grant Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada
Andrew J Hope Radiation Medicine Program, Princess Margaret Cancer Centre, and Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
Rayjean J Hung Prosserman Centre for Population Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health Systems, Toronto, ON, Canada Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
Matthew T Warkentin Prosserman Centre for Population Health Research, Lunenfeld-Tanenbaum Research Institute, Sinai Health Systems, Toronto, ON, Canada Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
Andrew C L Lam Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada
Reenika Aggawal Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada
Maria Xu Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada
Frances A Shepherd Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada
Ming-Sound Tsao Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Laboratory Medicine and Pathology, University Health Network, Toronto, ON, Canada
Wei Xu Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada Biostatistics, Princess Margaret Cancer Centre, Toronto, ON, Canada Computational Biology and Medicine Program, Princess Margaret Cancer Centre, Toronto, ON, Canada
Mini Pakkal Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Division of Cardiothoracic Imaging, Joint Department of Medical Imaging, Toronto General Hospital, Toronto, ON, Canada
Geoffrey Liu Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Medical Oncology and Hematology, Princess Margaret Cancer Centre, Toronto, ON, Canada Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada Biostatistics, Princess Margaret Cancer Centre, Toronto, ON, Canada Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
Micheal C McInnis Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada Division of Cardiothoracic Imaging, Joint Department of Medical Imaging, Toronto General Hospital, Toronto, ON, Canada


Patel MA, Knauer MJ, Nicholson M, Daley M, Van Nynatten LR, Cepinskas G, Fraser DD. Organ and cell-specific biomarkers of Long-COVID identified with targeted proteomics and machine learning. Mol Med 2023;29:26. [PMID: 36809921 PMCID: PMC9942653 DOI: 10.1186/s10020-023-00610-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2022] [Accepted: 01/13/2023] [Indexed: 02/24/2023] Open

Fan JW, Wang W, Huang M, Liu H, Hooten WM. Retrospective content analysis of consumer product reviews related to chronic pain. Front Digit Health 2023;5:958338. [PMID: 37168528 PMCID: PMC10165495 DOI: 10.3389/fdgth.2023.958338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Accepted: 03/09/2023] [Indexed: 05/13/2023] Open

Tanwar A, Zhang J, Ive J, Gupta V, Guo Y. Phenotyping in clinical text with unsupervised numerical reasoning for patient stratification. Exp Biol Med (Maywood) 2022;247:2038-2052. [PMID: 36217914 PMCID: PMC9791305 DOI: 10.1177/15353702221118092] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open

Kaur N, Mittal A. CheXPrune: sparse chest X-ray report generation model using multi-attention and one-shot global pruning. JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING 2022;14:7485-7497. [PMID: 36338854 PMCID: PMC9628486 DOI: 10.1007/s12652-022-04454-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/18/2021] [Accepted: 10/05/2022] [Indexed: 05/25/2023]

Kühnel L, Fluck J. We are not ready yet: limitations of state-of-the-art disease named entity recognizers. J Biomed Semantics 2022;13:26. [PMID: 36303237 DOI: 10.1186/s13326-022-00280-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2021] [Accepted: 10/12/2022] [Indexed: 11/10/2022] Open

Tang A, Deléger L, Bossy R, Zweigenbaum P, Nédellec C. Do syntactic trees enhance Bidirectional Encoder Representations from Transformers (BERT) models for chemical–drug relation extraction? Database (Oxford) 2022;2022:6675625. [PMID: 36006843 PMCID: PMC9408061 DOI: 10.1093/database/baac070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2022] [Revised: 07/14/2022] [Accepted: 08/12/2022] [Indexed: 11/14/2022]

Frei J, Soto-Rey I, Kramer F. DrNote: An open medical annotation service. PLOS DIGITAL HEALTH 2022;1:e0000086. [PMID: 36812581 PMCID: PMC9931362 DOI: 10.1371/journal.pdig.0000086] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/21/2021] [Accepted: 07/12/2022] [Indexed: 11/19/2022]

Luo L, Lai PT, Wei CH, Lu Z. A sequence labeling framework for extracting drug-protein relations from biomedical literature. Database (Oxford) 2022;2022:baac058. [PMID: 35856889 PMCID: PMC9297941 DOI: 10.1093/database/baac058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2022] [Revised: 05/24/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]

Almeida T, Antunes R, F. Silva J, Almeida JR, Matos S. Chemical identification and indexing in PubMed full-text articles using deep learning and heuristics. Database (Oxford) 2022;2022:6625810. [PMID: 35776534 PMCID: PMC9248917 DOI: 10.1093/database/baac047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 05/13/2022] [Accepted: 06/06/2022] [Indexed: 11/14/2022]

Abstract

The identification of chemicals in articles has attracted considerable interest in the biomedical scientific community, given its importance in drug development research. Most previous research has focused on PubMed abstracts, and further investigation using full-text documents is required because these contain additional valuable information that must be explored. The manual expert task of indexing Medical Subject Headings (MeSH) terms to these articles later helps researchers find the most relevant publications for their ongoing work. The BioCreative VII NLM-Chem track fostered the development of systems for chemical identification and indexing in PubMed full-text articles. Chemical identification consisted of identifying the chemical mentions and linking these to unique MeSH identifiers. This manuscript describes our participation system and the post-challenge improvements we made. We propose a three-stage pipeline that individually performs chemical mention detection, entity normalization and indexing. Regarding chemical identification, we adopted a deep-learning solution that utilizes the PubMedBERT contextualized embeddings followed by a multilayer perceptron and a conditional random field tagging layer. For the normalization approach, we use a sieve-based dictionary filtering followed by a deep-learning similarity search strategy. Finally, for the indexing we developed rules for identifying the more relevant MeSH codes for each article. During the challenge, our system obtained the best official results in the normalization and indexing tasks despite the lower performance in the chemical mention recognition task. In a post-contest phase we boosted our results by improving our named entity recognition model with additional techniques. The final system achieved 0.8731, 0.8275 and 0.4849 in the chemical identification, normalization and indexing tasks, respectively.
The code to reproduce our experiments and run the pipeline is publicly available. Database URL https://github.com/bioinformatics-ua/biocreativeVII_track2

Zhang Y, Wang C, Soukaseum M, Vlachos DG, Fang H. Unleashing the Power of Knowledge Extraction from Scientific Literature in Catalysis. J Chem Inf Model 2022;62:3316-3330. [PMID: 35772028 DOI: 10.1021/acs.jcim.2c00359] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]

Yang S, Wu X, Ge S, Zhou SK, Xiao L. Knowledge matters: Chest radiology report generation with general and specific knowledge. Med Image Anal 2022;80:102510. [PMID: 35716558 DOI: 10.1016/j.media.2022.102510] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2021] [Revised: 06/01/2022] [Accepted: 06/06/2022] [Indexed: 10/18/2022]

Chen Z, Peng B, Ioannidis VN, Li M, Karypis G, Ning X. A knowledge graph of clinical trials ([Formula: see text]). Sci Rep 2022;12:4724. [PMID: 35304504 PMCID: PMC8933553 DOI: 10.1038/s41598-022-08454-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2021] [Accepted: 02/28/2022] [Indexed: 02/05/2023] Open

Davoudi A, Lee NS, Luong T, Delaney T, Asch E, Chaiyachati K, Mowery D. Identifying Medication-related Intents from a Bidirectional Text Messaging Platform for Hypertension Management: A Pilot Study using a Unsupervised Learning Approach (Preprint). J Med Internet Res 2022;24:e36151. [PMID: 35767327 PMCID: PMC9280462 DOI: 10.2196/36151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Revised: 04/01/2022] [Accepted: 05/17/2022] [Indexed: 12/02/2022] Open

Abstract

Background

Free-text communication between patients and providers plays an increasing role in chronic disease management, through platforms varying from traditional health care portals to novel mobile messaging apps. These text data are rich resources for clinical purposes, but their sheer volume renders them difficult to manage. Even automated approaches, such as natural language processing, require labor-intensive manual classification for developing training data sets. Automated approaches to organizing free-text data are necessary to facilitate use of free-text communication for clinical care.

Objective

The aim of this study was to apply unsupervised learning approaches to (1) understand the types of topics discussed and (2) learn medication-related intents from messages sent between patients and providers through a bidirectional text messaging system for managing participant blood pressure (BP).

Methods

This study was a secondary analysis of deidentified messages from a remote, mobile, text-based employee hypertension management program at an academic institution. We trained a latent Dirichlet allocation (LDA) model for each message type (ie, inbound patient messages and outbound provider messages) and identified the distribution of major topics and significant topics (probability >.20) across message types. Next, we annotated all medication-related messages with a single medication intent. Then, we trained a second medication-specific LDA (medLDA) model to assess how well the unsupervised method could identify more fine-grained medication intents. We encoded each medication message with n-grams (n=1-3 words) using spaCy, clinical named entities using Stanza, and medication categories using MedEx; we then applied chi-square feature selection to learn the most informative features associated with each medication intent.
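The n-gram encoding step mentioned in these Methods (n=1-3 words) can be illustrated with a minimal pure-Python sketch. It is not the study's pipeline: the study used spaCy, whereas here a plain whitespace split stands in for tokenization, and the message text and the helper name `ngrams` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    feats = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


# Hypothetical medication message; whitespace split is a stand-in for spaCy
message = "can i get a refill"
tokens = message.lower().split()

# Bag-of-n-grams feature counts for this one message
features = Counter(ngrams(tokens))
print(len(features), "distinct n-gram features")
```

Counts of this kind are the sort of sparse lexical features to which a chi-square feature selection step could then be applied.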

Results

In total, 253 participants and 5 providers engaged in the program, generating 12,131 total messages: 46.90% (n=5689) patient messages and 53.10% (n=6442) provider messages. Most patient messages corresponded to BP reporting, BP encouragement, and appointment scheduling; most provider messages corresponded to BP reporting, medication adherence, and confirmatory statements. Most patient and provider messages contained 1 topic and few contained more than 3 topics identified using LDA. In total, 534 medication messages were annotated with a single medication intent. Of these, 282 (52.8%) were patient medication messages: most referred to the medication request intent (n=134, 47.5%). Most of the 252 (47.2%) provider medication messages referred to the medication question intent (n=173, 68.7%). Although the medLDA model could identify a majority intent within each topic, it could not distinguish medication intents with low prevalence within patient or provider messages. Richer feature engineering identified informative lexical-semantic patterns associated with each medication intent class.

Conclusions

LDA can be an effective method for generating subgroups of messages with similar term usage and facilitating the review of topics to inform annotations. However, few training cases and shared vocabulary between intents preclude the use of LDA for fully automated, deep, medication intent classification.

International Registered Report Identifier (IRRID)

RR2-10.1101/2021.12.23.21268061


A Deep Learning Based Approach to Automate Clinical Coding of Electronic Health Records. BIG DATA ANALYTICS 2022. [DOI: 10.1007/978-3-031-24094-2_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open

Lossio-Ventura JA, Sun R, Boussard S, Hernandez-Boussard T. Clinical concept recognition: Evaluation of existing systems on EHRs. Front Artif Intell 2022;5:1051724. [PMID: 36714202 PMCID: PMC9880223 DOI: 10.3389/frai.2022.1051724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Accepted: 12/15/2022] [Indexed: 01/15/2023] Open

Abstract

Objective

The adoption of electronic health records (EHRs) has produced enormous amounts of data, creating research opportunities in clinical data sciences. Several concept recognition systems have been developed to facilitate clinical information extraction from these data. While studies exist that compare the performance of many concept recognition systems, they are typically developed internally and may be biased due to different internal implementations, parameters used, and limited number of systems included in the evaluations. The goal of this research is to evaluate the performance of existing systems to retrieve relevant clinical concepts from EHRs.

Methods

We investigated six concept recognition systems, including CLAMP, cTAKES, MetaMap, NCBO Annotator, QuickUMLS, and ScispaCy. Clinical concepts extracted included procedures, disorders, medications, and anatomical location. The system performance was evaluated on two datasets: the 2010 i2b2 and the MIMIC-III. Additionally, we assessed the performance of these systems in five challenging situations, including negation, severity, abbreviation, ambiguity, and misspelling.

Results

For clinical concept extraction, CLAMP achieved the best performance on exact and inexact matching, with an F-score of 0.70 and 0.94, respectively, on i2b2; and 0.39 and 0.50, respectively, on MIMIC-III. Across the five challenging situations, ScispaCy excelled in extracting abbreviation information (F-score: 0.86), followed by NCBO Annotator (F-score: 0.79). CLAMP performed best in extracting severity terms (F-score: 0.73), followed by NCBO Annotator (F-score: 0.68). CLAMP also outperformed the other systems in extracting negated concepts (F-score: 0.63).
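The gap between exact and inexact matching scores reported here can be illustrated with a small sketch. The abstract does not define the two criteria, so the code assumes a common convention: an exact match requires identical span boundaries and type, while an inexact match requires only an overlapping span of the same type. The spans and the helper names `overlaps` and `prf` are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share a type and overlap."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def prf(gold, pred, inexact=False):
    """Precision, recall and F1 under exact or overlap-based matching."""
    if inexact:
        tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        tp_pred = tp_gold = len(set(gold) & set(pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical character-offset annotations: (start, end, concept type)
gold = [(0, 9, "disorder"), (15, 24, "medication"), (30, 38, "procedure")]
pred = [(0, 9, "disorder"), (15, 20, "medication"), (40, 45, "disorder")]

print("exact:  ", prf(gold, pred))
print("inexact:", prf(gold, pred, inexact=True))
```

In this toy case the truncated medication span counts only under inexact matching, which is one way a system's inexact F-score can sit well above its exact F-score.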

Conclusions

Several concept recognition systems exist to extract clinical information from unstructured data. This study provides an external evaluation by end-users of six commonly used systems across different extraction tasks. Our findings suggest that CLAMP provides the most comprehensive set of annotations for clinical concept extraction tasks and associated challenges. Comparing standard extraction tasks across systems provides guidance to other clinical researchers when selecting a concept recognition system relevant to their clinical information extraction task.


Percha B. Modern Clinical Text Mining: A Guide and Review. Annu Rev Biomed Data Sci 2021;4:165-187. [PMID: 34465177 DOI: 10.1146/annurev-biodatasci-030421-030931] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]