1
Tait K, Cronin J, Wiper O, Wallis J, Davies J, Dürichen R. ArcTEX-a novel clinical data enrichment pipeline to support real-world evidence oncology studies. Front Digit Health 2025; 7:1561358. [PMID: 40416094 PMCID: PMC12098606 DOI: 10.3389/fdgth.2025.1561358] [Received: 01/15/2025] [Accepted: 04/23/2025] [Indexed: 05/27/2025]
Abstract
Data stored within electronic health records (EHRs) offer a valuable source of information for real-world evidence (RWE) studies in oncology. However, many key clinical features are only available within unstructured notes. We present ArcTEX, a novel data enrichment pipeline developed to extract oncological features from NHS unstructured clinical notes with high accuracy, even in resource-constrained environments where availability of GPUs might be limited. By design, the predicted outcomes of ArcTEX are free of patient-identifiable information, making this pipeline ideally suited for use in Trust environments. We compare our pipeline to existing discriminative and generative models, demonstrating its superiority over approaches such as Llama3/3.1/3.2 and other BERT-based models, with a mean accuracy of 98.67% for several essential clinical features in endometrial and breast cancer. Additionally, we show that as few as 50 annotated training examples are needed to adapt the model to a different oncology area, such as lung cancer, with a different set of priority clinical features, achieving a comparable mean accuracy of 95%.
Affiliation(s)
- Jim Davies
- Department of Computer Science, University of Oxford, Oxford, United Kingdom
2
韦 莉, 赵 德, 秦 璐, 刘 洋, 沈 宇, 叶 昌. [Medical text classification model integrating medical entity label semantics]. SHENG WU YI XUE GONG CHENG XUE ZA ZHI = JOURNAL OF BIOMEDICAL ENGINEERING = SHENGWU YIXUE GONGCHENGXUE ZAZHI 2025; 42:326-333. [PMID: 40288975 PMCID: PMC12035632 DOI: 10.7507/1001-5515.202408001] [Received: 08/01/2024] [Revised: 01/16/2025] [Indexed: 04/29/2025]
Abstract
Automatic classification of medical questions is of great significance in improving the quality and efficiency of online medical services, and belongs to the task of intent recognition. Joint entity recognition and intent recognition perform better than single-task models. Currently, most publicly available medical text intent recognition datasets lack entity annotation, and manual annotation of these entities requires a lot of time and manpower. To solve this problem, this paper proposes a medical text classification model, bidirectional encoder representation based on transformer-recurrent convolutional neural network-entity-label-semantics (BRELS), which integrates medical entity label semantics. The model first uses an adaptive fusion mechanism to absorb prior knowledge of medical entity labels, achieving local feature enhancement. Then, in global feature extraction, a lightweight recurrent convolutional neural network (LRCNN) is used to suppress parameter growth while preserving the original semantics of the text. Ablation and comparison experiments are conducted on three public medical text intent recognition datasets to validate the performance of the model. The F1 score reaches 87.34%, 81.71%, and 77.74% on the three datasets, respectively. The results show that the BRELS model can effectively identify and understand medical terminology, thereby effectively identifying users' intentions, which can improve the quality and efficiency of online medical services.
Affiliation(s)
- 莉 韦
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- 德春 赵
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- 璐 秦
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- 洋华子 刘
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- 宇辰 沈
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
- 昌荣 叶
- School of Life Health Information Science and Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, P. R. China
3
Li X, Shu Q, Kong C, Wang J, Li G, Fang X, Lou X, Yu G. An Intelligent System for Classifying Patient Complaints Using Machine Learning and Natural Language Processing: Development and Validation Study. J Med Internet Res 2025; 27:e55721. [PMID: 39778195 PMCID: PMC11754990 DOI: 10.2196/55721] [Received: 12/21/2023] [Revised: 04/28/2024] [Accepted: 11/04/2024] [Indexed: 01/11/2025]
Abstract
BACKGROUND Accurate classification of patient complaints is crucial for enhancing patient satisfaction management in health care settings. Traditional manual methods for categorizing complaints often lack efficiency and precision. Thus, there is a growing demand for advanced and automated approaches to streamline the classification process. OBJECTIVE This study aimed to develop and validate an intelligent system for automatically classifying patient complaints using machine learning (ML) and natural language processing (NLP) techniques. METHODS An ML-based NLP technology was proposed to extract frequently occurring dissatisfactory words related to departments, staff, and key treatment procedures. A dataset containing 1465 complaint records from 2019 to 2023 was used for training and validation, with an additional 376 complaints from Hangzhou Cancer Hospital serving as an external test set. Complaints were categorized into 4 types: communication problems, diagnosis and treatment issues, management problems, and sense of responsibility concerns. The imbalanced data were balanced using the Synthetic Minority Oversampling Technique (SMOTE) algorithm to ensure equal representation across all categories. A total of 3 ML algorithms (Multifactor Logistic Regression, Multinomial Naive Bayes, and Support Vector Machines [SVM]) were used for model training and validation. The best-performing model was tested using 5-fold cross-validation on external data. RESULTS The original dataset consisted of 719, 376, 260, and 86 records for communication problems, diagnosis and treatment issues, management problems, and sense of responsibility concerns, respectively. The Multifactor Logistic Regression and SVM models achieved weighted average accuracies of 0.89 and 0.93 in the training set, and 0.83 and 0.87 in the internal test set, respectively.
N-gram-level term frequency-inverse document frequency (TF-IDF) did not significantly improve classification performance: with n=2, precision, recall, and F1-score rose by only about 1 percentage point, from 0.91 to 0.92. The SVM algorithm performed best in prediction, achieving an average accuracy of 0.91 on the external test set with a 95% CI of 0.87-0.97. CONCLUSIONS The NLP-driven SVM algorithm demonstrates effective classification performance in automatically categorizing patient complaint texts. It showed superior performance in both internal and external test sets for communication and management problems. However, caution is advised when using it for classifying sense of responsibility complaints. This approach holds promise for implementation in medical institutions with high complaint volumes and limited resources for addressing patient feedback.
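The abstract above balances its four complaint categories with SMOTE before training. As a rough illustration of the core idea only (not the authors' implementation), the pure-Python sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote`, the neighbour count `k`, the seed, and the toy feature vectors are all assumptions made for this example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D minority class (hypothetical values, not data from the study).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

In practice one would oversample every under-represented category until it matches the largest one, and only on the training split, so that synthetic points never leak into the test data.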
Affiliation(s)
- Xiadong Li
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hangzhou, China
- Qiang Shu
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hangzhou, China
- Canhong Kong
- Patient Service Surveillance Office, Medical Information Department, Hangzhou Red Cross Hospital, Hangzhou, China
- Jinhu Wang
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hangzhou, China
- Gang Li
- Department of Radiation Oncology, Zhejiang Xiaoshan Hospital, Hangzhou Normal University, Hangzhou, China
- Xin Fang
- Hospital Management Office, Hangzhou Cancer Hospital, Hangzhou, China
- Xiaomin Lou
- Patient Service Surveillance Office, Hangzhou Red Cross Hospital, Hangzhou, China
- Gang Yu
- Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center For Child Health, Hangzhou, China
4
Lester RT, Manson M, Semakula M, Jang H, Mugabo H, Magzari A, Blackmer JM, Fattah F, Niyonsenga SP, Rwagasore E, Ruranga C, Remera E, Ngabonziza JCS, Carenini G, Nsanzimana S. Natural language processing to evaluate texting conversations between patients and healthcare providers during COVID-19 Home-Based Care in Rwanda at scale. PLOS DIGITAL HEALTH 2025; 4:e0000625. [PMID: 39813181 PMCID: PMC11734906 DOI: 10.1371/journal.pdig.0000625] [Received: 08/27/2024] [Accepted: 11/19/2024] [Indexed: 01/18/2025]
Abstract
Community isolation of patients with communicable infectious diseases limits spread of pathogens, but our understanding of isolated patients' needs and challenges is incomplete. Rwanda deployed a digital health service nationally to assist public health clinicians to remotely monitor and support SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess the texting patterns and communicated topics to better understand patient experiences. We extracted data on all COVID-19 cases and exposed contacts who were enrolled in the WelTel text messaging program between March 18, 2020, and March 31, 2022, and linked demographic and clinical data from the national COVID-19 registry. A sample of the text conversation corpus was English-translated and labeled with topics of interest defined by medical experts. Multiple natural language processing (NLP) topic classification models were trained and compared using F1 scores. Best performing models were applied to classify unlabeled conversations. A total of 33,081 isolated patients (mean age 33.9, range 0-100; 44% female), including 30,398 cases and 2,683 contacts, were registered in WelTel. Registered patients generated 12,119 interactive text conversations in Kinyarwanda (n = 8,183, 67%), English (n = 3,069, 25%) and other languages. Sufficiently trained large language models (LLMs) were unavailable for Kinyarwanda. Traditional machine learning (ML) models outperformed fine-tuned transformer architecture language models on the native untranslated language corpus; however, the reverse was observed for models trained on English-only data. The most frequently identified topics discussed included symptoms (69%), diagnostics (38%), social issues (19%), prevention (18%), healthcare logistics (16%), and treatment (8.5%). Education, advice, and triage on these topics were provided to patients.
Interactive text messaging can be used to remotely support isolated patients in pandemics at scale. NLP can help evaluate the medical and social factors that affect isolated patients which could ultimately inform precision public health responses to future pandemics.
Affiliation(s)
- Richard T. Lester
- Division of Infectious Diseases, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Matthew Manson
- Division of Infectious Diseases, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Muhammed Semakula
- Rwanda Ministry of Health, Kigali, Rwanda
- Rwanda Biomedical Centre, Kigali, Rwanda
- Hyeju Jang
- Luddy School of Informatics, Computing, and Engineering, Department of Computer Science, Indiana University Indianapolis, Indianapolis, Indiana, United States
- Department of Computer Science, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
- Ali Magzari
- Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, British Columbia, Canada
- Junhong Ma Blackmer
- Department of Mathematics, University of British Columbia, Vancouver, British Columbia, Canada
- Fanan Fattah
- Division of Infectious Diseases, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada
- Charles Ruranga
- African Center of Excellence in Data Science, University of Rwanda, Kigali, Rwanda
- Jean Claude S. Ngabonziza
- Rwanda Biomedical Centre, Kigali, Rwanda
- Department of Clinical Biology, University of Rwanda, Kigali, Rwanda
- Giuseppe Carenini
- Department of Computer Science, Faculty of Science, University of British Columbia, Vancouver, British Columbia, Canada
5
Zhang X, Wang Y, Jiang Y, Pacella CB, Zhang W. Integrating structured and unstructured data for predicting emergency severity: an association and predictive study using transformer-based natural language processing models. BMC Med Inform Decis Mak 2024; 24:372. [PMID: 39633370 PMCID: PMC11619330 DOI: 10.1186/s12911-024-02793-9] [Received: 08/18/2024] [Accepted: 11/28/2024] [Indexed: 12/07/2024]
Abstract
BACKGROUND Efficient triage in emergency departments (EDs) is critical for timely and appropriate care. Traditional triage systems primarily rely on structured data, but the increasing availability of unstructured data, such as clinical notes, presents an opportunity to enhance predictive models for assessing emergency severity and to explore associations between patient characteristics and severity outcomes. This study aimed to evaluate the effectiveness of combining structured and unstructured data to predict emergency severity more accurately. METHODS Data from the 2021 National Hospital Ambulatory Medical Care Survey (NHAMCS) for adult ED patients were used. Emergency severity was categorized into urgent (scores 1-3) and non-urgent (scores 4-5) based on the Emergency Severity Index. Unstructured data, including chief complaints and reasons for visit, were processed using a Bidirectional Encoder Representations from Transformers (BERT) model. Structured data included patient demographics and clinical information. Four machine learning models (Logistic Regression, Random Forest, Gradient Boosting, and Extreme Gradient Boosting) were applied to three data configurations: structured data only, unstructured data only, and combined data. A mean probability model was also created by averaging the predicted probabilities from the structured and unstructured models. RESULTS The study included 8,716 adult patients, of whom 74.6% were classified as urgent. Association analysis revealed significant predictors of emergency severity, including older age (OR = 2.13 for patients 65+), higher heart rate (OR = 1.56 for heart rates > 90 bpm), and specific chronic conditions such as chronic kidney disease (OR = 2.28) and coronary artery disease (OR = 2.55). Gradient Boosting with combined data demonstrated the highest performance, achieving an area under the curve (AUC) of 0.789, an accuracy of 0.726, and a precision of 0.892.
The mean probability model also showed improvements over structured-only models. CONCLUSIONS Combining structured and unstructured data improved the prediction of emergency severity in ED patients, highlighting the potential for enhanced triage systems. Integrating text data into predictive models can provide more accurate and nuanced severity assessments, improving resource allocation and patient outcomes. Further research should focus on real-time application and validation in diverse clinical settings.
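The two mechanical steps in the methods above, binarizing the Emergency Severity Index and averaging the probabilities of the structured and unstructured models, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's code: the function names, the 0.5 decision threshold, and the probability values are hypothetical.

```python
def esi_to_label(esi_score: int) -> int:
    """Map an ESI score to 1 (urgent, scores 1-3) or 0 (non-urgent, scores 4-5)."""
    if esi_score not in (1, 2, 3, 4, 5):
        raise ValueError(f"ESI score must be 1-5, got {esi_score}")
    return 1 if esi_score <= 3 else 0


def mean_probability(p_structured, p_unstructured):
    """Average the urgent-class probabilities of two models, patient by patient."""
    if len(p_structured) != len(p_unstructured):
        raise ValueError("both models must score the same patients")
    return [(a + b) / 2 for a, b in zip(p_structured, p_unstructured)]


# Hypothetical urgent-class probabilities for three patients (illustrative only).
p_struct = [0.875, 0.25, 0.125]   # model trained on structured data
p_text = [0.625, 0.75, 0.25]      # model trained on text embeddings
p_mean = mean_probability(p_struct, p_text)

# Assumed decision threshold of 0.5; the abstract does not state one.
predictions = [1 if p >= 0.5 else 0 for p in p_mean]
```

Averaging probabilities is a late-fusion design: each model is trained separately on its own data type, which contrasts with the combined-data configuration where both feature sets feed a single model.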
Affiliation(s)
- Xingyu Zhang
- Department of Communication Science and Disorders, School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Yanshan Wang
- Department of Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Yun Jiang
- School of Nursing, University of Michigan, Ann Arbor, MI, USA
- Charissa B Pacella
- Department of Emergency Medicine, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Wenbin Zhang
- Knight Foundation School of Computing & Information Sciences, Florida International University, Miami, USA
6
Culié D, Schiappa R, Contu S, Seutin E, Pace-Loscos T, Poissonnet G, Villarme A, Bozec A, Chamorey E. Enhancing Thyroid Pathology With Artificial Intelligence: Automated Data Extraction From Electronic Health Reports Using RUBY. JCO Clin Cancer Inform 2024; 8:e2300263. [PMID: 39657101 DOI: 10.1200/cci.23.00263] [Received: 12/14/2023] [Revised: 07/11/2024] [Accepted: 09/25/2024] [Indexed: 12/17/2024]
Abstract
PURPOSE Thyroid nodules are common in the general population, and assessing their malignancy risk is the initial step in care. Surgical exploration remains the sole definitive option for indeterminate nodules. Extensive database access is crucial for improving this initial assessment. Our objective was to develop an automated process using convolutional neural networks (CNNs) to extract and structure biomedical insights from electronic health reports (EHRs) in a large thyroid pathology cohort. MATERIALS AND METHODS We randomly selected 1,500 patients with thyroid pathology from our cohort for model development and an additional 100 for testing. We then divided the cohort of 1,500 patients into training (70%) and validation (30%) sets. We used EHRs from initial surgeon visits, preanesthesia visits, ultrasound, surgery, and anatomopathology reports. We selected 42 variables of interest and had them manually annotated by a clinical expert. We developed RUBY-THYRO using six distinct CNN models from SpaCy, supplemented with keyword extraction rules and postprocessing. Evaluation against a gold standard database included calculating precision, recall, and F1 score. RESULTS Performance remained consistent across the test and validation sets, with the majority of variables (30/42) achieving performance metrics exceeding 90% for all metrics in both sets. Results differed according to the variables; pathologic tumor stage score achieved 100% in precision, recall, and F1 score, versus 45%, 28%, and 32% for the number of nodules in the test set, respectively. Surgical and preanesthesia reports demonstrated particularly high performance. CONCLUSION Our study successfully implemented a CNN-based natural language processing (NLP) approach for extracting and structuring data from various EHRs in thyroid pathology. 
This highlights the potential of artificial intelligence-driven NLP techniques for extensive and cost-effective data extraction, paving the way for creating comprehensive, hospital-wide data warehouses.
Affiliation(s)
- Dorian Culié
- Cervico-Facial Oncology Surgical Department, University Institute of Face and Neck, Centre Antoine Lacassagne University of Côte d'Azur, Nice, France
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Renaud Schiappa
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Sara Contu
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Eva Seutin
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Tanguy Pace-Loscos
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
- Gilles Poissonnet
- Cervico-Facial Oncology Surgical Department, University Institute of Face and Neck, Centre Antoine Lacassagne University of Côte d'Azur, Nice, France
- Agathe Villarme
- Cervico-Facial Oncology Surgical Department, University Institute of Face and Neck, Centre Antoine Lacassagne University of Côte d'Azur, Nice, France
- Alexandre Bozec
- Cervico-Facial Oncology Surgical Department, University Institute of Face and Neck, Centre Antoine Lacassagne University of Côte d'Azur, Nice, France
- Emmanuel Chamorey
- Department of Epidemiology, Biostatistics and Health Data, Centre Antoine Lacassagne, University of Côte d'Azur, Nice, France
7
Cheligeer K, Wu G, Laws A, Quan ML, Li A, Brisson AM, Xie J, Xu Y. Validation of large language models for detecting pathologic complete response in breast cancer using population-based pathology reports. BMC Med Inform Decis Mak 2024; 24:283. [PMID: 39363322 PMCID: PMC11447988 DOI: 10.1186/s12911-024-02677-y] [Received: 03/01/2024] [Accepted: 09/09/2024] [Indexed: 10/05/2024]
Abstract
AIMS The primary goal of this study is to evaluate the capabilities of Large Language Models (LLMs) in understanding and processing complex medical documentation. We chose to focus on the identification of pathologic complete response (pCR) in narrative pathology reports. This approach aims to contribute to the advancement of comprehensive reporting, health research, and public health surveillance, thereby enhancing patient care and breast cancer management strategies. METHODS The study utilized two analytical pipelines, developed with open-source LLMs within the healthcare system's computing environment. First, we extracted embeddings from pathology reports using 15 different transformer-based models and then employed logistic regression on these embeddings to classify the presence or absence of pCR. Secondly, we fine-tuned the Generative Pre-trained Transformer-2 (GPT-2) model by attaching a simple feed-forward neural network (FFNN) layer to improve the detection performance of pCR from pathology reports. RESULTS In a cohort of 351 female breast cancer patients who underwent neoadjuvant chemotherapy (NAC) and subsequent surgery between 2010 and 2017 in Calgary, the optimized method displayed a sensitivity of 95.3% (95%CI: 84.0-100.0%), a positive predictive value of 90.9% (95%CI: 76.5-100.0%), and an F1 score of 93.0% (95%CI: 83.7-100.0%). The results, achieved through diverse LLM integration, surpassed traditional machine learning models, underscoring the potential of LLMs in clinical pathology information extraction. CONCLUSIONS The study successfully demonstrates the efficacy of LLMs in interpreting and processing digital pathology data, particularly for determining pCR in breast cancer patients post-NAC. The superior performance of LLM-based pipelines over traditional models highlights their significant potential in extracting and analyzing key clinical data from narrative reports. 
While promising, these findings highlight the need for future external validation to confirm the reliability and broader applicability of these methods.
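The F1 score reported above is the harmonic mean of the positive predictive value (precision) and sensitivity (recall), so the three point estimates can be checked against each other. The sketch below does that arithmetic; the function names are assumptions for this example, and only the 95.3% and 90.9% figures come from the abstract.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_from_counts(tp: int, fp: int, fn: int):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Point estimates reported in the abstract: sensitivity 95.3%, PPV 90.9%.
f1 = f1_from_precision_recall(precision=0.909, recall=0.953)
print(f"{f1:.3f}")  # prints 0.930
```

The result agrees with the reported F1 score of 93.0%, which suggests the three figures were computed on the same set of predictions.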
Affiliation(s)
- Ken Cheligeer
- The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Provincial Research Data Services, Alberta Health Services, Calgary, Canada
- Guosong Wu
- The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Alison Laws
- Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Canada
- May Lynn Quan
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Andrea Li
- The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Anne-Marie Brisson
- Department of Radiology, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Jason Xie
- The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Yuan Xu
- The Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Surgery, Cumming School of Medicine, University of Calgary, Calgary, Canada
- Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Canada
8
Kilroy D, Healy G, Caton S. Prediction of future customer needs using machine learning across multiple product categories. PLoS One 2024; 19:e0307180. [PMID: 39186503 PMCID: PMC11346667 DOI: 10.1371/journal.pone.0307180] [Received: 07/04/2023] [Accepted: 07/01/2024] [Indexed: 08/28/2024]
Abstract
In recent years, computational approaches for extracting customer needs from user-generated content have been proposed. However, there is a lack of studies that focus on extracting unmet needs for future popular products. Therefore, this study presents a supervised keyphrase classification model which predicts needs that will become popular in real products in the marketplace. To do this, we utilize Trending Customer Needs (TCN), a monthly dataset of trending keyphrase customer needs occurring in new products during 2011-2021 across multiple categories of Consumer Packaged Goods (e.g., toothpaste, eyeliner, and beer). This is the first study to use this specific dataset; we train a time series algorithm to learn the relationship between features generated for each candidate keyphrase on Reddit and the keyphrases that appear in the dataset 1-3 years in the future. We show that our approach outperforms a baseline in the literature and, through Multi-Task Learning, can accurately predict needs for a category it was not trained on (e.g., training on toothpaste, cereal, and beer products yet still predicting for shampoo products). The findings from this research could provide many advantages to businesses, such as gaining early access to markets.
Affiliation(s)
- David Kilroy
- School of Computer Science, University College Dublin, Dublin, Ireland
- Graham Healy
- School of Computing, Dublin City University, Dublin, Ireland
- Simon Caton
- School of Computer Science, University College Dublin, Dublin, Ireland
9
Hou S, Tang T, Cheng S, Liu Y, Xia T, Chen T, Fuhrman J, Sun F. DeepMicroClass sorts metagenomic contigs into prokaryotes, eukaryotes and viruses. NAR Genom Bioinform 2024; 6:lqae044. [PMID: 38711860 PMCID: PMC11071121 DOI: 10.1093/nargab/lqae044] [Received: 09/07/2023] [Revised: 03/18/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024]
Abstract
Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes, i.e. viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at various tested sequence lengths ranging from 500 bp to 100 kbp. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass obtained better performance for eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved comparable performance on viral sequence classification with geNomad and VirSorter2 when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.
Affiliation(s)
- Shengwei Hou
- Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Tianqi Tang
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Siliangyu Cheng
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Yuanhao Liu
- Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Tian Xia
- Department of Ocean Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
- Ting Chen
- Department of Computer Science and Technology, Institute of Artificial Intelligence & BNRist, Tsinghua University, Beijing 100084, China
- Jed A Fuhrman
- Marine and Environmental Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
10
Del Gaizo J, Sherard C, Shorbaji K, Welch B, Mathi R, Kilic A. Prediction of coronary artery bypass graft outcomes using a single surgical note: An artificial intelligence-based prediction model study. PLoS One 2024; 19:e0300796. [PMID: 38662684 PMCID: PMC11045137 DOI: 10.1371/journal.pone.0300796] [Received: 09/11/2023] [Accepted: 03/05/2024] [Indexed: 04/28/2024]
Abstract
BACKGROUND Healthcare providers currently calculate risk of the composite outcome of morbidity or mortality associated with a coronary artery bypass grafting (CABG) surgery through manual input of variables into a logistic regression-based risk calculator. This study indicates that automated artificial intelligence (AI)-based techniques can instead calculate risk. Specifically, we present novel numerical embedding techniques that enable NLP (natural language processing) models to achieve higher performance than the risk calculator using a single preoperative surgical note. METHODS The most recent preoperative surgical consult notes of 1,738 patients who received an isolated CABG from July 1, 2014 to November 1, 2022 at a single institution were analyzed. The primary outcome was the Society of Thoracic Surgeons defined composite outcome of morbidity or mortality (MM). We tested three numerical-embedding techniques on the widely used TextCNN classification model: 1a) Basic embedding, treat numbers as word tokens; 1b) Basic embedding with a dataloader that Replaces out-of-context (ROOC) numbers with a tag, where context is defined as within a number of tokens of specified keywords; 2) ScaleNum, an embedding technique that scales in-context numbers via a learned sigmoid-linear-log function; and 3) AttnToNum, a ScaleNum-derivative that updates the ScaleNum embeddings via multi-headed attention applied to local context. Predictive performance was measured via area under the receiver operating characteristic curve (AUC) on holdout sets from 10 random-split experiments. For eXplainable-AI (X-AI), we calculate SHapley Additive exPlanation (SHAP) values at an ngram resolution (SHAP-N). While the analyses focus on TextCNN, we execute an analogous performance pipeline with a long short-term memory (LSTM) model to test if the numerical embedding advantage is robust to model architecture. RESULTS A total of 567 (32.6%) patients had MM following CABG. 
The embedding performances are as follows with the TextCNN architecture: 1a) Basic, mean AUC 0.788 [95% CI (confidence interval): 0.768-0.809]; 1b) ROOC, 0.801 [CI: 0.788-0.815]; 2) ScaleNum, 0.808 [CI: 0.785-0.821]; and 3) AttnToNum, 0.821 [CI: 0.806-0.834]. The LSTM architecture produced a similar trend. Permutation tests indicate that AttnToNum outperforms the other embedding techniques, though the difference versus ScaleNum is not statistically significant (p-value of .07). SHAP-N analyses indicate that the model learns to associate low blood urea nitrogen (BUN) and creatinine values with survival. A correlation analysis of the attention-updated numerical embeddings indicates that AttnToNum learns to incorporate both number magnitude and local context to derive semantic similarities. CONCLUSION This research presents novel quantitative and clinical contributions. Quantitatively, we contribute two new embedding techniques: AttnToNum and ScaleNum. Both can embed strictly positive and bounded numerical values, and both surpass basic embeddings in predictive performance. The results suggest AttnToNum outperforms ScaleNum. With regard to clinical research, we show that AI methods can predict outcomes after CABG using a single preoperative note at a performance that matches or surpasses the current risk calculator. These findings reveal the potential role of NLP in automated registry reporting and quality improvement.
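The ROOC step described in the methods (replace numbers that are not within a set number of tokens of specified keywords with a tag) can be illustrated with a minimal sketch. It assumes whitespace-split tokens and a symmetric window counted in tokens; the keyword set, window size, tag string `<num>`, and function name are hypothetical choices for illustration and are not taken from the study.

```python
import re

# Matches plain integers and decimals such as "67" or "1.2" (illustrative pattern).
NUMBER = re.compile(r"^\d+(\.\d+)?$")


def replace_out_of_context_numbers(tokens, keywords, window=3, tag="<num>"):
    """Replace numeric tokens that are farther than `window` tokens
    from every keyword with `tag`; keep in-context numbers unchanged."""
    keyword_positions = [
        i for i, tok in enumerate(tokens) if tok.lower() in keywords
    ]
    result = []
    for i, tok in enumerate(tokens):
        if NUMBER.match(tok):
            in_context = any(abs(i - k) <= window for k in keyword_positions)
            result.append(tok if in_context else tag)
        else:
            result.append(tok)
    return result


# Hypothetical note fragment: only the number near "creatinine" is kept.
example = "creatinine was 1.2 and patient is 67 years old seen in room 12".split()
cleaned = replace_out_of_context_numbers(example, {"creatinine"}, window=2)
```

In this toy fragment the creatinine value survives while the age and room number are replaced by the tag, which is the kind of filtering the abstract attributes to the ROOC dataloader.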
Affiliation(s)
- John Del Gaizo
  - Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Curry Sherard
  - College of Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Khaled Shorbaji
  - Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Brett Welch
  - Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Roshan Mathi
  - Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America
  - College of Medicine, Medical University of South Carolina, Charleston, South Carolina, United States of America
- Arman Kilic
  - Division of Cardiothoracic Surgery, Department of Surgery, Medical University of South Carolina, Charleston, South Carolina, United States of America
|
11
|
Khairuddin MZF, Sankaranarayanan S, Hasikin K, Abd Razak NA, Omar R. Contextualizing injury severity from occupational accident reports using an optimized deep learning prediction model. PeerJ Comput Sci 2024; 10:e1985. [PMID: 38660193 PMCID: PMC11042013 DOI: 10.7717/peerj-cs.1985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Accepted: 03/21/2024] [Indexed: 04/26/2024]
Abstract
Background This study introduced a novel approach for predicting occupational injury severity by leveraging deep learning-based text classification techniques to analyze unstructured narratives. Unlike conventional methods that rely on structured data, our approach recognizes the richness of information within injury narrative descriptions with the aim of extracting valuable insights for improved occupational injury severity assessment. Methods Natural language processing (NLP) techniques were used to preprocess the occupational injury narratives obtained from the US Occupational Safety and Health Administration (OSHA) from January 2015 to June 2023. The methodology involved preprocessing of textual narratives to standardize text and eliminate noise, followed by the integration of Term Frequency-Inverse Document Frequency (TF-IDF) and Global Vectors (GloVe) word embeddings for text representation. The proposed predictive model adopts a Bidirectional Long Short-Term Memory (Bi-LSTM) architecture and is further refined through model optimization, including random-search hyperparameter tuning and in-depth feature importance analysis. The optimized Bi-LSTM model was compared and validated against other machine learning classifiers, namely naïve Bayes, support vector machine, random forest, decision tree, and K-nearest neighbor. Results The proposed optimized Bi-LSTM model showed superior predictive performance, with an accuracy of 0.95 for hospitalization and 0.98 for amputation cases and faster model processing times. The feature importance analysis revealed predictive keywords related to the causal factors of occupational injuries, thereby providing valuable insights to enhance model interpretability. Conclusion Our proposed optimized Bi-LSTM model offers safety and health practitioners an effective tool to support proactive workplace safety measures, thereby contributing to business productivity and sustainability. This study lays the foundation for further exploration of predictive analytics in the occupational safety and health domain.
Affiliation(s)
- Suresh Sankaranarayanan
  - Department of Computer Science, College of Computer Science and Information Technology, King Faisal University, Hofuf, Kingdom of Saudi Arabia
- Khairunnisa Hasikin
  - Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia
- Nasrul Anuar Abd Razak
  - Department of Biomedical Engineering, Faculty of Engineering, Universiti Malaya, Kuala Lumpur, Kuala Lumpur, Malaysia
- Rosidah Omar
  - Occupational and Environmental Health Unit, Kedah State Health Department, Alor Setar, Kedah, Malaysia
|
12
|
Beaulieu-Jones BK, Villamar MF, Scordis P, Bartmann AP, Ali W, Wissel BD, Alsentzer E, de Jong J, Patra A, Kohane I. Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study. Lancet Digit Health 2023; 5:e882-e894. [PMID: 38000873 PMCID: PMC10695164 DOI: 10.1016/s2589-7500(23)00179-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/10/2023] [Revised: 08/08/2023] [Accepted: 08/31/2023] [Indexed: 11/26/2023]
Abstract
BACKGROUND The evaluation and management of first-time seizure-like events in children can be difficult because these episodes are not always directly observed and might be epileptic seizures or other conditions (seizure mimics). We aimed to evaluate whether machine learning models using real-world data could predict seizure recurrence after an initial seizure-like event. METHODS This retrospective cohort study compared models trained and evaluated on two separate datasets between Jan 1, 2010, and Jan 1, 2020: electronic medical records (EMRs) at Boston Children's Hospital and de-identified, patient-level, administrative claims data from the IBM MarketScan research database. The study population comprised patients with an initial diagnosis of either epilepsy or convulsions before the age of 21 years, based on International Classification of Diseases, Clinical Modification (ICD-CM) codes. We compared machine learning-based predictive modelling using structured data (logistic regression and XGBoost) with emerging techniques in natural language processing by use of large language models. FINDINGS The primary cohort comprised 14 021 patients at Boston Children's Hospital matching inclusion criteria with an initial seizure-like event and the comparison cohort comprised 15 062 patients within the IBM MarketScan research database. Seizure recurrence based on a composite expert-derived definition occurred in 57% of patients at Boston Children's Hospital and 63% of patients within IBM MarketScan. Large language models with additional domain-specific and location-specific pre-training on patients excluded from the study (F1-score 0·826 [95% CI 0·817-0·835], AUC 0·897 [95% CI 0·875-0·913]) performed best. All large language models, including the base model without additional pre-training (F1-score 0·739 [95% CI 0·738-0·741], AUROC 0·846 [95% CI 0·826-0·861]) outperformed models trained with structured data. 
With structured data only, XGBoost outperformed logistic regression, and models trained with the Boston Children's Hospital EMR (logistic regression: F1-score 0·650 [95% CI 0·643-0·657], AUC 0·694 [95% CI 0·685-0·705], XGBoost: F1-score 0·679 [0·676-0·683], AUC 0·725 [0·717-0·734]) performed similarly to models trained on the IBM MarketScan database (logistic regression: F1-score 0·596 [0·590-0·601], AUC 0·670 [0·664-0·675], XGBoost: F1-score 0·678 [0·668-0·687], AUC 0·710 [0·703-0·714]). INTERPRETATION Physicians' clinical notes about an initial seizure-like event include substantial signals for prediction of seizure recurrence, and additional domain-specific and location-specific pre-training can significantly improve the performance of clinical large language models, even for specialised cohorts. FUNDING UCB, National Institute of Neurological Disorders and Stroke (US National Institutes of Health).
Affiliation(s)
- Brett K Beaulieu-Jones
  - Department of Medicine, University of Chicago, Chicago, IL, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Mauricio F Villamar
  - Department of Neurology, The Warren Alpert Medical School of Brown University, Providence, RI, USA
- Benjamin D Wissel
  - Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Emily Alsentzer
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Isaac Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
13
|
Wang Z, Wang B, Ren M, Gao D. A new hazard event classification model via deep learning and multifractal. COMPUT IND 2023. [DOI: 10.1016/j.compind.2023.103875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2023]
|
14
|
Predicting suicidal and self-injurious events in a correctional setting using AI algorithms on unstructured medical notes and structured data. J Psychiatr Res 2023; 160:19-27. [PMID: 36773344 DOI: 10.1016/j.jpsychires.2023.01.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2022] [Revised: 01/23/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023]
Abstract
Suicidal and self-injurious incidents in correctional settings deplete institutional and healthcare resources and create disorder and stress for staff and other inmates. Traditional statistical analyses provide some guidance, but they can only be applied to structured data that are often difficult to collect, and their recommendations are often expensive to act upon. This study aims to extract information from medical and mental health progress notes using AI algorithms to make actionable predictions of suicidal and self-injurious events, in order to improve the efficiency of triage for health care services and prevent suicidal and injurious events from happening at California's Orange County Jails. The results showed that the notes data contain more information with respect to suicidal or injurious behaviors than the structured data available in the EHR database at the Orange County Jails. Using the notes data alone (under-sampled to 50%) in a Transformer Encoder model produced an AUC-ROC of 0.862, a Sensitivity of 0.816, and a Specificity of 0.738. Incorporating the information extracted from the notes data into traditional Machine Learning models as a feature alongside structured data (under-sampled to 50%) yielded better performance in terms of Sensitivity (AUC-ROC: 0.77, Sensitivity: 0.89, Specificity: 0.65). In addition, under-sampling is an effective approach to mitigating the impact of the extremely imbalanced classes.
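As an illustration of the under-sampling and the reported metrics, the sketch below keeps every positive example and randomly drops negatives until positives make up a target fraction of the training set, then computes sensitivity and specificity from binary predictions. Reading "under-sampled to 50%" as a 50% positive share is an assumption, as are the function names and the toy labels; none of this is the study's own code.

```python
import random


def undersample_majority(labels, positive_fraction=0.5, seed=0):
    """Return sorted indices that keep all positives (label 1) and a random
    subset of negatives (label 0) so positives reach `positive_fraction`.
    Assumes 0 < positive_fraction <= 1."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Negatives needed so that len(pos) / (len(pos) + n_neg) == positive_fraction.
    n_neg = int(round(len(pos) * (1 - positive_fraction) / positive_fraction))
    n_neg = min(n_neg, len(neg))
    return sorted(pos + rng.sample(neg, n_neg))


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy imbalanced label vector (hypothetical): 3 events among 20 records.
toy_labels = [1] * 3 + [0] * 17
kept = undersample_majority(toy_labels)
```

Under-sampling should be applied to the training split only, so that sensitivity and specificity are measured on data with the original class balance.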
|
15
|
Hossain E, Rana R, Higgins N, Soar J, Barua PD, Pisani AR, Turner K. Natural Language Processing in Electronic Health Records in relation to healthcare decision-making: A systematic review. Comput Biol Med 2023; 155:106649. [PMID: 36805219 DOI: 10.1016/j.compbiomed.2023.106649] [Citation(s) in RCA: 88] [Impact Index Per Article: 44.0] [Received: 10/03/2022] [Revised: 01/04/2023] [Accepted: 02/07/2023] [Indexed: 02/12/2023]
Abstract
BACKGROUND Natural Language Processing (NLP) is widely used to extract clinical insights from Electronic Health Records (EHRs). However, the lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively. METHODOLOGY After screening 261 articles from 11 databases, we included 127 papers for full-text review covering seven categories of articles: (1) medical note classification, (2) clinical entity recognition, (3) text summarisation, (4) deep learning (DL) and transfer learning architecture, (5) information extraction, (6) medical language translation and (7) other NLP applications. This study follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. RESULT AND DISCUSSION EHR was the most commonly used data type among the selected articles, and the datasets were primarily unstructured. Various ML and DL methods were used, with prediction or classification being the most common application of ML or DL. The most common use cases were: the International Classification of Diseases, Ninth Revision (ICD-9) classification, clinical note analysis, and named entity recognition (NER) for clinical descriptions and research on psychiatric disorders. CONCLUSION We find that the adopted ML models were not adequately assessed. In addition, the data imbalance problem is quite important, yet techniques to address this underlying problem are still needed. Future studies should address key limitations in studies, primarily identifying Lupus Nephritis, Suicide Attempts, perinatal self-harm and ICD-9 classification.
Affiliation(s)
- Elias Hossain
  - School of Engineering & Physical Sciences, North South University, Dhaka 1229, Bangladesh
- Rajib Rana
  - School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield Central QLD 4300, Australia
- Niall Higgins
  - School of Management and Enterprise, University of Southern Queensland, Darling Heights QLD 4350, Australia; School of Nursing, Queensland University of Technology, Kelvin Grove, Brisbane, QLD 4000, Australia; Metro North Mental Health, Herston QLD 4029, Australia
- Jeffrey Soar
  - School of Business, University of Southern Queensland, Springfield Central QLD 4300, Australia
- Prabal Datta Barua
  - School of Business, University of Southern Queensland, Springfield Central QLD 4300, Australia
- Anthony R Pisani
  - Center for the Study and Prevention of Suicide, University of Rochester, Rochester, NY, United States
- Kathryn Turner
  - School of Nursing, Queensland University of Technology, Kelvin Grove, Brisbane, QLD 4000, Australia
|
16
|
Parwez MA, Fazil M, Arif M, Nafis MT, Auwul MR. Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2023; 2023:2989791. [PMID: 39262497 PMCID: PMC11390191 DOI: 10.1155/2023/2989791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 09/26/2022] [Accepted: 09/27/2022] [Indexed: 09/13/2024]
Abstract
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, the volume of scientific literature, clinical notes, and other structured and unstructured text resources stored in data sources such as PubMed is rapidly increasing. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Neural network-based classification models, which take numeric vectors (also known as word representations) of training data as input, have recently gained popularity. The better the input vectors, the more accurate the classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its vector and semantically similar words, based on their contexts, appear near each other. However, such distributional word representations are incapable of encapsulating relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem which aims to extract relational words, which associate distant entities generally representing the subject and object of a sentence. Our goal is to capture the relational semantic information between distant words from a large corpus to learn enhanced word representations and employ the learned word representations for various natural language processing tasks such as text classification. In this article, we have proposed an application of biomedical relation triplets to learn word representations by incorporating relational semantic information within the distributional representation of words. In other words, the proposed approach aims to capture both distributional and relational contexts of the words to learn their numeric vectors from a text corpus. We have also proposed an application of the learned word representations for text classification.
The proposed approach is evaluated over multiple benchmark datasets, and the efficacy of the learned word representations is tested in terms of word similarity and concept categorization tasks. Our proposed approach provides better performance in comparison to the state-of-the-art GloVe model. Furthermore, we have applied the learned word representations to classify biomedical texts using four neural network-based classification models, and the classification accuracy further confirms the effectiveness of the learned word representations by our proposed approach.
Affiliation(s)
- Md Aslam Parwez
  - Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India
- Mohd Fazil
  - University of Limerick, Limerick, Ireland
- Muhammad Arif
  - Department of Computer Science, Superior University Lahore, Lahore 54000, Pakistan
- Md Tabrez Nafis
  - Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India
- Md Rabiul Auwul
  - Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Agricultural University, Gazipur 1706, Bangladesh
|
17
|
Turner J, Kantardzic M, Vickers-Smith R, Brown AG. Detecting Tweets Containing Cannabidiol-Related COVID-19 Misinformation Using Transformer Language Models and Warning Letters From Food and Drug Administration: Content Analysis and Identification. JMIR INFODEMIOLOGY 2023; 3:e38390. [PMID: 36844029 PMCID: PMC9941900 DOI: 10.2196/38390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Revised: 09/07/2022] [Accepted: 11/30/2022] [Indexed: 06/18/2023]
Abstract
BACKGROUND COVID-19 has introduced yet another opportunity for web-based sellers of loosely regulated substances, such as cannabidiol (CBD), to promote sales under false pretenses of curing the disease. Therefore, it has become necessary to innovate ways to identify such instances of misinformation. OBJECTIVE We sought to identify COVID-19 misinformation as it relates to the sales or promotion of CBD and used transformer-based language models to identify tweets semantically similar to quotes taken from known instances of misinformation. In this case, the known misinformation was the publicly available Warning Letters from the Food and Drug Administration (FDA). METHODS We collected tweets using CBD- and COVID-19-related terms. Using a previously trained model, we extracted the tweets indicating commercialization and sales of CBD and annotated those containing COVID-19 misinformation according to the FDA definitions. We encoded the collection of tweets and misinformation quotes into sentence vectors and then calculated the cosine similarity between each quote and each tweet. This allowed us to establish a threshold to identify tweets that were making false claims regarding CBD and COVID-19 while minimizing the instances of false positives. RESULTS We demonstrated that by using quotes taken from Warning Letters issued by the FDA to perpetrators of similar misinformation, we can identify semantically similar tweets that also contain misinformation. This was accomplished by identifying a cosine distance threshold between the sentence vectors of the Warning Letters and tweets. CONCLUSIONS This research shows that commercial CBD or COVID-19 misinformation can potentially be identified and curbed using transformer-based language models and known prior instances of misinformation. Our approach functions without the need for labeled data, potentially reducing the time needed to identify misinformation.
Our approach shows promise in that it is easily adapted to identify other forms of misinformation related to loosely regulated substances.
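The similarity step in the methods (cosine similarity between each quote vector and each tweet vector, followed by a threshold) can be sketched as follows. The toy two-dimensional vectors stand in for transformer sentence embeddings, and the function names and the threshold of 0.7 are hypothetical; the abstract does not report the threshold used. Cosine distance, as mentioned in the results, is one minus the similarity computed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_similar_tweets(tweet_vectors, quote_vectors, threshold):
    """Return (tweet_index, best_similarity) for tweets whose closest
    quote reaches the similarity threshold."""
    flagged = []
    for index, tweet in enumerate(tweet_vectors):
        best = max(cosine_similarity(tweet, quote) for quote in quote_vectors)
        if best >= threshold:
            flagged.append((index, best))
    return flagged


# Toy vectors (hypothetical); real inputs would be sentence embeddings.
tweets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
quotes = [[1.0, 0.0]]
flagged = flag_similar_tweets(tweets, quotes, threshold=0.7)
```

Raising the threshold trades recall for fewer false positives, which is the balance the abstract describes when setting its cut-off.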
Affiliation(s)
- Jason Turner
  - Data Mining Lab, Department of Computer Science and Engineering, J B Speed School of Engineering, University of Louisville, Louisville, KY, United States
- Mehmed Kantardzic
  - Data Mining Lab, Department of Computer Science and Engineering, J B Speed School of Engineering, University of Louisville, Louisville, KY, United States
- Rachel Vickers-Smith
  - Department of Epidemiology and Environmental Health, College of Public Health, University of Kentucky, Lexington, KY, United States
- Andrew G Brown
  - Department of Criminology and Criminal Justice, Northern Arizona University, Tempe, AZ, United States
|
18
|
Wang Y, Wang Y, Peng Z, Zhang F, Zhou L, Yang F. Medical text classification based on the discriminative pre-training model and prompt-tuning. Digit Health 2023; 9:20552076231193213. [PMID: 37559830 PMCID: PMC10408339 DOI: 10.1177/20552076231193213] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Accepted: 07/18/2023] [Indexed: 08/11/2023] Open
Abstract
Medical text classification, as a fundamental medical natural language processing task, aims to identify the categories to which a short medical text belongs. Current research has focused on performing the medical text classification task using a pre-trained language model through fine-tuning. However, this paradigm introduces additional parameters when training extra classifiers. Recent studies have shown that the "prompt-tuning" paradigm induces better performance in many natural language processing tasks because it bridges the gap between pre-training goals and downstream tasks. The main idea of prompt-tuning is to transform binary or multi-class classification tasks into mask prediction tasks by fully exploiting the features learned by pre-trained language models. This study explores, for the first time, how to classify medical texts using a discriminative pre-trained language model called ERNIE-Health through prompt-tuning. Specifically, we attempt to perform prompt-tuning based on the multi-token selection task, which is a pre-training task of ERNIE-Health. The raw text is wrapped into a new sequence with a template in which the category label is replaced by a [UNK] token. The model is then trained to calculate the probability distribution of the candidate categories. Our method is tested on the KUAKE-Question Intention Classification and CHiP-Clinical Trial Criterion datasets and obtains accuracy values of 0.866 and 0.861, respectively. In addition, the loss values of our model decrease faster throughout the training period compared with fine-tuning. The experimental results provide valuable insights to the community and suggest that prompt-tuning can be a promising approach to improve the performance of pre-trained models in domain-specific tasks.
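The template step can be pictured with a small sketch: the raw text is wrapped in a template whose label slot holds a [UNK] token, and scores for the candidate categories are normalised into a probability distribution. The template wording, the category names, and the scores below are hypothetical, and a plain softmax over given scores stands in for the model; the sketch does not reproduce ERNIE-Health or its multi-token selection head.

```python
import math

LABEL_SLOT = "[UNK]"  # placeholder token named in the abstract


def wrap_with_template(text, template="{text} This text is about {label}."):
    """Wrap raw text in a prompt template, leaving the label slot as [UNK].
    The template wording is an illustrative assumption."""
    return template.format(text=text, label=LABEL_SLOT)


def category_distribution(scores):
    """Softmax over candidate-category scores (dict of name -> score)."""
    highest = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {name: math.exp(value - highest) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


prompt = wrap_with_template("headache for 3 days")
# Hypothetical scores that a model might assign to each candidate category.
probabilities = category_distribution({"diagnosis": 2.0, "treatment": 0.0})
predicted = max(probabilities, key=probabilities.get)
```

Because the classification is posed as filling the label slot, no separate classifier layer is added on top of the pre-trained model, which is the contrast with fine-tuning drawn in the abstract.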
Affiliation(s)
- Yu Wang
  - School of Biomedical Engineering, Anhui Medical University, Hefei, China
- Yuan Wang
  - Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, China
- Zhenwan Peng
  - School of Biomedical Engineering, Anhui Medical University, Hefei, China
- Feifan Zhang
  - School of Biomedical Engineering, Anhui Medical University, Hefei, China
- Luyao Zhou
  - School of Biomedical Engineering, Anhui Medical University, Hefei, China
- Fei Yang
  - School of Biomedical Engineering, Anhui Medical University, Hefei, China
|
19
|
Quantum Fruit Fly algorithm and ResNet50-VGG16 for medical diagnosis. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
|
20
|
Natural Language Processing Techniques for Text Classification of Biomedical Documents: A Systematic Review. INFORMATION 2022. [DOI: 10.3390/info13100499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
The classification of biomedical literature is involved in a number of critical issues that physicians are expected to address. In many cases, these issues are extremely difficult. Classification can support tasks such as diagnosis and treatment, efficient representation of concepts such as medications, procedure codes, and patient visits, and quick document search or disease classification. Pathologies are sought from clinical notes, among other sources. The goal of this systematic review is to analyze the literature on various problems of classifying patients' medical texts, based on criteria such as the quality of the evaluation metrics used, the machine learning methods applied, and the data sets used, in order to highlight the best methods for this type of problem and to identify the associated challenges. The study covers the period from 1 January 2016 to 10 July 2022. We used multiple databases and archives of research articles, including Web Of Science, Scopus, MDPI, arXiv, IEEE, and ACM, to find 894 articles dealing with the subject of text classification, which we filtered using inclusion and exclusion criteria. Following a thorough review, we selected 33 articles dealing with biomedical text categorization issues. Following our investigation, we discovered two major issues linked to the methodology and data used for biomedical text classification. First, there is the data-centric challenge, followed by the data quality challenge.
|
21
|
The natural language processing of radiology requests and reports of chest imaging: Comparing five transformer models’ multilabel classification and a proof-of-concept study. Health Informatics J 2022; 28:14604582221131198. [DOI: 10.1177/14604582221131198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
Background Radiology requests and reports contain valuable information about diagnostic findings and indications, and transformer-based language models are promising for more accurate text classification. Methods In a retrospective study, 2256 radiologist-annotated radiology requests (8 classes) and reports (10 classes) were divided into training and testing datasets (90% and 10%, respectively) and used to train 32 models. Performance metrics were compared by model type (LSTM, Bertje, RobBERT, BERT-clinical, BERT-multilingual, BERT-base), text length, data prevalence, and training strategy. The best models were used to predict the categories of the remaining 40,873 cases in the request and report datasets. Results The RobBERT model performed the best after 4000 training iterations, resulting in AUC values ranging from 0.808 [95% CI (0.757–0.859)] to 0.976 [95% CI (0.956–0.996)] for the requests and 0.746 [95% CI (0.689–0.802)] to 1.0 [95% CI (1.0–1.0)] for the reports. The AUC for the classification of normal reports was 0.95 [95% CI (0.922–0.979)]. The predicted data demonstrated variability of both diagnostic yield for various request classes and request patterns related to COVID-19 hospital admission data. Conclusion Transformer-based natural language processing is feasible for the multilabel classification of chest imaging request and report items. Diagnostic yield varies with the information in the requests.
|