Gu D, Wang Q, Chai Y, Yang X, Zhao W, Li M, Zolotarev O, Xu Z, Zhang G. Identifying the Risk Factors of Allergic Rhinitis Based on Zhihu Comment Data Using a Topic-Enhanced Word-Embedding Model: Mixed Method Study and Cluster Analysis.
J Med Internet Res 2024;
26:e48324. [PMID:
38386404 PMCID:
PMC10921335 DOI:
10.2196/48324]
[Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2023] [Revised: 10/30/2023] [Accepted: 01/03/2024] [Indexed: 02/23/2024] Open
Abstract
BACKGROUND
Allergic rhinitis (AR) is a chronic disease, and several risk factors predispose individuals to the condition in their daily lives, including exposure to allergens and inhalation irritants. Analyzing the potential risk factors that can trigger AR can provide reference material for individuals to use to reduce its occurrence in their daily lives. Nowadays, social media is a part of daily life, with an increasing number of people using at least 1 platform regularly. Social media enables users to share experiences among large groups of people who share the same interests and experience the same afflictions. Notably, these channels promote the ability to share health information.
OBJECTIVE
This study aims to construct an intelligent method (TopicS-ClusterREV) for identifying the risk factors of AR based on these social media comments. The main questions were as follows: How many comments contained AR risk factor information? How many categories can these risk factors be summarized into? How do these risk factors trigger AR?
METHODS
This study crawled all the data from May 2012 to May 2022 under the topic of allergic rhinitis on Zhihu, obtaining a total of 9628 posts and 33,747 comments. We improved the Skip-gram model to train topic-enhanced word vector representations (TopicS) and then vectorized annotated text items for training the risk factor classifier. Furthermore, cluster analysis enabled a closer look into the opinions expressed in the category, namely gaining insight into how risk factors trigger AR.
RESULTS
Our classifier identified more comments containing risk factors than the other classification models, with an accuracy rate of 96.1% and a recall rate of 96.3%. In general, we clustered texts containing risk factors into 28 categories, with season, region, and mites being the most common risk factors. We gained insight into the risk factors expressed in each category; for example, seasonal changes and increased temperature differences between day and night can disrupt the body's immune system and lead to the development of allergies.
CONCLUSIONS
Our approach can handle the amount of data and extract risk factors effectively. Moreover, the summary of risk factors can serve as a reference for individuals to reduce AR in their daily lives. The experimental data also provide a potential pathway that triggers AR. This finding can guide the development of management plans and interventions for AR.
Collapse