1
Leung T, Kasson E, Singh AK, Ren Y, Kaiser N, Huang M, Cavazos-Rehg PA. Topics and Sentiment Surrounding Vaping on Twitter and Reddit During the 2019 e-Cigarette and Vaping Use-Associated Lung Injury Outbreak: Comparative Study. J Med Internet Res 2022; 24:e39460. [PMID: 36512403 PMCID: PMC9795395 DOI: 10.2196/39460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 09/16/2022] [Accepted: 10/29/2022] [Indexed: 11/05/2022] Open
Abstract
BACKGROUND Vaping or e-cigarette use has become dramatically more popular in the United States in recent years. e-Cigarette and vaping use-associated lung injury (EVALI) cases caused an increase in hospitalizations and deaths in 2019, and many instances were later linked to unregulated products. Previous literature has leveraged social media data for surveillance of health topics. Individuals are willing to share mental health experiences and other personal stories on social media platforms where they feel a sense of community, reduced stigma, and empowerment.
OBJECTIVE This study aimed to compare vaping-related content on 2 popular social media platforms (ie, Twitter and Reddit) to explore the context surrounding vaping during the 2019 EVALI outbreak and to support the feasibility of using data from both social platforms to develop in-depth and intelligent vaping detection models on social media.
METHODS Data were extracted from both Twitter (316,620 tweets) and Reddit (17,320 posts) from July 2019 to September 2019 at the peak of the EVALI crisis. High-throughput computational analyses (sentiment analysis and topic analysis) were conducted. In addition, in-depth manual content analyses were performed and compared with computational analyses of content on both platforms (577 tweets and 613 posts).
RESULTS Vaping-related posts and unique users on Twitter and Reddit increased from July 2019 to September 2019, with the average number of posts per user increasing from 1.68 to 1.81 on Twitter and 1.19 to 1.21 on Reddit. Computational analyses found the number of positive sentiment posts to be higher on Reddit (P<.001, 95% CI 0.4305-0.4475) and the number of negative posts to be higher on Twitter (P<.001, 95% CI -0.4289 to -0.4111). These results were consistent with the clinical content analysis results, which indicated that negative sentiment posts were higher on Twitter (273/577, 47.3%) than Reddit (184/613, 30%). Furthermore, topics prevalent on both platforms by keywords and based on manual post reviews included mentions of youth, marketing or regulation, marijuana, and interest in quitting.
CONCLUSIONS Post content and trending topics overlapped on both Twitter and Reddit during the EVALI period in 2019. However, crucial differences in user type and content keywords were also found, including more frequent mentions of health-related keywords on Twitter and more negative health outcomes from vaping mentioned on both Reddit and Twitter. Use of both computational and clinical content analyses is critical to not only identify signals of public health trends among vaping-related social media content but also to provide context for vaping risks and behaviors. By leveraging the strengths of both Twitter and Reddit as publicly available data sources, this research may provide technical and clinical insights to inform automatic detection of social media users who are vaping and may benefit from digital intervention and proactive outreach strategies on these platforms.
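As a quick check on the manual content analysis figures quoted in the abstract above, the following sketch recomputes the two negative-sentiment percentages from the reported counts. It is illustrative arithmetic only; the helper name `share` and the variable names are our own and do not come from the study.

```python
# Counts quoted in the abstract: negative-sentiment posts in the
# manually coded samples (577 tweets, 613 Reddit posts).
twitter_negative, twitter_total = 273, 577
reddit_negative, reddit_total = 184, 613


def share(count: int, total: int) -> float:
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


twitter_pct = share(twitter_negative, twitter_total)
reddit_pct = share(reddit_negative, reddit_total)
gap = twitter_pct - reddit_pct  # difference in percentage points

print(f"Twitter: {twitter_pct:.1f}%  Reddit: {reddit_pct:.1f}%  gap: {gap:.1f} points")
```

The recomputed values agree with the 47.3% and 30% given in the abstract; the gap is a simple difference in percentage points and is not a significance test.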
Affiliation(s)
- Erin Kasson
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States
- Avineet Kumar Singh
- Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States
- Yang Ren
- Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States
- Nina Kaiser
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States
- Ming Huang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, United States
- Patricia A Cavazos-Rehg
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States
2
Luo L, Wang Y, Liu H. COVID-19 personal health mention detection from tweets using dual convolutional neural network. EXPERT SYSTEMS WITH APPLICATIONS 2022; 200:117139. [PMID: 35399189 PMCID: PMC8976569 DOI: 10.1016/j.eswa.2022.117139] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/03/2021] [Revised: 01/13/2022] [Accepted: 03/29/2022] [Indexed: 05/05/2023]
Abstract
Twitter offers extensive and valuable information on the spread of COVID-19 and the current state of public health. Mining tweets could be an important supplement for public health departments in monitoring the status of COVID-19 in a timely manner and taking the appropriate actions to minimize its impact. Identifying personal health mentions (PHM) is the first step of social media public health surveillance. It aims to identify whether a person's health condition is mentioned in a tweet, and it serves as a crucial method in tracking pandemic conditions in real time. However, social media texts contain noise, many creative and novel phrases, sarcastic emoji expressions, and misspellings. In addition, the class imbalance issue is usually very serious. To address these challenges, we built a COVID-19 PHM dataset containing more than 11,000 annotated tweets, and we proposed a dual convolutional neural network (CNN) framework using this dataset. An auxiliary CNN in the dual CNN structure provides supplemental information for the primary CNN in order to detect PHMs from tweets more effectively. The experiment shows that the proposed structure could alleviate the effect of class imbalance and could achieve promising results. This automated approach could monitor public health in real time and save disease-prevention departments from the tedious manual work in public health surveillance.
Affiliation(s)
- Linkai Luo
- Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Hong Kong Special Administrative Region
- Yue Wang
- Department of Supply Chain and Information Management, The Hang Seng University of Hong Kong, Hong Kong Special Administrative Region
- Hai Liu
- Department of Computing, The Hang Seng University of Hong Kong, Hong Kong Special Administrative Region
3
Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
4
Ren Y, Wu D, Singh A, Kasson E, Huang M, Cavazos-Rehg P. Automated Detection of Vaping-Related Tweets on Twitter During the 2019 EVALI Outbreak Using Machine Learning Classification. Front Big Data 2022; 5:770585. [PMID: 35224484 PMCID: PMC8866955 DOI: 10.3389/fdata.2022.770585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2021] [Accepted: 01/13/2022] [Indexed: 11/15/2022] Open
Abstract
There are increasingly strict regulations surrounding the purchase and use of combustible tobacco products (i.e., cigarettes); simultaneously, the use of other tobacco products, including e-cigarettes (i.e., vaping products), has dramatically increased. However, public attitudes toward vaping vary widely, and the health effects of vaping are still largely unknown. As a popular social media platform, Twitter contains rich information shared by users about their behaviors and experiences, including opinions on vaping. Manually identifying vaping-related tweets to source useful information is very challenging. In the current study, we proposed to develop a detection model to accurately identify vaping-related tweets using machine learning and deep learning methods. Specifically, we applied seven popular machine learning and deep learning algorithms, including Naïve Bayes, Support Vector Machine, Random Forest, XGBoost, Multilayer Perceptron, Transformer Neural Network, and stacking and voting ensemble models to build our customized classification model. We extracted a set of sample tweets during an outbreak of e-cigarette or vaping-related lung injury (EVALI) in 2019 and created an annotated corpus to train and evaluate these models. After comparing the performance of each model, we found that the stacking ensemble learning achieved the highest performance with an F1-score of 0.97. All models could achieve 0.90 or higher after tuning hyperparameters. The ensemble learning model has the best average performance. Our study findings provide informative guidelines and practical implications for the automated detection of themed social media data for public opinions and health surveillance purposes.
Affiliation(s)
- Yang Ren
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Dezhi Wu
- Department of Integrated Information Technology, University of South Carolina, Columbia, SC, United States
- Avineet Singh
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Erin Kasson
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States
- Ming Huang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, United States
- Patricia Cavazos-Rehg
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, United States
5
Kentour M, Lu J. An investigation into the deep learning approach in sentimental analysis using graph-based theories. PLoS One 2021; 16:e0260761. [PMID: 34855856 PMCID: PMC8638889 DOI: 10.1371/journal.pone.0260761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Accepted: 11/16/2021] [Indexed: 11/24/2022] Open
Abstract
Sentiment analysis is a branch of natural language analytics that aims to correlate what is expressed, which normally comes in an unstructured format, with what is believed and learnt. Several approaches have tried to address this gap (e.g., Naive Bayes, RNN, LSTM, word embedding). Even though the deep learning models achieved high performance, their generative process remains a "black box" that is not fully disclosed because of the high-dimensional features and the non-deterministic weight assignment. Meanwhile, graphs are becoming more popular for modeling complex systems while remaining traceable and understandable. Here, we reveal that a good trade-off between transparency and efficiency could be achieved with a Deep Neural Network by exploring the Credit Assignment Paths theory. To this end, we propose a novel algorithm which alleviates the feature extraction mechanism and attributes an importance level to selected neurons by applying deterministic edge/node embeddings with attention scores on the input unit and backward path, respectively. We experiment on the Twitter Health News dataset, where the model has been extended to approach different approximations (tweet/aspect and tweets' source levels, frequency, polarity/subjectivity); it was also transparent and traceable. Moreover, a comparison with four recent models on the same data corpus for tweet analysis showed rapid convergence, with an overall accuracy of ≈83% and 94% of true positive sentiments correctly identified. Therefore, weights can be ideally assigned to specific active features by following the proposed method. In contrast to the other compared works, the inferred features are conditioned through the users' preferences (i.e., frequency degree) and via the activation's derivatives (i.e., reject a feature if not scored). Future work will address the inductive aspect of graph embeddings to include dynamic graph structures and expand the model's resilience by considering other datasets such as SemEval task7, covid-19 tweets, etc.
Affiliation(s)
- Mohamed Kentour
- School of Computing and Engineering, University of Huddersfield, Huddersfield, West Yorkshire, United Kingdom
- Joan Lu
- School of Computing and Engineering, University of Huddersfield, Huddersfield, West Yorkshire, United Kingdom
6
Classifying patient and professional voice in social media health posts. BMC Med Inform Decis Mak 2021; 21:244. [PMID: 34407807 PMCID: PMC8371035 DOI: 10.1186/s12911-021-01577-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2021] [Accepted: 07/06/2021] [Indexed: 11/10/2022] Open
Abstract
Background Patient-based analysis of social media is a growing research field with the aim of delivering precision medicine but it requires accurate classification of posts relating to patients’ experiences. We motivate the need for this type of classification as a pre-processing step for further analysis of social media data in the context of related work in this area. In this paper we present experiments for a three-way document classification by patient voice, professional voice or other. We present results for a convolutional neural network classifier trained on English data from two different data sources (Reddit and Twitter) and two domains (cardiovascular and skin diseases).
Results We found that document classification by patient voice, professional voice or other can be done consistently manually (0.92 accuracy). Annotators agreed roughly equally for each domain (cardiovascular and skin) but they agreed more when annotating Reddit posts compared to Twitter posts. Best classification performance was obtained when training two separate classifiers for each data source, one for Reddit and one for Twitter posts, when evaluating on in-source test data for both test sets combined with an overall accuracy of 0.95 (and macro-average F1 of 0.92) and an F1-score of 0.95 for patient voice only.
Conclusion The main conclusion resulting from this work is that combining social media data from platforms with different characteristics for training a patient and professional voice classifier does not result in best possible performance. We showed that it is best to train separate models per data source (Reddit and Twitter) instead of a model using the combined training data from both sources. We also found that it is preferable to train separate models per domain (cardiovascular and skin) while showing that the difference to the combined model is only minor (0.01 accuracy). Our highest overall F1-score (0.95) obtained for classifying posts as patient voice is a very good starting point for further analysis of social media data reflecting the experience of patients.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01577-9.
7
Using BiLSTM Networks for Context-Aware Deep Sensitivity Labelling on Conversational Data. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10248924] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/17/2022]
Abstract
Information privacy is a critical design feature for any exchange system, with privacy-preserving applications requiring, most of the time, the identification and labelling of sensitive information. However, privacy and the concept of “sensitive information” are extremely elusive terms, as they are heavily dependent upon the context they are conveyed in. To accommodate such specificity, we first introduce a taxonomy of four context classes to categorise relationships of terms with their textual surroundings by meaning, interaction, precedence, and preference. We then propose a predictive context-aware model based on a Bidirectional Long Short Term Memory network with Conditional Random Fields (BiLSTM + CRF) to identify and label sensitive information in conversational data (multi-class sensitivity labelling). We train our model on a synthetic annotated dataset of real-world conversational data categorised in 13 sensitivity classes that we derive from the P3P standard. We parameterise and run a series of experiments featuring word and character embeddings and introduce a set of auxiliary features to improve model performance. Our results demonstrate that the BiLSTM + CRF model architecture with BERT embeddings and WordShape features is the most effective (F1 score 96.73%). Evaluation of the model is conducted under both temporal and semantic contexts, achieving a 76.33% F1 score on unseen data and outperforming Google’s Data Loss Prevention (DLP) system on sensitivity labelling tasks.
8
Waheeb SA, Ahmed Khan N, Chen B, Shang X. Machine Learning Based Sentiment Text Classification for Evaluating Treatment Quality of Discharge Summary. INFORMATION 2020; 11:281. [DOI: 10.3390/info11050281] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 09/09/2024] Open
Abstract
Patients’ discharge summaries (documents) are health sensors that are used for measuring the quality of treatment in medical centers. However, extracting information automatically from discharge summaries with unstructured natural language is considered challenging. These kinds of documents include various aspects of patient information that could be used to test the treatment quality for improving medical-related decisions. One significant technique in the literature for classifying discharge summaries is feature extraction from text data, drawn from the domain of natural language processing. We propose a novel sentiment analysis method for discharge summaries classification that relies on vector space models, statistical methods, association rules, and extreme learning machine autoencoder (ELM-AE). Our novel hybrid model is based on statistical methods that build the lexicon in a domain related to health and medical records. Meanwhile, our method examines treatment quality based on an idea inspired by sentiment analysis. Experiments show that our proposed method obtains a higher F1 value of 0.89 with good TPR (True Positive Rate) and FPR (False Positive Rate) values compared with various well-known state-of-the-art methods across training and testing datasets of different sizes. The results also show that our method provides a flexible and effective technique to examine treatment quality based on positive, negative, and neutral terms at the sentence level in each discharge summary.
Affiliation(s)
- Samer Abdulateef Waheeb
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
- Naseer Ahmed Khan
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
- Bolin Chen
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
- Xuequn Shang
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
9
Jiang K, Chen T, Calix RA, Bernard GR. Prediction of Personal Experience Tweets of Medication Use via Contextual Word Representations. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020; 2019:6093-6096. [PMID: 31947235 DOI: 10.1109/embc.2019.8856753] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/07/2022]
Abstract
Continuous monitoring of the safe use of medication is an important task in pharmacovigilance. The first-hand experiences of medication effects come from the consumers of the pharmaceuticals. Social media have been considered as a possible alternative data source for gathering consumer-generated information on their experience with medications. Identifying personal experience in social media data is a challenging task in natural language processing. In this study, we investigated a method of predicting personal experience tweets using Google's Bidirectional Encoder Representations from Transformers (BERT) and neural networks, in which BERT models contextually represented the tweet text. Both pre-trained BERT models and our BERT model trained with 3.2 million unlabeled tweets were examined. Our results show that our trained BERT model performs better than Google's pre-trained models (p < 0.01). This suggests that domain-specific data may contribute to the BERT model yielding better classification performance in predicting personal experience tweets of medication use.
10
Roy PK, Singh JP. Predicting closed questions on community question answering sites using convolutional neural network. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04592-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 10/25/2022]
11
Rodríguez-Martínez M, Garzón-Alfonso CC. Twitter Health Surveillance (THS) System. PROCEEDINGS : ... IEEE INTERNATIONAL CONFERENCE ON BIG DATA. IEEE INTERNATIONAL CONFERENCE ON BIG DATA 2019; 2018:1647-1654. [PMID: 30706061 DOI: 10.1109/bigdata.2018.8622504] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022]
Abstract
We present the Twitter Health Surveillance (THS) application framework. THS is designed as an integrated platform to help health officials collect tweets, determine if they are related to a medical condition, extract metadata out of them, and create a big data warehouse that can be used to further analyze the data. THS is built atop open source tools and provides the following value-added services: Data Acquisition, Tweet Classification, and Big Data Warehousing. In order to validate THS, we have created a collection of roughly twelve thousand labelled tweets. These tweets contain one or more target medical terms, and the labels indicate if the tweet is related or not to a medical condition. We used this collection to test various models based on LSTM and GRU recurrent neural networks. Our experiments show that we can classify tweets with 96% precision, 92% recall, and 91% F1 score. These results compare favorably with recent research in this area, and show the promise of our THS system.
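For readers comparing the precision, recall, and F1 figures quoted in the abstract above, the sketch below shows the standard F1 definition (the harmonic mean of precision and recall) applied to one precision/recall pair. The function name `f1_score` is our own illustrative helper, not part of the THS system, and the abstract does not state how its three figures were aggregated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract, taken as a single pair.
precision, recall = 0.96, 0.92
f1 = f1_score(precision, recall)

print(f"F1 for P={precision}, R={recall}: {f1:.3f}")
```

Treated as a single pair, 96% precision and 92% recall give an F1 of roughly 0.94. The abstract reports 91%, so its three figures may not all describe the same model or the same averaging scheme; the abstract alone does not settle this.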