1
|
Overview of the 8th Social Media Mining for Health Applications (#SMM4H) shared tasks at the AMIA 2023 Annual Symposium. J Am Med Inform Assoc 2024; 31:991-996. [PMID: 38218723 PMCID: PMC10990511 DOI: 10.1093/jamia/ocae010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 01/05/2024] [Accepted: 01/11/2024] [Indexed: 01/15/2024]
Abstract
OBJECTIVE The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. In this paper, we present the annotated corpora, a technical summary of participants' systems, and the performance results. METHODS The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of 5 tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). RESULTS In total, 29 teams registered, representing 17 countries. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. CONCLUSION To facilitate future work, the datasets (a total of 61 353 posts) will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
|
2
|
Using Longitudinal Twitter Data for Digital Epidemiology of Childhood Health Outcomes: An Annotated Data Set and Deep Neural Network Classifiers. J Med Internet Res 2024; 26:e50652. [PMID: 38526542 PMCID: PMC11002733 DOI: 10.2196/50652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Revised: 09/05/2023] [Accepted: 09/19/2023] [Indexed: 03/26/2024]
Abstract
We manually annotated 9734 tweets that were posted by users who reported their pregnancy on Twitter, and used them to train, evaluate, and deploy deep neural network classifiers (F1-score=0.93) to detect tweets that report having a child with attention-deficit/hyperactivity disorder (678 users), autism spectrum disorders (1744 users), delayed speech (902 users), or asthma (1255 users), demonstrating the potential of Twitter as a complementary resource for assessing associations between pregnancy exposures and childhood health outcomes on a large scale.
|
3
|
Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review. J Med Internet Res 2024; 26:e47923. [PMID: 38488839 PMCID: PMC10980991 DOI: 10.2196/47923] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 07/28/2023] [Accepted: 08/01/2023] [Indexed: 03/19/2024]
Abstract
BACKGROUND Patient health data collected from a variety of nontraditional resources, commonly referred to as real-world data, can be a key information source for health and social science research. Social media platforms, such as Twitter (Twitter, Inc), offer vast amounts of real-world data. An important aspect of incorporating social media data in scientific research is identifying the demographic characteristics of the users who posted those data. Age and gender are considered key demographics for assessing the representativeness of the sample and enable researchers to study subgroups and disparities effectively. However, deciphering the age and gender of social media users poses challenges. OBJECTIVE This scoping review aims to summarize the existing literature on the prediction of the age and gender of Twitter users and provide an overview of the methods used. METHODS We searched 15 electronic databases and carried out reference checking to identify relevant studies that met our inclusion criteria: studies that predicted the age or gender of Twitter users using computational methods. The screening process was performed independently by 2 researchers to ensure the accuracy and reliability of the included studies. RESULTS Of the initial 684 studies retrieved, 74 (10.8%) studies met our inclusion criteria. Among these 74 studies, 42 (57%) focused on predicting gender, 8 (11%) focused on predicting age, and 24 (32%) predicted a combination of both age and gender. Gender prediction was predominantly approached as a binary classification task, with the reported performance of the methods ranging from 0.58 to 0.96 F1-score or 0.51 to 0.97 accuracy. Age prediction approaches varied in terms of classification groups, with a higher range of reported performance, ranging from 0.31 to 0.94 F1-score or 0.43 to 0.86 accuracy. 
The heterogeneous nature of the studies and the reporting of dissimilar performance metrics made it challenging to quantitatively synthesize results and draw definitive conclusions. CONCLUSIONS Our review found that although automated methods for predicting the age and gender of Twitter users have evolved to incorporate techniques such as deep neural networks, a significant proportion of the attempts rely on traditional machine learning methods, suggesting that there is potential to improve the performance of these tasks by using more advanced methods. Gender prediction has generally achieved a higher reported performance than age prediction. However, the lack of standardized reporting of performance metrics or standard annotated corpora to evaluate the methods used hinders any meaningful comparison of the approaches. Potential biases stemming from the collection and labeling of data used in the studies were identified as a problem, emphasizing the need for careful consideration and mitigation of biases in future studies. This scoping review provides valuable insights into the methods used for predicting the age and gender of Twitter users, along with the challenges and considerations associated with these methods.
|
4
|
Association Between COVID-19 During Pregnancy and Preterm Birth by Trimester of Infection: A Retrospective Cohort Study Using Longitudinal Social Media Data. medRxiv 2023:2023.11.17.23298696. [PMID: 38045356 PMCID: PMC10690358 DOI: 10.1101/2023.11.17.23298696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
Background Preterm birth, defined as birth at <37 weeks of gestation, is the leading cause of neonatal death globally and, together with low birthweight, the second leading cause of infant mortality in the United States. There is mounting evidence that COVID-19 infection during pregnancy is associated with an increased risk of preterm birth; however, data remain limited by trimester of infection. The ability to study COVID-19 infection during the earlier stages of pregnancy has been limited by available sources of data. The objective of this study was to use self-reports in large-scale, longitudinal social media data to assess the association between trimester of COVID-19 infection and preterm birth. Methods In this retrospective cohort study, we used natural language processing and machine learning, followed by manual validation, to identify pregnant Twitter users and to search their longitudinal collection of publicly available tweets for reports of COVID-19 infection during pregnancy and, subsequently, a preterm birth or term birth (i.e., a gestational age ≥37 weeks) outcome. Among the users who reported their pregnancy on Twitter, we also identified a 1:1 age-matched control group, consisting of users with a due date prior to January 1, 2020 (that is, without COVID-19 infection during pregnancy). We calculated the odds ratios (ORs) with 95% confidence intervals (CIs) to compare the overall rates of preterm birth for pregnancies with and without COVID-19 infection and by timing of infection: first trimester (weeks 1-13), second trimester (weeks 14-27), or third trimester (weeks 28-36). Results Through August 2022, we identified 298 Twitter users who reported COVID-19 infection during pregnancy, a preterm birth or term birth outcome, and maternal age: 94 (31.5%) with first-trimester infection, 110 (36.9%) second-trimester infection, and 95 (31.9%) third-trimester infection. 
In total, 26 (8.8%) of these 298 users reported preterm birth: 8 (8.5%) were infected during the first trimester, 7 (6.4%) were infected during the second trimester, and 12 (12.6%) were infected during the third trimester. In the 1:1 age-matched control group, 13 (4.4%) of the 298 users reported preterm birth. Overall, the risk of preterm birth was significantly higher for pregnancies with COVID-19 infection compared to those without (OR 2.1, 95% CI 1.06-4.16). In particular, the risk of preterm birth was significantly higher for pregnancies with COVID-19 infection during the third trimester (OR 3.17, CI 1.39-7.21). Conclusion The results of our study suggest that COVID-19 infection particularly during the third trimester is associated with an increased risk of preterm birth.
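The odds ratios reported above can be checked from the counts given in the abstract. The following is a minimal sketch, assuming a Wald (log-scale) 95% interval; the abstract does not state which interval method the authors used, and the function name `odds_ratio_ci` is hypothetical. Under that assumption, the counts reproduce the reported estimates to rounding.

```python
import math

def odds_ratio_ci(exposed_cases, exposed_total, control_cases, control_total, z=1.96):
    """Odds ratio with a Wald confidence interval (an assumed method, see lead-in)."""
    a = exposed_cases                    # preterm births with COVID-19 infection
    b = exposed_total - exposed_cases    # term births with COVID-19 infection
    c = control_cases                    # preterm births in the control group
    d = control_total - control_cases    # term births in the control group
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Counts taken from the abstract: 26 of 298 infected users and
# 13 of 298 control users reported preterm birth.
overall = odds_ratio_ci(26, 298, 13, 298)

# Third-trimester infection: 12 of 95 users reported preterm birth.
third_trimester = odds_ratio_ci(12, 95, 13, 298)

print("overall: OR %.2f (%.2f-%.2f)" % overall)
print("third trimester: OR %.2f (%.2f-%.2f)" % third_trimester)
```

The same function can be applied to the first- and second-trimester counts, whose odds ratios the abstract does not report.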
|
5
|
Overview of the 8th Social Media Mining for Health Applications (#SMM4H) Shared Tasks at the AMIA 2023 Annual Symposium. medRxiv 2023:2023.11.06.23298168. [PMID: 37986776 PMCID: PMC10659479 DOI: 10.1101/2023.11.06.23298168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
The aim of the Social Media Mining for Health Applications (#SMM4H) shared tasks is to take a community-driven approach to address the natural language processing and machine learning challenges inherent to utilizing social media data for health informatics. The eighth iteration of the #SMM4H shared tasks was hosted at the AMIA 2023 Annual Symposium and consisted of five tasks that represented various social media platforms (Twitter and Reddit), languages (English and Spanish), methods (binary classification, multi-class classification, extraction, and normalization), and topics (COVID-19, therapies, social anxiety disorder, and adverse drug events). In total, 29 teams registered, representing 18 countries. In this paper, we present the annotated corpora, a technical summary of the systems, and the performance results. In general, the top-performing systems used deep neural network architectures based on pre-trained transformer models. In particular, the top-performing systems for the classification tasks were based on single models that were pre-trained on social media corpora. To facilitate future work, the datasets (a total of 61,353 posts) will remain available by request, and the CodaLab sites will remain active for a post-evaluation phase.
|
6
|
Automatically Identifying Self-Reports of COVID-19 Diagnosis on Twitter: An Annotated Data Set, Deep Neural Network Classifiers, and a Large-Scale Cohort. J Med Internet Res 2023; 25:e46484. [PMID: 37399062 PMCID: PMC10365612 DOI: 10.2196/46484] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/13/2023] [Revised: 05/03/2023] [Accepted: 05/25/2023] [Indexed: 07/04/2023]
|
7
|
Pregex: Rule-Based Detection and Extraction of Twitter Data in Pregnancy. J Med Internet Res 2023; 25:e40569. [PMID: 36757756 PMCID: PMC9951068 DOI: 10.2196/40569] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/27/2022] [Revised: 09/02/2022] [Accepted: 01/22/2023] [Indexed: 01/23/2023]
|
8
|
Using Twitter Data for Cohort Studies of Drug Safety in Pregnancy: Proof-of-concept With β-Blockers. JMIR Form Res 2022; 6:e36771. [PMID: 35771614 PMCID: PMC9284350 DOI: 10.2196/36771] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Revised: 04/27/2022] [Accepted: 06/06/2022] [Indexed: 01/26/2023]
Abstract
Background Despite the fact that medication is taken during more than 90% of pregnancies, the fetal risk for most medications is unknown, and the majority of medications have no data regarding safety in pregnancy. Objective Using β-blockers as a proof-of-concept, the primary objective of this study was to assess the utility of Twitter data for a cohort study design—in particular, whether we could identify (1) Twitter users who have posted tweets reporting that they took medication during pregnancy and (2) their associated pregnancy outcomes. Methods We searched for mentions of β-blockers in 2.75 billion tweets posted by 415,690 users who announced their pregnancy on Twitter. We manually reviewed the matching tweets to first determine if the user actually took the β-blocker mentioned in the tweet. Then, to help determine if the β-blocker was taken during pregnancy, we used the time stamp of the tweet reporting intake and drew upon an automated natural language processing (NLP) tool that estimates the date of the user’s prenatal time period. For users who posted tweets indicating that they took or may have taken the β-blocker during pregnancy, we drew upon additional NLP tools to help identify tweets that report their pregnancy outcomes. Adverse pregnancy outcomes included miscarriage, stillbirth, birth defects, preterm birth (<37 weeks gestation), low birth weight (<5 pounds and 8 ounces at delivery), and neonatal intensive care unit (NICU) admission. Normal pregnancy outcomes included gestational age ≥37 weeks and birth weight ≥5 pounds and 8 ounces. Results We retrieved 5114 tweets, posted by 2339 users, that mention a β-blocker, and manually identified 2332 (45.6%) tweets, posted by 1195 (51.1%) of the users, that self-report taking the β-blocker. We were able to estimate the date of the prenatal time period for 356 pregnancies among 334 (27.9%) of these 1195 users. 
Among these 356 pregnancies, we identified 257 (72.2%) during which the β-blocker was or may have been taken. We manually verified an adverse pregnancy outcome—preterm birth, NICU admission, low birth weight, birth defects, or miscarriage—for 38 (14.8%) of these 257 pregnancies. We manually verified a gestational age ≥37 weeks for 198 (90.4%) and a birth weight ≥5 pounds and 8 ounces for 50 (22.8%) of the 219 pregnancies for which we did not identify an adverse pregnancy outcome. Conclusions Our ability to detect pregnancy outcomes for Twitter users who posted tweets reporting that they took or may have taken a β-blocker during pregnancy suggests that Twitter can be a complementary resource for cohort studies of drug safety in pregnancy.
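The outcome definitions in this abstract (preterm birth at <37 weeks of gestation, low birth weight at <5 pounds and 8 ounces) can be written as simple threshold checks. This is an illustrative sketch only: the function and constant names are hypothetical and it is not the NLP tooling used in the study, which works on tweet text rather than structured values.

```python
from typing import Dict, Optional

PRETERM_WEEKS = 37                 # preterm birth: <37 weeks of gestation
LOW_BIRTH_WEIGHT_OZ = 5 * 16 + 8   # low birth weight: <5 lb 8 oz (88 oz)

def to_ounces(pounds: int, ounces: int) -> int:
    """Convert a pounds-and-ounces birth weight to total ounces."""
    return pounds * 16 + ounces

def outcome_flags(gestational_weeks: Optional[int],
                  weight_oz: Optional[int]) -> Dict[str, Optional[bool]]:
    """Flag adverse outcomes; None means the value was not reported."""
    return {
        "preterm_birth": (None if gestational_weeks is None
                          else gestational_weeks < PRETERM_WEEKS),
        "low_birth_weight": (None if weight_oz is None
                             else weight_oz < LOW_BIRTH_WEIGHT_OZ),
    }

# A birth at 36 weeks weighing 5 lb 7 oz meets both adverse definitions.
example_adverse = outcome_flags(36, to_ounces(5, 7))

# A birth at 37 weeks weighing exactly 5 lb 8 oz meets neither.
example_normal = outcome_flags(37, to_ounces(5, 8))

# Users often report only one of the two values.
example_partial = outcome_flags(39, None)
```

Keeping unreported values as `None` rather than `False` mirrors the abstract's distinction between pregnancies with a verified normal outcome and those with no identified adverse outcome.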
|
9
|
Automatically Identifying Twitter Users for Interventions to Support Dementia Family Caregivers: Annotated Data Set and Benchmark Classification Models (Preprint). JMIR Aging 2022; 5:e39547. [PMID: 36112408 PMCID: PMC9526111 DOI: 10.2196/39547] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/16/2022] [Revised: 07/08/2022] [Accepted: 07/08/2022] [Indexed: 12/21/2022]
Abstract
Background More than 6 million people in the United States have Alzheimer disease and related dementias, receiving help from more than 11 million family or other informal caregivers. A range of traditional interventions has been developed to support family caregivers; however, most of them have not been implemented in practice and remain largely inaccessible. While recent studies have shown that family caregivers of people with dementia use Twitter to discuss their experiences, methods have not been developed to enable the use of Twitter for interventions. Objective The objective of this study is to develop an annotated data set and benchmark classification models for automatically identifying a cohort of Twitter users who have a family member with dementia. Methods Between May 4 and May 20, 2021, we collected 10,733 tweets, posted by 8846 users, that mention a dementia-related keyword, a linguistic marker that potentially indicates a diagnosis, and a select familial relationship. Three annotators annotated 1 random tweet per user to distinguish those that indicate having a family member with dementia from those that do not. Interannotator agreement was 0.82 (Fleiss kappa). We used the annotated tweets to train and evaluate support vector machine and deep neural network classifiers. To assess the scalability of our approach, we then deployed automatic classification on unlabeled tweets that were continuously collected between May 4, 2021, and March 9, 2022. Results A deep neural network classifier based on a BERT (bidirectional encoder representations from transformers) model pretrained on tweets achieved the highest F1-score of 0.962 (precision=0.946 and recall=0.979) for the class of tweets indicating that the user has a family member with dementia. The classifier detected 128,838 tweets that indicate having a family member with dementia, posted by 74,290 users between May 4, 2021, and March 9, 2022—that is, approximately 7500 users per month. 
Conclusions Our annotated data set can be used to automatically identify Twitter users who have a family member with dementia, enabling the use of Twitter on a large scale to not only explore family caregivers’ experiences but also directly target interventions at these users.
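The F1-score reported in this abstract is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. A short sketch follows; the function name `f1_score` is chosen here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for the class of tweets indicating that the user
# has a family member with dementia.
dementia_f1 = f1_score(0.946, 0.979)
print(round(dementia_f1, 3))
```

Because the harmonic mean is pulled toward the lower of the two values, a classifier cannot reach a high F1-score by maximizing recall alone.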
|
10
|
Toward Using Twitter for PrEP-Related Interventions: An Automated Natural Language Processing Pipeline for Identifying Gay or Bisexual Men in the United States. JMIR Public Health Surveill 2022; 8:e32405. [PMID: 35468092 PMCID: PMC9086871 DOI: 10.2196/32405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2021] [Revised: 11/19/2021] [Accepted: 02/24/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND Pre-exposure prophylaxis (PrEP) is highly effective at preventing the acquisition of HIV. There is a substantial gap, however, between the number of people in the United States who have indications for PrEP and the number of them who are prescribed PrEP. Although Twitter content has been analyzed as a source of PrEP-related data (eg, barriers), methods have not been developed to enable the use of Twitter as a platform for implementing PrEP-related interventions. OBJECTIVE Men who have sex with men (MSM) are the population most affected by HIV in the United States. Therefore, the objectives of this study were to (1) develop an automated natural language processing (NLP) pipeline for identifying men in the United States who have reported on Twitter that they are gay, bisexual, or MSM and (2) assess the extent to which they demographically represent MSM in the United States with new HIV diagnoses. METHODS Between September 2020 and January 2021, we used the Twitter Streaming Application Programming Interface (API) to collect more than 3 million tweets containing keywords that men may include in posts reporting that they are gay, bisexual, or MSM. We deployed handwritten, high-precision regular expressions, designed to filter out noise and identify actual self-reports, on the tweets and their user profile metadata. We identified 10,043 unique users geolocated in the United States and drew upon a validated NLP tool to automatically identify their ages. RESULTS By manually distinguishing true- and false-positive self-reports in the tweets or profiles of 1000 (10%) of the 10,043 users identified by our automated pipeline, we established that our pipeline has a precision of 0.85. Among the 8756 users for which a US state-level geolocation was detected, 5096 (58.2%) were in the 10 states with the highest numbers of new HIV diagnoses. 
Among the 6240 users for which a county-level geolocation was detected, 4252 (68.1%) were in counties or states considered priority jurisdictions by the Ending the HIV Epidemic initiative. Furthermore, the age distribution of the users reflected that of MSM in the United States with new HIV diagnoses. CONCLUSIONS Our automated NLP pipeline can be used to identify MSM in the United States who may be at risk of acquiring HIV, laying the groundwork for using Twitter on a large scale to directly target PrEP-related interventions at this population.
|
11
|
ReportAGE: Automatically extracting the exact age of Twitter users based on self-reports in tweets. PLoS One 2022; 17:e0262087. [PMID: 35077484 PMCID: PMC8789116 DOI: 10.1371/journal.pone.0262087] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/19/2021] [Accepted: 12/17/2021] [Indexed: 11/18/2022]
Abstract
Advancing the utility of social media data for research applications requires methods for automatically detecting demographic information about social media study populations, including users' age. The objective of this study was to develop and evaluate a method that automatically identifies the exact age of users based on self-reports in their tweets. Our end-to-end automatic natural language processing (NLP) pipeline, ReportAGE, includes query patterns to retrieve tweets that potentially mention an age, a classifier to distinguish retrieved tweets that self-report the user's exact age ("age" tweets) and those that do not ("no age" tweets), and rule-based extraction to identify the age. To develop and evaluate ReportAGE, we manually annotated 11,000 tweets that matched the query patterns. Based on 1000 tweets that were annotated by all five annotators, inter-annotator agreement (Fleiss' kappa) was 0.80 for distinguishing "age" and "no age" tweets, and 0.95 for identifying the exact age among the "age" tweets on which the annotators agreed. A deep neural network classifier, based on a RoBERTa-Large pretrained transformer model, achieved the highest F1-score of 0.914 (precision = 0.905, recall = 0.942) for the "age" class. When the age extraction was evaluated using the classifier's predictions, it achieved an F1-score of 0.855 (precision = 0.805, recall = 0.914) for the "age" class. When it was evaluated directly on the held-out test set, it achieved an F1-score of 0.931 (precision = 0.873, recall = 0.998) for the "age" class. We deployed ReportAGE on a collection of more than 1.2 billion tweets, posted by 245,927 users, and predicted ages for 132,637 (54%) of them. Scaling the detection of exact age to this large number of users can advance the utility of social media data for research applications that do not align with the predefined age groupings of extant binary or multi-class classification approaches.
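The rule-based extraction step described above can be illustrated with a few regular expressions. This is a minimal sketch under stated assumptions: the patterns, the plausibility bounds (13 to 100), and the function name `extract_age` are all hypothetical and are not the query patterns or rules used by ReportAGE, which also relies on a classifier to filter out tweets that do not self-report an age.

```python
import re
from typing import Optional

# Hypothetical patterns for first-person age statements (not ReportAGE's own).
AGE_PATTERNS = [
    # "I'm 25 years old", "I am 31 yo"
    re.compile(r"\bi(?:['’]m| am)\s+(\d{1,3})\s*(?:years? old|yrs? old|yo)\b",
               re.IGNORECASE),
    # "I just turned 30"
    re.compile(r"\bi\s+(?:just\s+)?turned\s+(\d{1,3})\b", re.IGNORECASE),
    # "my 21st birthday"
    re.compile(r"\bmy\s+(\d{1,3})(?:st|nd|rd|th)\s+birthday\b", re.IGNORECASE),
]

def extract_age(tweet: str) -> Optional[int]:
    """Return the first plausible self-reported age in the tweet, else None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(tweet)
        if match:
            age = int(match.group(1))
            # Assumed plausibility bounds; discard values outside them.
            if 13 <= age <= 100:
                return age
    return None

print(extract_age("I'm 25 years old and already tired"))
print(extract_age("I am 25 minutes late"))
```

Patterns like these favor precision over recall, which is one reason a pipeline of this kind would pair them with a trained classifier rather than rely on rules alone.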
|
12
|
Toward Using Twitter Data to Monitor COVID-19 Vaccine Safety in Pregnancy: Proof-of-Concept Study of Cohort Identification. JMIR Form Res 2022; 6:e33792. [PMID: 34870607 PMCID: PMC8734607 DOI: 10.2196/33792] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/23/2021] [Revised: 11/15/2021] [Accepted: 11/22/2021] [Indexed: 01/19/2023]
Abstract
Background COVID-19 during pregnancy is associated with an increased risk of maternal death, intensive care unit admission, and preterm birth; however, many people who are pregnant refuse to receive COVID-19 vaccination because of a lack of safety data. Objective The objective of this preliminary study was to assess whether Twitter data could be used to identify a cohort for epidemiologic studies of COVID-19 vaccination in pregnancy. Specifically, we examined whether it is possible to identify users who have reported (1) that they received COVID-19 vaccination during pregnancy or the periconception period, and (2) their pregnancy outcomes. Methods We developed regular expressions to search for reports of COVID-19 vaccination in a large collection of tweets posted through the beginning of July 2021 by users who have announced their pregnancy on Twitter. To help determine if users were vaccinated during pregnancy, we drew upon a natural language processing (NLP) tool that estimates the timeframe of the prenatal period. For users who posted tweets with a timestamp indicating they were vaccinated during pregnancy, we drew upon additional NLP tools to help identify tweets that reported their pregnancy outcomes. Results We manually verified the content of tweets detected automatically, identifying 150 users who reported on Twitter that they received at least one dose of COVID-19 vaccination during pregnancy or the periconception period. We manually verified at least one reported outcome for 45 of the 60 (75%) completed pregnancies. Conclusions Given the limited availability of data on COVID-19 vaccine safety in pregnancy, Twitter can be a complementary resource for potentially increasing the acceptance of COVID-19 vaccination in pregnant populations. The results of this preliminary study justify the development of scalable methods to identify a larger cohort for epidemiologic studies.
|
13
|
A chronological and geographical analysis of personal reports of COVID-19 on Twitter from the UK. Digit Health 2022; 8:20552076221097508. [PMID: 35574580 PMCID: PMC9096830 DOI: 10.1177/20552076221097508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2020] [Accepted: 04/12/2022] [Indexed: 11/30/2022]
Abstract
Objective Given the uncertainty about the trends and extent of the rapidly evolving COVID-19 outbreak, and the lack of extensive testing in the United Kingdom, our understanding of COVID-19 transmission is limited. We proposed to use Twitter to identify personal reports of COVID-19 and to assess whether these data can serve as a source to help us understand and model the transmission and trajectory of COVID-19. Methods We used a natural language processing and machine learning framework. We collected tweets (excluding retweets) from the Twitter Streaming API that indicate that the user or a member of the user's household had been exposed to COVID-19. The tweets were required to be geo-tagged or have profile location metadata in the UK. Results We identified a high level of agreement between personal reports from Twitter and lab-confirmed cases by geographical region in the UK. Temporal analysis indicated that personal reports from Twitter appear up to 2 weeks before UK government lab-confirmed cases are recorded. Conclusions Analysis of tweets may indicate trends in COVID-19 in the UK and provide signals of geographical locations where resources may need to be targeted or where regional policies may need to be put in place to further limit the spread of COVID-19. It may also help inform policy makers of the restrictions in lockdown that are most effective or ineffective.
|
14
|
Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set. J Med Internet Res 2021; 23:e25314. [PMID: 33449904 PMCID: PMC7834613 DOI: 10.2196/25314] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 10/27/2020] [Revised: 12/14/2020] [Accepted: 12/14/2020] [Indexed: 12/29/2022]
Abstract
BACKGROUND In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. METHODS Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. RESULTS Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. 
Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state-level geolocations. CONCLUSIONS We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and US state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
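Interannotator agreement in this abstract is reported as Cohen κ, which corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below computes it from two label lists; the toy labels are invented for illustration and are not drawn from the study's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Proportion of items on which the annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy example (invented labels): 1 = self-report, 0 = not a self-report.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 1, 1, 0, 0, 0]
toy_kappa = cohen_kappa(annotator_1, annotator_2)
print(toy_kappa)
```

In the toy example the annotators agree on 6 of 8 items (0.75), chance agreement is 0.5, and κ is therefore 0.5, which shows how raw agreement overstates reliability when labels are balanced.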
|
15
|
An annotated data set for identifying women reporting adverse pregnancy outcomes on Twitter. Data Brief 2020; 32:106249. [PMID: 32944604 PMCID: PMC7481818 DOI: 10.1016/j.dib.2020.106249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2020] [Accepted: 08/25/2020] [Indexed: 10/29/2022]
Abstract
Despite the prevalence in the United States of miscarriage [1], stillbirth [2], and infant mortality associated with preterm birth and low birthweight [3], their causes remain largely unknown [4], [5], [6]. To advance the use of social media data as a complementary resource for epidemiology of adverse pregnancy outcomes, we present a data set of 6487 tweets that mention miscarriage, stillbirth, preterm birth or premature labor, low birthweight, neonatal intensive care, or fetal/infant loss in general. These tweets are a subset of 22,912 tweets retrieved by applying hand-written regular expressions to a database containing more than 400 million public tweets posted by more than 100,000 women who have announced their pregnancy on Twitter [7]. Two professional annotators labeled the 6487 tweets in a binary fashion, distinguishing those potentially reporting that the user has personally experienced the outcome ("outcome" tweets) from those that merely mention the outcome ("non-outcome" tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). The tweets annotated as "outcome" include 1318 women reporting miscarriage, 94 stillbirth, 591 preterm birth or premature labor, 171 low birthweight, 453 neonatal intensive care, and 356 fetal/infant loss in general. These "outcome" tweets can be used to explore patient experiences and perceptions of adverse pregnancy outcomes, and can direct researchers to the users' broader timelines (tweets posted by a user over time) for observational studies. Our past work demonstrates the analysis of timelines for selecting a study population [8] and conducting a case-control study [9] of users reporting that their child has a birth defect. For larger-scale studies, the full annotated corpus can be used to train supervised machine learning algorithms to automatically identify additional users reporting adverse pregnancy outcomes on Twitter. 
We used the annotated corpus to train feature-engineered and deep learning-based classifiers presented in "A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes" [10].
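The inter-annotator agreement reported above is Cohen's kappa, which compares observed agreement with the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' code; the function name `cohens_kappa` and the toy labels are assumptions made for illustration and are not drawn from the data set.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels, made up for illustration (not from the data set).
annotator_1 = ["outcome", "outcome", "non-outcome", "non-outcome", "outcome", "non-outcome"]
annotator_2 = ["outcome", "outcome", "non-outcome", "outcome", "outcome", "non-outcome"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In the toy example the annotators agree on 5 of 6 items, while chance agreement is 0.5, so kappa is well below the raw agreement rate.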
|
16
|
Towards Automatic Bot Detection in Twitter for Health-related Tasks. AMIA Jt Summits Transl Sci Proc 2020; 2020:136-141. [PMID: 32477632 PMCID: PMC7233076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
With the increasing use of social media data for health-related research, the credibility of the information from this source has been questioned, as the posts may not originate from personal accounts. While automatic bot detection approaches have been proposed, none have been evaluated on users posting health-related information. In this paper, we extend an existing bot detection system and customize it for health-related research. Using a dataset of Twitter users, we first show that the system, which was designed for political bot detection, underperforms when applied to health-related Twitter users. We then incorporate additional features and a statistical machine learning classifier to improve bot detection performance significantly. Our approach obtains an F1-score of 0.7 for the "bot" class, representing an improvement of 0.339. Our approach is customizable and generalizable for bot detection in other health-related social media cohorts.
|
17
|
Automatically Identifying Comparator Groups on Twitter for Digital Epidemiology of Pregnancy Outcomes. AMIA Jt Summits Transl Sci Proc 2020; 2020:317-325. [PMID: 32477651 PMCID: PMC7233041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Despite the prevalence of adverse pregnancy outcomes such as miscarriage, stillbirth, birth defects, and preterm birth, their causes are largely unknown. We seek to advance the use of social media for observational studies of pregnancy outcomes by developing a natural language processing pipeline for automatically identifying users from which to select comparator groups on Twitter. We annotated 2361 tweets by users who have announced their pregnancy on Twitter, which were used to train and evaluate supervised machine learning algorithms as a basis for automatically detecting women who have reported that their pregnancy had reached term and their baby was born at a normal weight. Upon further processing the tweet-level predictions of a majority voting-based ensemble classifier, the pipeline achieved a user-level F1-score of 0.933 (precision = 0.947, recall = 0.920). Our pipeline will be deployed to identify large comparator groups for studying pregnancy outcomes on Twitter.
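As a rough illustration of two ideas in this abstract, the sketch below shows majority voting over classifier predictions and the F1-score as the harmonic mean of precision and recall. It is not the authors' pipeline; the function names and the three example votes are hypothetical. The precision and recall values are the ones reported in the abstract.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers for one tweet.

    Assumes an odd number of classifiers and binary labels, so no ties occur.
    """
    return Counter(predictions).most_common(1)[0][0]

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical votes from three classifiers for a single tweet.
votes = ["positive", "negative", "positive"]
label = majority_vote(votes)

# Precision and recall as reported in the abstract.
f1 = f1_score(0.947, 0.920)
print(label, round(f1, 3))
```

Rounded to three decimals, the computed value agrees with the user-level F1-score of 0.933 stated in the abstract.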
|
18
|
Extending A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter to England, UK. medRxiv 2020:2020.05.05.20083436. [PMID: 32511492 PMCID: PMC7273260 DOI: 10.1101/2020.05.05.20083436] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/30/2022]
Abstract
The rapidly evolving COVID-19 pandemic presents challenges for actively monitoring its transmission. In this study, we extend a social media mining approach used in the US to automatically identify personal reports of COVID-19 on Twitter in England, UK. The findings indicate that a natural language processing and machine learning framework could help provide an early indication of the chronological and geographical distribution of COVID-19 in England.
|
19
|
Klein AZ, Magge A, O’Connor K, Cai H, Weissenbacher D, Gonzalez-Hernandez G. A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter. [PMID: 32511608 PMCID: PMC7276035 DOI: 10.1101/2020.04.19.20069948] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/24/2023]
Abstract
The rapidly evolving outbreak of COVID-19 presents challenges for actively monitoring its spread. In this study, we assessed a social media mining approach for automatically analyzing the chronological and geographical distribution of users in the United States reporting personal information related to COVID-19 on Twitter. The results suggest that our natural language processing and machine learning framework could help provide an early indication of the spread of COVID-19.
|
20
|
A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes. J Biomed Inform 2020; 112S:100076. [PMID: 34417007 DOI: 10.1016/j.yjbinx.2020.100076] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/22/2020] [Revised: 06/30/2020] [Accepted: 07/27/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND In the United States, 17% of pregnancies end in fetal loss: miscarriage or stillbirth. Preterm birth affects 10% of live births in the United States and is the leading cause of neonatal death globally. Preterm births with low birthweight are the second leading cause of infant mortality in the United States. Despite their prevalence, the causes of miscarriage, stillbirth, and preterm birth are largely unknown. OBJECTIVE The primary objectives of this study are to (1) assess whether women report miscarriage, stillbirth, and preterm birth, among others, on Twitter, and (2) develop natural language processing (NLP) methods to automatically identify users from which to select cases for large-scale observational studies. METHODS We handcrafted regular expressions to retrieve tweets that mention an adverse pregnancy outcome, from a database containing more than 400 million publicly available tweets posted by more than 100,000 users who have announced their pregnancy on Twitter. Two annotators independently annotated 8109 (one random tweet per user) of the 22,912 retrieved tweets, distinguishing those reporting that the user has personally experienced the outcome ("outcome" tweets) from those that merely mention the outcome ("non-outcome" tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). We used the annotated tweets to train and evaluate feature-engineered and deep learning-based classifiers. We further annotated 7512 (of the 8109) tweets to develop a generalizable, rule-based module designed to filter out reported speech (that is, posts containing what was said by others) prior to automatic classification. We performed an extrinsic evaluation assessing whether the reported speech filter could improve the detection of women reporting adverse pregnancy outcomes on Twitter.
RESULTS The tweets annotated as "outcome" include 1632 women reporting miscarriage, 119 stillbirth, 749 preterm birth or premature labor, 217 low birthweight, 558 NICU admission, and 458 fetal/infant loss in general. A deep neural network, BERT-based classifier achieved the highest overall F1-score (0.88) for automatically detecting "outcome" tweets (precision = 0.87, recall = 0.89), with an F1-score of at least 0.82 and a precision of at least 0.84 for each of the adverse pregnancy outcomes. Our reported speech filter significantly (P < 0.05) improved the accuracy of Logistic Regression (from 78.0% to 80.8%) and majority voting-based ensemble (from 81.1% to 82.9%) classifiers. Although the filter did not improve the F1-score of the BERT-based classifier, it did improve precision (a trade-off of recall that may be acceptable for automated case selection of more prevalent outcomes). Without the filter, reported speech is one of the main sources of errors for the BERT-based classifier. CONCLUSION This study demonstrates that (1) women do report their adverse pregnancy outcomes on Twitter, (2) our NLP pipeline can automatically identify users from which to select cases for large-scale observational studies, and (3) our reported speech filter would reduce the cost of annotating health-related social media data and can significantly improve the overall performance of feature-based classifiers.
|
21
|
Towards scaling Twitter for digital epidemiology of birth defects. NPJ Digit Med 2019; 2:96. [PMID: 31583284 PMCID: PMC6773753 DOI: 10.1038/s41746-019-0170-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/22/2019] [Accepted: 08/12/2019] [Indexed: 11/13/2022] Open
Abstract
Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms (feature-engineered and deep learning-based classifiers) that automatically distinguish tweets referring to the user's pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the "defect" class and 0.51 for the "possible defect" class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
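The under-sampling and over-sampling mentioned in this abstract can be sketched in a few lines. The example below is a simplified illustration only: it assumes random sampling and collapses the task to two classes, whereas the study has two positive classes, and the abstract does not say which sampling variants were used. The function names and toy data are hypothetical.

```python
import random

def undersample(majority, minority, seed=0):
    """Randomly drop majority-class items until both classes are equal in size.

    Assumes the majority list is at least as long as the minority list.
    """
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + minority

def oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class items until both classes are equal in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy data mirroring the 90/10 imbalance described in the abstract.
mentions = [("mention", i) for i in range(90)]
reports = [("defect", i) for i in range(10)]

under = undersample(mentions, reports)
over = oversample(mentions, reports)
print(len(under), len(over))
```

Under-sampling discards training data from the majority class, while over-sampling repeats minority examples; either should be applied to the training split only, so that evaluation reflects the original class distribution.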
|
22
|
An Analysis of a Twitter Corpus for Training a Medication Intake Classifier. AMIA Jt Summits Transl Sci Proc 2019; 2019:102-106. [PMID: 31258961 PMCID: PMC6568126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
While social media has evolved into a useful resource for studying medication-related information, observational studies of medications have continued to rely on other sources of data. Towards advancing the use of social media data for medication-related observational studies, we analyze an annotated corpus of 27,941 tweets designed for training machine learning algorithms to automatically detect users' medication intake. In particular, we assess how a baseline classifier trained on the general corpus (that is, on various types of medication) performs for specific types. For most types, the classifier performs significantly better than it does overall; however, for nervous system medications, it performs significantly worse. These results suggest that, while the general corpus may have utility for observational studies focusing on most types of medication, studying nervous system medications may benefit from training a classifier exclusively for this type. We will explore this data-level approach in future work.
|
23
|
Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J Biomed Inform 2018; 87:68-78. [PMID: 30292855 PMCID: PMC6295660 DOI: 10.1016/j.jbi.2018.10.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 04/12/2018] [Revised: 09/26/2018] [Accepted: 10/03/2018] [Indexed: 10/28/2022]
Abstract
BACKGROUND Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. OBJECTIVE The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. METHODS To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. RESULTS We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa).
Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. CONCLUSIONS Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
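To make the lexicon-plus-regular-expression retrieval step more concrete, here is a minimal sketch of how lexicon entries might be expanded into simple spelling variants and matched against tweets. It is an assumption-laden illustration, not the study's method: the three lexicon entries, the variant rule (space, hyphen, or no separator between words), the function names, and the example tweets are all hypothetical.

```python
import re

# Hypothetical lexicon entries; the study's actual lexicon is not reproduced here.
LEXICON = ["cleft palate", "spina bifida", "club foot"]

def entry_to_pattern(entry):
    """Allow a space, a hyphen, or nothing between the words of an entry."""
    words = [re.escape(w) for w in entry.split()]
    return r"[\s-]?".join(words)

# One case-insensitive pattern covering every entry and its generated variants.
PATTERN = re.compile(
    r"\b(?:" + "|".join(entry_to_pattern(e) for e in LEXICON) + r")\b",
    re.IGNORECASE,
)

def mentions_birth_defect(tweet):
    """True if the tweet matches any lexicon entry or generated variant."""
    return PATTERN.search(tweet) is not None

# Made-up example tweets for illustration.
tweets = [
    "Our son was born with a cleft-palate",
    "Reading about Spina Bifida awareness month",
    "she has a clubfoot but is doing great",
    "Finally finished the nursery today!",
]
matches = [t for t in tweets if mentions_birth_defect(t)]
print(len(matches))
```

A pattern like this only retrieves mentions; as the abstract notes, deciding whether a retrieved tweet reports the user's own child (a true positive) still required manual annotation.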
|