1
Tsao SF, Chen H, Butt ZA. Validating part of the social media infodemic listening conceptual framework using structural equation modelling. EClinicalMedicine 2024; 70:102544. [PMID: 38516101 PMCID: PMC10955635 DOI: 10.1016/j.eclinm.2024.102544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
Background The literature has identified various factors that promote or hinder people's intentions towards COVID-19 vaccination, and structural equation modelling (SEM) is a common approach to validate these associations. We propose a conceptual framework called social media infodemic listening (SoMeIL) for public health behaviours, hypothesizing that parameters retrieved from social media platforms can be used to infer people's intentions towards vaccination behaviours. This study preliminarily validates several components of the SoMeIL conceptual framework using SEM and Twitter data and examines the feasibility of using Twitter data in SEM research.
Methods A total of 2420 English tweets in Toronto or Ottawa, Ontario, Canada, were collected from March 8 to June 30, 2021. Confirmatory factor analysis and SEM were applied to validate the SoMeIL conceptual framework in this cross-sectional study.
Findings The results showed that sentiment scores, the log-numbers of favourites and retweets of a tweet, and the log-numbers of a user's favourites, followers, and public lists had significant direct associations with COVID-19 vaccination intention. The sentiment score of a tweet had the strongest relationship with the intention of COVID-19 vaccine uptake, whereas a user's number of followers had the weakest.
Interpretation The findings preliminarily validate several components of the SoMeIL conceptual framework by testing associations between self-reported COVID-19 vaccination intention and sentiment scores, the log-numbers of a tweet's favourites and retweets, and the log-numbers of users' favourites, followers, and public lists. This study also demonstrates the feasibility of using Twitter data in SEM research. Importantly, this study preliminarily validates the use of these six components as online reaction behaviours in the SoMeIL framework to infer the self-reported COVID-19 vaccination intentions of Canadian Twitter users in two cities.
Funding This study was supported by the 2023-24 Ontario Graduate Scholarship.
Affiliation(s)
- Shu-Feng Tsao
  - School of Public Health Sciences, Faculty of Health, University of Waterloo, Waterloo, Ontario, Canada
- Helen Chen
  - School of Public Health Sciences, Faculty of Health, University of Waterloo, Waterloo, Ontario, Canada
- Zahid A. Butt
  - School of Public Health Sciences, Faculty of Health, University of Waterloo, Waterloo, Ontario, Canada
2
O'Connor K, Golder S, Weissenbacher D, Klein AZ, Magge A, Gonzalez-Hernandez G. Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review. J Med Internet Res 2024; 26:e47923. [PMID: 38488839 PMCID: PMC10980991 DOI: 10.2196/47923] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 07/28/2023] [Accepted: 08/01/2023] [Indexed: 03/19/2024]
Abstract
BACKGROUND Patient health data collected from a variety of nontraditional resources, commonly referred to as real-world data, can be a key information source for health and social science research. Social media platforms, such as Twitter (Twitter, Inc), offer vast amounts of real-world data. An important aspect of incorporating social media data in scientific research is identifying the demographic characteristics of the users who posted those data. Age and gender are considered key demographics for assessing the representativeness of the sample and enable researchers to study subgroups and disparities effectively. However, deciphering the age and gender of social media users poses challenges.
OBJECTIVE This scoping review aims to summarize the existing literature on the prediction of the age and gender of Twitter users and provide an overview of the methods used.
METHODS We searched 15 electronic databases and carried out reference checking to identify relevant studies that met our inclusion criteria: studies that predicted the age or gender of Twitter users using computational methods. The screening process was performed independently by 2 researchers to ensure the accuracy and reliability of the included studies.
RESULTS Of the initial 684 studies retrieved, 74 (10.8%) studies met our inclusion criteria. Among these 74 studies, 42 (57%) focused on predicting gender, 8 (11%) focused on predicting age, and 24 (32%) predicted a combination of both age and gender. Gender prediction was predominantly approached as a binary classification task, with the reported performance of the methods ranging from 0.58 to 0.96 F1-score or 0.51 to 0.97 accuracy. Age prediction approaches varied in terms of classification groups, with a higher range of reported performance, ranging from 0.31 to 0.94 F1-score or 0.43 to 0.86 accuracy. The heterogeneous nature of the studies and the reporting of dissimilar performance metrics made it challenging to quantitatively synthesize results and draw definitive conclusions.
CONCLUSIONS Our review found that although automated methods for predicting the age and gender of Twitter users have evolved to incorporate techniques such as deep neural networks, a significant proportion of the attempts rely on traditional machine learning methods, suggesting that there is potential to improve the performance of these tasks by using more advanced methods. Gender prediction has generally achieved a higher reported performance than age prediction. However, the lack of standardized reporting of performance metrics or standard annotated corpora to evaluate the methods used hinders any meaningful comparison of the approaches. Potential biases stemming from the collection and labeling of data used in the studies were identified as a problem, emphasizing the need for careful consideration and mitigation of biases in future studies. This scoping review provides valuable insights into the methods used for predicting the age and gender of Twitter users, along with the challenges and considerations associated with these methods.
Affiliation(s)
- Karen O'Connor
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Su Golder
  - Department of Health Sciences, University of York, York, United Kingdom
- Davy Weissenbacher
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Ari Z Klein
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Arjun Magge
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
3
Chin MK, Đoàn LN, Russo RG, Roberts T, Persaud S, Huang E, Fu L, Kui KY, Kwon SC, Yi SS. Methods for retrospectively improving race/ethnicity data quality: a scoping review. Epidemiol Rev 2023; 45:127-139. [PMID: 37045807 DOI: 10.1093/epirev/mxad002] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/11/2022] [Revised: 02/27/2023] [Accepted: 04/04/2023] [Indexed: 04/14/2023]
Abstract
Improving race and ethnicity (hereafter, race/ethnicity) data quality is imperative to ensure underserved populations are represented in data sets used to identify health disparities and inform health care policy. We performed a scoping review of methods that retrospectively improve race/ethnicity classification in secondary data sets. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, searches were conducted in the MEDLINE, Embase, and Web of Science Core Collection databases in July 2022. A total of 2441 abstracts were dually screened, 453 full-text articles were reviewed, and 120 articles were included. Study characteristics were extracted and described in a narrative analysis. Six main method types for improving race/ethnicity data were identified: expert review (n = 9, 8%), name lists (n = 27, 23%), name algorithms (n = 55, 46%), machine learning (n = 14, 12%), data linkage (n = 9, 8%), and other (n = 6, 5%). The main racial/ethnic groups targeted for classification were Asian (n = 56, 47%) and White (n = 51, 43%). Some form of validation evaluation was included in 86 articles (72%). We discuss the strengths and limitations of different method types and potential harms of identified methods. Innovative methods are needed to better identify racial/ethnic subgroups, as are further validation studies. Accurately collecting and reporting disaggregated data by race/ethnicity are critical to address the systematic missingness of relevant demographic data that can erroneously guide policymaking and hinder the effectiveness of health care practices and intervention.
Affiliation(s)
- Matthew K Chin
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Lan N Đoàn
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Rienna G Russo
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Timothy Roberts
  - NYU Langone Health Sciences Library, NYU Grossman School of Medicine, New York, NY 10016, United States
- Sonia Persaud
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
  - Department of Health Policy and Management, CUNY School of Public Health & Health Policy, New York, NY 10027, United States
- Emily Huang
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Lauren Fu
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
  - Georgetown University, Washington DC 20007, United States
- Kiran Y Kui
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
  - Department of Epidemiology, Columbia Mailman School of Public Health, New York, NY 10032, United States
- Simona C Kwon
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
- Stella S Yi
  - Section for Health Equity, Department of Population Health, NYU Grossman School of Medicine, New York, NY 10016, United States
4
Weissenbacher D, Flores JI, Wang Y, O’Connor K, Rawal S, Stevens R, Gonzalez-Hernandez G. Automatic Cohort Determination from Twitter for HIV Prevention amongst Black and Hispanic Men. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2022; 2022:504-513. [PMID: 35854738 PMCID: PMC9285152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
Abstract
Recruiting people from diverse backgrounds to participate in health research requires intentional and culture-driven strategic efforts. In this study, we utilize publicly available Twitter posts to identify targeted populations to recruit for our HIV prevention study. Natural language processing and machine learning classification methods were used to find self-declarations of ethnicity, gender, age group, and sexually explicit language. Using the official Twitter API, we collected 47.4 million tweets posted over 8 months from two areas geo-centered around Los Angeles. Using available tools (Demographer and M3) to infer age and race, we identified 5,392 users as likely young Black or Hispanic men living in Los Angeles. We then collected and analyzed their timelines to automatically find sex-related tweets, yielding 2,166 users. Despite limited precision, our results suggest that it is possible to automatically identify users based on their demographic attributes and Twitter language characteristics for enrollment into epidemiological studies.
Affiliation(s)
- J. Ivan Flores
  - University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Yunwen Wang
  - University of Southern California, Los Angeles, California, USA
- Karen O’Connor
  - University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Robin Stevens
  - University of Southern California, Los Angeles, California, USA