1
Boligarla S, Laison EKE, Li J, Mahadevan R, Ng A, Lin Y, Thioub MY, Huang B, Ibrahim MH, Nasri B. Leveraging machine learning approaches for predicting potential Lyme disease cases and incidence rates in the United States using Twitter. BMC Med Inform Decis Mak 2023; 23:217. [PMID: 37845666 PMCID: PMC10578027 DOI: 10.1186/s12911-023-02315-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Accepted: 09/29/2023] [Indexed: 10/18/2023] Open
Abstract
BACKGROUND Lyme disease is one of the most commonly reported infectious diseases in the United States (US), accounting for more than [Formula: see text] of all vector-borne diseases in North America.
OBJECTIVE In this paper, self-reported tweets on Twitter were analyzed to predict potential Lyme disease cases and accurately assess incidence rates in the US.
METHODS The study was done in three stages: (1) Approximately 1.3 million tweets were collected and pre-processed to extract the most relevant Lyme disease tweets with geolocations. A subset of tweets was semi-automatically labelled as relevant or irrelevant to Lyme disease using a set of precise keywords, and the remaining portion was manually labelled, yielding a curated labelled dataset of 77,500 tweets. (2) This labelled dataset was used to train, validate, and test various combinations of NLP word embedding methods and prominent ML classification models, such as TF-IDF and logistic regression, Word2vec and XGBoost, and BERTweet, among others, to identify potential Lyme disease tweets. (3) Lastly, the presence of spatio-temporal patterns in the US over a 10-year period was studied.
RESULTS Preliminary results showed that BERTweet outperformed all tested NLP classifiers for identifying Lyme disease tweets, achieving the highest classification accuracy and F1-score of [Formula: see text]. There was also a consistent pattern indicating that the West and Northeast regions of the US had a higher tweet rate over time.
CONCLUSIONS We focused on the less-studied problem of using Twitter data as a surveillance tool for Lyme disease in the US. Several crucial findings have emerged from the study. First, there is a fairly strong correlation between classified tweet counts and Lyme disease counts, with both following similar trends. Second, in 2015 and early 2016, social media networks like Twitter were essential in raising popular awareness of Lyme disease. Third, counties with a high incidence rate did not necessarily have a high tweet rate, and vice versa. Fourth, BERTweet can be used as a reliable NLP classifier for detecting relevant Lyme disease tweets.
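The abstract compares classifiers by accuracy and F1-score. As a rough illustration only (not the authors' code), the sketch below computes both metrics for binary relevant/irrelevant labels. The function name `accuracy_and_f1` and the toy labels are assumptions made up for this example; they are not data from the study.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks a relevant tweet."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels (1 = relevant to Lyme disease, 0 = irrelevant).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Accuracy alone can look good on imbalanced data, which is presumably why the abstract reports F1 alongside it.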
Affiliation(s)
- Elda Kokoè Elolo Laison
  - Department of Social and Preventive Medicine, École de Santé Publique, University of Montreal, Montréal, Canada
- Jiaxin Li
  - Harvard Extension School, Harvard University, Cambridge, USA
- Raja Mahadevan
  - Harvard Extension School, Harvard University, Cambridge, USA
- Austen Ng
  - Harvard Extension School, Harvard University, Cambridge, USA
- Yangming Lin
  - Harvard Extension School, Harvard University, Cambridge, USA
- Mamadou Yamar Thioub
  - Department of Social and Preventive Medicine, École de Santé Publique, University of Montreal, Montréal, Canada
- Bruce Huang
  - Department of Decision Sciences, HEC Montréal, Montréal, Canada
- Mohamed Hamza Ibrahim
  - Department of Social and Preventive Medicine, École de Santé Publique, University of Montreal, Montréal, Canada
  - Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt
- Bouchra Nasri
  - Department of Social and Preventive Medicine, École de Santé Publique, University of Montreal, Montréal, Canada
2
Zeman P. Tick-Bite "Meteo"-Prevention: An Evaluation of Public Responsiveness to Tick Activity Forecasts Available Online. Life (Basel) 2023; 13:1908. [PMID: 37763311 PMCID: PMC10533051 DOI: 10.3390/life13091908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 09/06/2023] [Accepted: 09/11/2023] [Indexed: 09/29/2023] Open
Abstract
Until causal prophylaxis is available, the avoidance of ticks and personal protection provide the best insurance against contracting a tick-borne disease (TBD). To support public precaution, tick-activity forecasts (TAFs) based on weather projection are provided online for some regions/countries. This study, aimed at evaluating the efficacy of this preventative strategy, was conducted between 2015 and 2019, and included two countries where TAFs are issued regularly (Czech Republic, Germany) and two neighbouring countries for reference (Austria, Switzerland). Google Trends (GT) data were used to trace public concern with TAFs and related health information. GTs were compared with epidemiological data on TBD cases and tick bites, wherever available. Computer simulations of presumable effectiveness under various scenarios were performed. This study showed that public access to TAFs/preventive information is infrequent and not optimally distributed over the season. Interest arises very early in midwinter and then starts to fall in spring/summer when human-tick contacts culminate. Consequently, a greater number of TBD cases are contracted beyond the period of maximum public responsiveness to prevention guidance. Simulations, nevertheless, indicate that there is a potential for doubling the prevention yield if risk assessment, in addition to tick activity, subsumes the population's exposure, and a real-time surrogate is proposed.
Affiliation(s)
- Petr Zeman
  - Medical Laboratories, Konevova 205, 130 00 Prague, Czech Republic
3
Adams QH, Sun Y, Sun S, Wellenius GA. Internet searches and heat-related emergency department visits in the United States. Sci Rep 2022; 12:9031. [PMID: 35641815 PMCID: PMC9156736 DOI: 10.1038/s41598-022-13168-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/01/2021] [Accepted: 05/12/2022] [Indexed: 11/21/2022] Open
Abstract
Emerging research suggests that internet search patterns may provide timely, actionable insights into adverse health impacts from, and behavioral responses to, days of extreme heat, but few studies have evaluated this hypothesis, and none have done so across the United States. We used two-stage distributed lag nonlinear models to quantify the interrelationships between daily maximum ambient temperature, internet search activity as measured by Google Trends, and heat-related emergency department (ED) visits among adults with commercial health insurance in 30 US metropolitan areas during the warm seasons (May to September) from 2016 to 2019. Maximum daily temperature was positively associated with internet searches relevant to heat, and searches were in turn positively associated with heat-related ED visits. Moreover, models combining internet search activity and temperature had better predictive ability for heat-related ED visits compared to models with temperature alone. These results suggest that internet search patterns may be useful as a leading indicator of heat-related illness or stress.
Affiliation(s)
- Quinn H Adams
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
- Yuantong Sun
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
- Shengzhi Sun
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
  - Optum Labs Visiting Scholar, Eden Prairie, MN, USA
- Gregory A Wellenius
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
  - Optum Labs Visiting Scholar, Eden Prairie, MN, USA
4
Vaidyanathan U, Sun Y, Shekel T, Chou K, Galea S, Gabrilovich E, Wellenius GA. An evaluation of Internet searches as a marker of trends in population mental health in the US. Sci Rep 2022; 12:8946. [PMID: 35624317 PMCID: PMC9136741 DOI: 10.1038/s41598-022-12952-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/03/2021] [Accepted: 05/19/2022] [Indexed: 11/26/2022] Open
Abstract
The absence of continuous, real-time mental health assessment has made it challenging to quantify the impacts of the COVID-19 pandemic on population mental health. We examined publicly available, anonymized, aggregated data on weekly trends in Google searches related to anxiety, depression, and suicidal ideation from 2018 to 2020 in the US. We correlated these trends with (1) emergency department (ED) visits for mental health problems and suicide attempts, and (2) surveys of self-reported symptoms of anxiety, depression, and mental health care use. Search queries related to anxiety, depression, and suicidal ideation decreased sharply around March 2020, returning to pre-pandemic levels by summer 2020. Searches related to depression were correlated with the proportion of individuals reporting receiving therapy (r = 0.73), taking medication (r = 0.62) and having unmet mental healthcare needs (r = 0.57) on US Census Household Pulse Survey and modestly correlated with rates of ED visits for mental health conditions. Results were similar when considering instead searches for anxiety. Searches for suicidal ideation did not correlate with external variables. These results suggest aggregated data on Internet searches can provide timely and continuous insights into population mental health and complement other existing tools in this domain.
Affiliation(s)
- Yuantong Sun
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
- Sandro Galea
  - Boston University School of Public Health, Boston, MA, USA
- Gregory A Wellenius
  - Department of Environmental Health, Boston University School of Public Health, Boston, MA, USA
5
Wesson P, Hswen Y, Valdes G, Stojanovski K, Handley MA. Risks and Opportunities to Ensure Equity in the Application of Big Data Research in Public Health. Annu Rev Public Health 2022; 43:59-78. [PMID: 34871504 PMCID: PMC8983486 DOI: 10.1146/annurev-publhealth-051920-110928] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 11/09/2022]
Abstract
The big data revolution presents an exciting frontier to expand public health research, broadening the scope of research and increasing the precision of answers. Despite these advances, scientists must be vigilant against also advancing potential harms toward marginalized communities. In this review, we provide examples in which big data applications have (unintentionally) perpetuated discriminatory practices, while also highlighting opportunities for big data applications to advance equity in public health. Here, big data is framed in the context of the five Vs (volume, velocity, veracity, variety, and value), and we propose a sixth V, virtuosity, which incorporates equity and justice frameworks. Analytic approaches to improving equity are presented using social computational big data, fairness in machine learning algorithms, medical claims data, and data augmentation as illustrations. Throughout, we emphasize the biasing influence of data absenteeism and positionality and conclude with recommendations for incorporating an equity lens into big data research.
Affiliation(s)
- Paul Wesson
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California, USA
- Yulin Hswen
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, California, USA
- Gilmer Valdes
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, USA
  - Department of Radiation Oncology, University of California, San Francisco, California, USA
- Kristefer Stojanovski
  - Department of Health Behavior and Health Education, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
  - Department of Social, Behavioral and Population Sciences, School of Public Health and Tropical Medicine, Tulane University, New Orleans, Louisiana, USA
- Margaret A Handley
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, California, USA
  - Department of Medicine, University of California, San Francisco, California, USA
  - Zuckerberg San Francisco General Hospital and Trauma Center, San Francisco, California, USA
  - Partnerships for Research in Implementation Science for Equity (PRISE), University of California, San Francisco, California, USA
6
Kontowicz E, Brown G, Torner J, Carrel M, Baker KK, Petersen CA. Inclusion of environmentally themed search terms improves Elastic net regression nowcasts of regional Lyme disease rates. PLoS One 2022; 17:e0251165. [PMID: 35271589 PMCID: PMC8912246 DOI: 10.1371/journal.pone.0251165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2021] [Accepted: 02/01/2022] [Indexed: 11/19/2022] Open
Abstract
Lyme disease is the most widely reported vector-borne disease in the United States. 95% of confirmed human cases are reported in the Northeast and upper Midwest (25,778 total confirmed cases from Northeast and upper Midwest / 27,203 total US confirmed cases). Human cases typically occur in the spring and summer months when an infected nymph Ixodid tick takes a blood meal. Current federal surveillance strategies report data on an annual basis, leading to a lag of nearly a year in national data reporting. These lags in reporting make it difficult for public health agencies to assess and plan for the current burden of Lyme disease. Implementation of a nowcasting model, using historical data to predict current trends, provides a means for public health agencies to evaluate current Lyme disease burden and make timely priority-based budgeting decisions. The objective of the study was to develop and compare the performance of nowcasting models using free data from Google Trends and Centers for Disease Control and Prevention surveillance reports. We developed two sets of elastic net models for five regions of the United States: (1) using only monthly proportional hit data from the 21 disease symptom and tick-related terms, and (2) using monthly proportional hit data from terms identified via Google Correlate together with the disease symptom and vector terms. Elastic net models using the full-term list were highly accurate (Root Mean Square Error: 0.74, Mean Absolute Error: 0.52, R²: 0.97) for four of the five regions of the United States and improved accuracy 1.33-fold while reducing error 0.5-fold compared to predictions from models using disease symptom and vector terms alone. Many of the terms included and found to be important for model performance were environmentally related. These models can be implemented to help local and state public health agencies accurately monitor Lyme disease burden during times of reporting lag from federal public health reporting agencies.
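The abstract reports model fit as root mean square error, mean absolute error, and R². The sketch below shows how these three metrics are commonly computed from observed and predicted rates. It is illustrative only: the function name `regression_metrics` and the toy numbers are assumptions, and R² is taken here to be the coefficient of determination, which may differ from the definition the authors used.

```python
import math


def regression_metrics(observed, predicted):
    """Return (RMSE, MAE, R^2) for paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    r2 = 1 - ss_res / ss_tot                       # coefficient of determination
    return rmse, mae, r2


# Hypothetical monthly rates, not values from the study.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")
```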
Affiliation(s)
- Eric Kontowicz
  - Department of Epidemiology, College of Public Health, University of Iowa, Iowa City, Iowa, United States of America
  - Center for Emerging Infectious Diseases, University of Iowa Research Park, Coralville, Iowa, United States of America
- Grant Brown
  - Department of Biostatistics, College of Public Health, University of Iowa, Iowa City, Iowa, United States of America
- James Torner
  - Department of Epidemiology, College of Public Health, University of Iowa, Iowa City, Iowa, United States of America
- Margaret Carrel
  - Department of Geographical and Sustainability Sciences, College of Liberal Arts & Sciences, University of Iowa, Iowa City, Iowa, United States of America
- Kelly K. Baker
  - Department of Occupational and Environmental Health, College of Public Health, University of Iowa, Iowa City, United States of America
- Christine A. Petersen
  - Department of Epidemiology, College of Public Health, University of Iowa, Iowa City, Iowa, United States of America
  - Center for Emerging Infectious Diseases, University of Iowa Research Park, Coralville, Iowa, United States of America
  - Immunology Program, Carver College of Medicine, University of Iowa, Iowa City, Iowa, United States of America
7
Tran T, Porter WT, Salkeld DJ, Prusinski MA, Jensen ST, Brisson D. Estimating disease vector population size from citizen science data. J R Soc Interface 2021; 18:20210610. [PMID: 34814732 PMCID: PMC8611339 DOI: 10.1098/rsif.2021.0610] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/24/2022] Open
Abstract
Citizen science projects have the potential to address hypotheses requiring extremely large datasets that cannot be collected with the financial and labour constraints of most scientific projects. Data collection by the general public could expand the scope of scientific enquiry if these data accurately capture the system under study. However, data collection inconsistencies by the untrained public may result in biased datasets that do not accurately represent the natural world. In this paper, we harness the availability of scientific and public datasets of the Lyme disease tick vector to identify and account for biases in citizen science tick collections. Estimates of tick abundance from the citizen science dataset correspond moderately with estimates from direct surveillance but exhibit consistent biases. These biases can be mitigated by including factors that may impact collector participation or effort in statistical models, which, in turn, result in more accurate estimates of tick population sizes. Accounting for collection biases within large-scale, public participation datasets could update species abundance maps and facilitate using the wealth of citizen science data to answer scientific questions at scales that are not feasible with traditional datasets.
Affiliation(s)
- Tam Tran
  - Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
- W Tanner Porter
  - Pathogen Genomics Division, Translational Genomics Research Institute, Flagstaff, AZ 86005, USA
- Daniel J Salkeld
  - Department of Biology, Colorado State University, Fort Collins, CO 80523, USA
- Melissa A Prusinski
  - Bureau of Communicable Disease Control, New York State Department of Health, Albany, NY 12237, USA
- Shane T Jensen
  - Department of Statistics, The Wharton School of the University of Pennsylvania, Philadelphia, PA 19104, USA
- Dustin Brisson
  - Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
8
Naik H, Johnson MDD, Johnson MR. Internet Interest in Colon Cancer Following the Death of Chadwick Boseman: Infoveillance Study. J Med Internet Res 2021; 23:e27052. [PMID: 34128824 PMCID: PMC8277405 DOI: 10.2196/27052] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/08/2021] [Revised: 05/13/2021] [Accepted: 05/24/2021] [Indexed: 12/26/2022] Open
Abstract
Background Compared with White Americans, Black Americans have higher colon cancer mortality rates but lower up-to-date screening rates. Chadwick Boseman was a prominent Black American actor who died of colon cancer on August 28, 2020. As announcements of celebrity diagnoses often result in increased awareness, Boseman’s death may have resulted in greater interest in colon cancer on the internet, particularly among Black Americans.
Objective This study aims to quantify the impact of Chadwick Boseman’s death on web-based search interest in colon cancer and determine whether there was an increase in interest in regions of the United States with a greater proportion of Black Americans.
Methods We conducted an infoveillance study using Google Trends (GT) and Wikipedia pageview analysis. Using an autoregressive integrated moving average algorithm, we forecasted the weekly relative search volume (RSV) for GT search topics and terms related to colon cancer that would have been expected had his death not occurred and compared it with observed RSV data. This analysis was also conducted for the number of page views on the Wikipedia page for colorectal cancer. We then delineated GT RSV data for the term colon cancer for states and metropolitan areas in the United States and determined how the RSV values for these regions correlated with the percentage of Black Americans in that region. Differences in these correlations before and after Boseman’s death were compared to determine whether there was a shift in the racial demographics of the individuals conducting the searches.
Results The observed RSVs for the topics colorectal cancer and colon cancer screening increased by 598% and 707%, respectively, and were on average 121% (95% CI 72%-193%) and 256% (95% CI 35%-814%) greater than expected during the first 3 months following Boseman’s death. Daily Wikipedia page view volume during the 2 months following Boseman’s death was on average 1979% (95% CI 1375%-2894%) greater than expected, and it was estimated that this represented 547,354 (95% CI 497,708-585,167) excess Wikipedia page views. Before Boseman’s death, there were negative correlations between the percentage of Black Americans living in a state or metropolitan area and the RSV for colon cancer in that area (r=−0.18 and r=−0.05, respectively). However, in the 2 weeks following his death, there were positive correlations between the RSV for colon cancer and the percentage of Black Americans per state and per metropolitan area (r=0.73 and r=0.33, respectively). These changes persisted for 4 months and were all statistically significant (P<.001).
Conclusions There was a significant increase in web-based activity related to colon cancer following Chadwick Boseman’s death, particularly in areas with a higher proportion of Black Americans. This reflects a heightened public awareness that can be leveraged to further educate the public.
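The results above compare observed values against a forecast of what would have been expected, reporting both total excess page views and an average percentage above expected. A minimal sketch of that comparison step follows. It assumes the expected series has already been produced by some forecasting model (the study used ARIMA, which is not reproduced here); the function name `excess_over_expected` and the numbers are hypothetical.

```python
def excess_over_expected(observed, expected):
    """Return (total excess, mean percent above expected) for paired series."""
    if len(observed) != len(expected) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected values must be positive")
    total_excess = sum(o - e for o, e in zip(observed, expected))
    # Percent above expected is computed per period, then averaged.
    percents = [100.0 * (o - e) / e for o, e in zip(observed, expected)]
    return total_excess, sum(percents) / len(percents)


# Hypothetical daily page views, not values from the study.
observed_views = [300.0, 150.0]
expected_views = [100.0, 100.0]
total_excess, mean_pct = excess_over_expected(observed_views, expected_views)
print(f"excess={total_excess:.0f} views, {mean_pct:.0f}% above expected on average")
```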
Affiliation(s)
- Hiten Naik
  - Department of Medicine, University of British Columbia, Vancouver, BC, Canada
9
Kamiński M, Kręgielska-Narożna M, Bogdański P. Determination of the Popularity of Dietary Supplements Using Google Search Rankings. Nutrients 2020; 12:E908. [PMID: 32224928 PMCID: PMC7231191 DOI: 10.3390/nu12040908] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 03/01/2020] [Revised: 03/19/2020] [Accepted: 03/25/2020] [Indexed: 12/20/2022] Open
Abstract
The internet provides access to information about dietary supplements and allows their easy purchase. We aimed to rank the interest of Google users in dietary supplements and to determine the changes that occurred in their popularity from 2004 to 2019. We used Google Trends to generate data over time on regional interest in dietary supplements (n = 200). We categorized each included supplement and calculated the interest in all topics in proportion to the relative search volume (RSV) of "lutein". We analyzed the trends over time of all topics and categories. Globally, the topics with the highest popularity were "magnesium", which was 23.72 times more popular than "lutein", "protein" (15.22 times more popular), and "iron" (15.12). The categories of supplements receiving most interest were protein (9.64), mineral (5.24), and vitamin (3.47). The RSV of seven categories of topics (amino acid, bacterial, botanical, fiber, mineral, protein, and vitamin) increased over time while two categories (enzyme and fat or fatty acid) saw a drop in their RSV. Overall, 119 topics saw an increase in interest over time, 19 remained stable, and 62 saw interest in them decrease. Google Trends provides insights into e-discourse and enables analysis of the differences in popularity of certain topics across countries and over time.
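The abstract expresses each topic's popularity as a multiple of the relative search volume (RSV) of a reference topic, "lutein". The sketch below illustrates that normalization step under the assumption that a mean RSV per topic is already available. The function name `relative_popularity`, the topic names `topic_a` and `topic_b`, and all values are made up for illustration.

```python
def relative_popularity(rsv_by_topic, reference="lutein"):
    """Express each topic's RSV as a multiple of the reference topic's RSV."""
    ref = rsv_by_topic[reference]
    if ref <= 0:
        raise ValueError("reference RSV must be positive")
    return {topic: rsv / ref for topic, rsv in rsv_by_topic.items()}


# Hypothetical mean RSV values, not data from the study.
mean_rsv = {"lutein": 2.0, "topic_a": 10.0, "topic_b": 1.0}
ratios = relative_popularity(mean_rsv)
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ranking, ratios)
```

A shared reference term is useful because Google Trends reports scaled rather than absolute volumes, so ratios against one anchor make topics comparable.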
Affiliation(s)
- Mikołaj Kamiński
  - Department of the Treatment of Obesity and Metabolic Disorders, and of Clinical Dietetics, Poznań University of Medical Sciences, Szamarzewskiego 82/84, 60-569 Poznań, Poland