26
Realistic precision and accuracy of online experiment platforms, web browsers, and devices. Behav Res Methods 2020; 53:1407-1425. [PMID: 33140376] [PMCID: PMC8367876] [DOI: 10.3758/s13428-020-01501-5]
Abstract
Due to increasing ease of use and ability to quickly collect large samples, online behavioural research is currently booming. With this popularity, it is important that researchers are aware of who online participants are, and what devices and software they use to access experiments. While it is somewhat obvious that these factors can impact data quality, the magnitude of the problem remains unclear. To understand how these characteristics impact experiment presentation and data quality, we performed a battery of automated tests on a number of realistic set-ups. We investigated how different web-building platforms (Gorilla v.20190828, jsPsych v6.0.5, Lab.js v19.1.0, and psychoJS/PsychoPy3 v3.1.5), browsers (Chrome, Edge, Firefox, and Safari), and operating systems (macOS and Windows 10) impact display time across 30 different frame durations for each software combination. We then employed a robot actuator in realistic set-ups to measure response recording across the aforementioned platforms, and between different keyboard types (desktop and integrated laptop). Finally, we analysed data from over 200,000 participants on their demographics, technology, and software to provide context to our findings. We found that modern web platforms provide reasonable accuracy and precision for display duration and manual response time, and that no single platform stands out as the best in all features and conditions. In addition, our analysis of online participants shows what equipment they are likely to use.
27
Godinho A, Cunningham JA, Schell C. The particular case of conducting addiction intervention research on Mechanical Turk. Addiction 2020; 115:1971-1972. [PMID: 32427392] [DOI: 10.1111/add.15097]
28
Ipsen C, Kurth N, Hall J. Evaluating MTurk as a recruitment tool for rural people with disabilities. Disabil Health J 2020; 14:100991. [PMID: 32988778] [DOI: 10.1016/j.dhjo.2020.100991]
Abstract
BACKGROUND Recruitment of people with disabilities often occurs through disability organizations, advocacy groups, service providers, and patient registries. Recruitment that relies exclusively on established relationships can produce samples that may miss important information. The MTurk online marketplace offers a convenient option for recruitment. OBJECTIVE The paper compares samples recruited through (1) conventional and (2) MTurk methods to better understand how these samples contrast with one another and with national estimates of people with disabilities. METHODS In 2019, researchers recruited 1374 participants through conventional methods and 758 through MTurk to complete the National Survey on Health and Disability (NSHD). We analyzed sample differences between recruitment groups with t-tests, chi-square tests, and logistic regression. RESULTS With the exception of race/ethnicity, the conventional and MTurk samples were significantly different on several dimensions including age, gender, education level, marital status, children living at home, and sexual orientation. The MTurk sample was overrepresented in lower income brackets. A significantly higher percentage of the conventional sample received SSI, SSDI, or both, compared to the MTurk sample (36.2% vs 12.8%) and had significantly higher rates of insurance coverage. Comparisons with American Community Survey data show that the conventional and MTurk samples each aligned more closely with the general population of people with disabilities on different characteristics. CONCLUSIONS MTurk is a viable complement to conventional recruitment methods, but it should not be a replacement. A combination of strategies builds a more robust dataset that allows for more varied examination of issues relevant to people with disabilities.
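To gauge the size of group gaps like the SSI/SSDI difference reported above, a two-proportion z-test is the usual tool. A minimal sketch follows; the counts are reconstructed from the published percentages (36.2% of 1374 vs 12.8% of 758), not taken from the authors' raw data, so the result is approximate.

```python
# Sketch: two-proportion z-test of the SSI/SSDI gap between samples.
# Counts are back-calculated from reported percentages (approximate).
from statsmodels.stats.proportion import proportions_ztest

count = [round(0.362 * 1374), round(0.128 * 758)]  # ~497 vs ~97 benefit recipients
nobs = [1374, 758]                                  # conventional vs MTurk sample sizes

z, p = proportions_ztest(count, nobs)
print(f"z = {z:.2f}, p = {p:.3g}")  # a large z is consistent with the reported significant gap
```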
29
Examining the effectiveness of an online program to cultivate mindfulness and self-compassion skills (Mind-OP): Randomized controlled trial on Amazon's Mechanical Turk. Behav Res Ther 2020; 134:103724. [PMID: 32942203] [DOI: 10.1016/j.brat.2020.103724]
Abstract
OBJECTIVES The demand for effective psychological treatments for depression, anxiety, and heightened stress is far outstripping their supply. Accordingly, internet-delivered self-help interventions offer hope to many people, as they can be accessed easily and at a fraction of the price of face-to-face options. Mindfulness and self-compassion are particularly exciting approaches, as evidence suggests interventions that cultivate these skills are effective in reducing depression, anxiety, and heightened stress. We examined the effectiveness of a newly developed program that combines mindfulness, self-compassion, and goal-setting exercises into a brief self-guided intervention (Mind-OP). The secondary aim of this study was to investigate the feasibility of conducting a randomized controlled trial entirely on a popular crowdsourcing platform, Amazon's Mechanical Turk (MTurk). METHODS We randomized 456 participants reporting heightened depression, anxiety, or stress to one of two conditions: the 4-week Mind-OP intervention (n = 227) or an active control condition (n = 229) in which participants watched nature videos superimposed onto relaxing meditation music for four consecutive weeks. We administered measures of anxiety, depression, perceived stress, dispositional and state mindfulness, self-compassion, and nonattachment. RESULTS Intent-to-treat and per-protocol analyses revealed that, compared to participants in the control condition, participants in the Mind-OP intervention condition reported significantly less anxiety and stress at the end of the trial, as well as significantly greater mindfulness, self-compassion, and nonattachment. CONCLUSIONS Mind-OP appears effective in reducing anxiety symptoms and perceived stress among MTurk participants. We highlight issues (e.g., attrition) related to the feasibility of conducting randomized trials on crowdsourcing platforms such as MTurk.
30
Ogletree AM, Katz B. How Do Older Adults Recruited Using MTurk Differ From Those in a National Probability Sample? Int J Aging Hum Dev 2020; 93:700-721. [PMID: 32683886] [DOI: 10.1177/0091415020940197]
Abstract
A growing number of studies within the field of gerontology have included samples recruited from Amazon's Mechanical Turk (MTurk), an online crowdsourcing portal. While some research has examined how younger adult participants recruited through other means may differ from those recruited using MTurk, little work has addressed this question with older adults specifically. In the present study, we examined how older adults recruited via MTurk might differ from those recruited via a national probability sample, the Health and Retirement Study (HRS), on a battery of outcomes related to health and cognition. Using a Latin-square design, we examined the relationship between recruitment time, remuneration amount, and measures of cognitive functioning. We found substantial differences between our MTurk sample and the participants within the HRS, most notably within measures of verbal fluency and analogical reasoning. Additionally, remuneration amount was related to differences in time to complete recruitment, particularly at the lowest remuneration level, where recruitment completion required between 138 and 485 additional hours. While the general consensus has been that MTurk samples are a reasonable proxy for the larger population, this work suggests that researchers should be wary of overgeneralizing research conducted with older adults recruited through this portal.
31
Bridges D, Pitiot A, MacAskill MR, Peirce JW. The timing mega-study: comparing a range of experiment generators, both lab-based and online. PeerJ 2020; 8:e9414. [PMID: 33005482] [PMCID: PMC7512138] [DOI: 10.7717/peerj.9414]
Abstract
Many researchers in the behavioral sciences depend on research software that presents stimuli and records response times with sub-millisecond precision. There are a large number of software packages with which to conduct these behavioral experiments and measure response times and performance of participants. Very little information is available, however, on what timing performance they achieve in practice. Here we report a wide-ranging study looking at the precision and accuracy of visual and auditory stimulus timing and response times, measured with a Black Box Toolkit. We compared a range of popular packages: PsychoPy, E-Prime®, NBS Presentation®, Psychophysics Toolbox, OpenSesame, Expyriment, Gorilla, jsPsych, Lab.js and Testable. Where possible, the packages were tested on Windows, macOS, and Ubuntu, and in a range of browsers for the online studies, to try to identify common patterns in performance. Among the lab-based experiments, Psychophysics Toolbox, PsychoPy, Presentation and E-Prime provided the best timing, all with mean precision under 1 millisecond across the visual, audio and response measures. OpenSesame had slightly less precision across the board, most notably in audio stimuli, and Expyriment had rather poor precision. Across operating systems, the pattern was that precision was generally very slightly better under Ubuntu than Windows, and that macOS was the worst, at least for visual stimuli, for all packages. Online studies did not deliver the same level of precision as lab-based systems, with slightly more variability in all measurements. That said, PsychoPy and Gorilla, broadly the best performers, were achieving very close to millisecond precision on several browser/operating system combinations. For response times (measured using a high-performance button box), most of the packages achieved precision under 10 ms in all browsers, with PsychoPy achieving precision under 3.5 ms in all. There was considerable variability between OS/browser combinations, especially in audio-visual synchrony, which is the least precise aspect of the browser-based experiments. Nonetheless, the data indicate that online methods can be suitable for a wide range of studies, with due thought about the sources of variability that result. The results, from over 110,000 trials, highlight the wide range of timing qualities that can occur even in these dedicated software packages for the task. We stress the importance of scientists making their own timing validation measurements for their own stimuli and computer configuration.
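A minimal sketch of the two summary statistics this abstract leans on: accuracy (mean signed error between measured and requested durations) and precision (the standard deviation of that error). The timing values below are invented for illustration, not drawn from the paper.

```python
# Sketch: accuracy vs precision for one package/OS/browser cell.
import numpy as np

requested_ms = np.full(1000, 200.0)  # e.g., 1000 trials of a 200 ms stimulus
# Hypothetical measurements: a 5 ms constant lag with 0.8 ms jitter.
measured_ms = requested_ms + np.random.normal(5.0, 0.8, size=1000)

error = measured_ms - requested_ms
accuracy = error.mean()         # constant offset; can often be corrected post hoc
precision = error.std(ddof=1)   # trial-to-trial variability; the harder problem

print(f"accuracy (mean error): {accuracy:.2f} ms; precision (SD): {precision:.2f} ms")
```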
32
Ellis JD, Grekin ER, Dekeyser D, Partridge T. Using an Online Platform to Administer the Single-Session Point Subtraction Aggression Paradigm: An Initial Examination of Feasibility and Validity. Assessment 2020; 28:310-321. [PMID: 32659105] [DOI: 10.1177/1073191120940042]
Abstract
Online platforms represent a cost-effective option for data collection; however, it is unclear whether online administration of certain kinds of tasks (e.g., behavioral measures of aggression) poses validity threats. The present study provided a preliminary examination of effort (as indexed by total number of presses), differential drop-out, and believability of an online version of the single-session point subtraction aggression paradigm (PSAP). Two subsamples of participants were recruited: a sample recruited through Amazon's Mechanical Turk (n = 758) and an in-person undergraduate sample (n = 88). All participants completed the PSAP, along with measures of trait hostility and state anger. The online sample did not differ from the in-person sample on effort (i.e., total number of presses), and did not find the task less believable. Higher scores on state anger were associated with lower likelihood of beginning the online PSAP, but were not associated with prematurely closing the task. State anger was related to aggressive responding on the PSAP. Limitations of the online PSAP and considerations for future research are discussed.
33
Buhrmester MD, Talaifar S, Gosling SD. An Evaluation of Amazon's Mechanical Turk, Its Rapid Rise, and Its Effective Use. Perspect Psychol Sci 2019; 13:149-154. [PMID: 29928846] [DOI: 10.1177/1745691617706516]
Abstract
Over the past 2 decades, many social scientists have expanded their data-collection capabilities by using various online research tools. In the 2011 article "Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?" in Perspectives on Psychological Science, Buhrmester, Kwang, and Gosling introduced researchers to what was then considered to be a promising but nascent research platform. Since then, thousands of social scientists from seemingly every field have conducted research using the platform. Here, we reflect on the impact of Mechanical Turk on the social sciences and our article's role in its rise, provide the newest data-driven recommendations to help researchers effectively use the platform, and highlight other online research platforms worth consideration.
34
Reuter K, Zhu Y, Angyan P, Le N, Merchant AA, Zimmer M. Public Concern About Monitoring Twitter Users and Their Conversations to Recruit for Clinical Trials: Survey Study. J Med Internet Res 2019; 21:e15455. [PMID: 31670698] [PMCID: PMC6914244] [DOI: 10.2196/15455]
Abstract
Background Social networks such as Twitter offer the clinical research community a novel opportunity for engaging potential study participants based on user activity data. However, the availability of public social media data has led to new ethical challenges about respecting user privacy and the appropriateness of monitoring social media for clinical trial recruitment. Researchers have voiced the need for involving users’ perspectives in the development of ethical norms and regulations. Objective This study examined the attitudes and level of concern among Twitter users and nonusers about using Twitter for monitoring social media users and their conversations to recruit potential clinical trial participants. Methods We used two online methods for recruiting study participants: the open survey was (1) advertised on Twitter between May 23 and June 8, 2017, and (2) deployed on TurkPrime, a crowdsourcing data acquisition platform, over the same period. Eligible participants were adults, 18 years of age or older, who lived in the United States. People with and without Twitter accounts were included in the study. Results Of the 603 respondents, 94 (15.6%) were recruited on Twitter and 509 (84.4%) on TurkPrime. While nearly half the respondents indicated agreement that social media monitoring constitutes a form of eavesdropping that invades their privacy, over one-third disagreed and nearly 1 in 5 had no opinion. A chi-square test revealed a positive relationship between respondents’ general privacy concern and their average concern about Internet research (P<.005). We found associations between respondents’ Twitter literacy and their concerns about researchers’ ability to monitor their Twitter activity for clinical trial recruitment (P=.001), whether they consider Twitter monitoring for clinical trial recruitment to be eavesdropping (P<.001), and whether they consider it an invasion of privacy (P=.003). As Twitter literacy increased, so did people’s concerns about researchers monitoring Twitter activity. Our data support the previously suggested use of a nonexceptionalist methodology for assessing social media in research, insofar as social media-based recruitment does not need to be considered exceptional and, for most respondents, is considered preferable to traditional in-person interventions at physical clinics. The expressed attitudes were highly contextual, depending on factors such as the type of disease or health topic (eg, HIV/AIDS vs obesity vs smoking), the entity or person monitoring users on Twitter, and the monitored information. Conclusions The data and findings from this study contribute to the critical dialogue with the public about the use of social media in clinical research. The findings suggest that most users do not think that monitoring Twitter for clinical trial recruitment constitutes inappropriate surveillance or a violation of privacy. However, researchers should remain mindful that some participants might find social media monitoring problematic when connected with certain conditions or health topics. Further research should isolate factors that influence the level of concern among social media users across platforms and populations and inform the development of more clear and consistent guidelines.
35
Cunningham JA, Godinho A, Bertholet N. Outcomes of two randomized controlled trials, employing participants recruited through Mechanical Turk, of Internet interventions targeting unhealthy alcohol use. BMC Med Res Methodol 2019; 19:124. [PMID: 31200648] [PMCID: PMC6570877] [DOI: 10.1186/s12874-019-0770-4]
Abstract
Background Two randomized controlled trials (RCTs) were conducted to explore the utility of the Mechanical Turk (MTurk) crowdsourcing platform for conducting rapid trials evaluating online interventions for unhealthy alcohol use. Methods Both trials employed a staged recruitment procedure in which participants who drank in an unhealthy fashion were identified using a baseline survey and then invited to take part in a 6-month follow-up. Participants in both trials were randomized to receive one of several different online interventions or to a no-intervention control condition. In study 1, the online interventions were password protected and only those who accessed the study portal were randomized to condition. In study 2, participants were directed to free-of-charge interventions and asked to send a screenshot of the intervention to demonstrate that they had complied. Results Participants reporting unhealthy alcohol use were recruited fairly rapidly. Large numbers of screeners were completed (Study 1: n = 4910; Study 2: n = 5812), found eligible (Study 1: n = 3741; Study 2: n = 4095), and randomized to condition (Study 1: n = 511; Study 2: n = 878). Fair follow-up rates were observed at 6 months for each study (Study 1: 82%; Study 2: 66%). Neither trial was able to clearly demonstrate that providing access to the online interventions led to increased reductions in alcohol use as compared to the control group. Conclusions While recruitment through a crowdsourcing platform is rapid and relatively low cost, it is possible that the lack of impact of the online websites employed in these trials could be due to the source of participants rather than a lack of efficacy of the interventions. Trial registration ClinicalTrials.gov # NCT02977026 and NCT03060135.
36
Abstract
Crowdsourcing services, such as MTurk, have opened a large pool of participants to researchers. Unfortunately, it can be difficult to confidently acquire a sample that matches a given demographic, psychographic, or behavioral dimension. This problem exists because little information is known about individual participants and because some participants are motivated to misrepresent their identity with the goal of financial reward. Although online workers do not typically display a greater-than-average level of dishonesty, when researchers overtly request that only a certain population take part in an online study, a nontrivial portion misrepresent their identity. In this study, a proposed system is tested that researchers can use to quickly, fairly, and easily screen participants on any dimension. In contrast to an overt request, the reported system results in significantly fewer (near zero) instances of participant misrepresentation. Tests for misrepresentation were conducted using a large database of past participant records (~45,000 unique workers). This research presents and tests an important tool for the increasingly prevalent practice of online data collection.
37
Generalizability of Total Worker Health® Online Training for Young Workers. Int J Environ Res Public Health 2019; 16:577. [PMID: 30781514] [PMCID: PMC6406752] [DOI: 10.3390/ijerph16040577]
Abstract
Young workers (under 25 years old) are at risk of workplace injuries due to inexperience, high-risk health behaviors, and a lack of knowledge about workplace hazards. Training based on Total Worker Health® (TWH) principles can improve their knowledge of and ability to identify hazards associated with work organization and environment. In this study, we assessed changes in knowledge and behavior following an online safety and health training in two groups by collecting information on demographic characteristics, knowledge, and self-reported workplace health and safety behaviors at three different points in time. The participants’ ages ranged from 15 to 24 years. Age-adjusted results exhibited a significant increase in knowledge immediately after completing the training, although knowledge decreased in both groups at follow-up. Amazon Mechanical Turk (MTurk) participants demonstrated a greater increase in knowledge, with a significantly higher score compared to baseline, indicating retention of knowledge three months after completing the training. The majority of participants in both groups reported that they liked the Promoting U through Safety and Health (PUSH) training for improving health and safety and that the training should be provided before starting a job. Participants also said that the training was interactive, informative, and humorous. The participants reported that the PUSH training prepared them to identify and control hazards in their workplace and to communicate well with supervisors and coworkers about their rights. Training programs based on TWH improve the safety, health, and well-being of young workers.
38
MTurk Participants Have Substantially Lower Evaluative Subjective Well-Being Than Other Survey Participants. Comput Human Behav 2019; 94:1-8. [PMID: 30880871] [DOI: 10.1016/j.chb.2018.12.042]
Abstract
Amazon's MTurk platform has become a popular site for obtaining relatively inexpensive and convenient adult samples for use in behavioral research. Concerns have been raised about selection issues, because MTurk workers choose to participate in the platform and select the tasks they perform (of the many offered to them). Prior studies have documented demographic and psychological differences with national samples. In this paper we studied evaluative subjective well-being (the Cantril Ladder) in an MTurk sample, a national Internet panel sample, and a national telephone survey conducted by Gallup-Sharecare. A surprising finding was that MTurk participants' Ladder scores were substantially lower than those of the other two samples. Analyses controlling for six demographic differences among the samples only slightly reduced the mean differences. However, patterns of demographic-well-being associations were similar within the samples. To corroborate these results, we conducted a secondary analysis on another three samples: one MTurk sample and two Internet panel samples. The same group differences in Ladder scores were observed. These findings add to the growing literature documenting the characteristics of MTurk samples, and we discuss the implications for future research with such samples.
39
Schluter MG, Kim HS, Hodgins DC. Obtaining quality data using behavioral measures of impulsivity in gambling research with Amazon's Mechanical Turk. J Behav Addict 2018; 7:1122-1131. [PMID: 30522339] [PMCID: PMC6376390] [DOI: 10.1556/2006.7.2018.117]
Abstract
BACKGROUND AND AIMS To date, no research has examined the viability of using behavioral tasks typical of cognitive and neuropsychology within addiction populations through online recruitment methods. Therefore, we examined the reliability and validity of three behavioral tasks of impulsivity common in addiction research in a sample of individuals with a current or past history of problem gambling recruited online. METHODS Using a two-stage recruitment process, a final sample of 110 participants with a history of problem or disordered gambling was recruited through MTurk and completed self-report questionnaires of gambling involvement and symptomatology, a Delay Discounting Task (DDT), the Balloon Analogue Risk Task (BART), a Cued Go/No-Go Task, and the UPPS-P. RESULTS Participants demonstrated logically consistent responding on the DDT. The area under the empirical discounting curve (AUC) ranged from 0.02 to 0.88 (M = 0.23). The BART demonstrated good split-third reliability (ρs = 0.67 to 0.78). The tasks generally showed small correlations with each other (ρs = ±0.06 to 0.19) and with UPPS-P subscales (ρs = ±0.01 to 0.20). DISCUSSION AND CONCLUSIONS The behavioral tasks demonstrated good divergent validity. Correlation magnitudes between behavioral tasks and UPPS-P scales, and mean scores on these measures, were generally consistent with the existing literature. Behavioral tasks of impulsivity appear to have utility for use with problem and disordered gambling samples collected online, allowing researchers a cost-efficient and rapid avenue for conducting behavioral research with gamblers. We conclude with best-practice recommendations for using behavioral tasks with crowdsourced samples.
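The AUC reported for the DDT is conventionally computed by normalizing delays and indifference points to [0, 1] and summing trapezoids; a minimal sketch of that computation, with hypothetical indifference points rather than the study's data:

```python
# Sketch: area under the empirical discounting curve (normalize-and-trapezoid).
import numpy as np

delays = np.array([1, 7, 30, 90, 180, 365])   # days until delayed reward
indiff = np.array([95, 80, 60, 40, 25, 15])   # hypothetical indifference points for $100

x = delays / delays.max()   # normalize delays to [0, 1]
y = indiff / 100.0          # normalize values by the full amount

# Prepend the zero-delay point (full value now), then integrate.
x = np.concatenate(([0.0], x))
y = np.concatenate(([1.0], y))
auc = np.trapz(y, x)        # lower AUC = steeper discounting

print(f"AUC = {auc:.2f}")   # the study reports a sample range of 0.02-0.88 (M = 0.23)
```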
40
Harman E, Azzam T. Incorporating public values into evaluative criteria: Using crowdsourcing to identify criteria and standards. Eval Program Plann 2018; 71:68-82. [PMID: 30165260] [DOI: 10.1016/j.evalprogplan.2018.08.004]
Abstract
At its core, evaluation involves the generation of value judgments. These evaluative judgments are based on comparing an evaluand's performance to what the evaluand is supposed to do (criteria) and how well it is supposed to do it (standards). The aim of this four-phase study was to test whether criteria and standards can be set via crowdsourcing, a potentially cost- and time-effective approach to collecting public opinion data. In the first three phases, participants were presented with a program description, then asked to complete a task to either identify criteria (phase one), weigh criteria (phase two), or set standards (phase three). Phase four found that the crowd-generated criteria were high quality; more specifically, that they were clear and concise, complete, non-overlapping, and realistic. Overall, the study concludes that crowdsourcing has the potential to be used in evaluation for setting stable, high-quality criteria and standards.
41
Anderson CA, Allen JJ, Plante C, Quigley-McBride A, Lovett A, Rokkum JN. The MTurkification of Social and Personality Psychology. Pers Soc Psychol Bull 2018; 45:842-850. [PMID: 30317918] [DOI: 10.1177/0146167218798821]
Abstract
The potential role of brief online studies in changing the types of research and theories likely to evolve is examined in the context of earlier changes in theory and methods in social and personality psychology, changes that favored low-difficulty, high-volume studies. An evolutionary metaphor suggests that the current publication environment of social and personality psychology is a highly competitive one, and that academic survival and reproduction processes (getting a job, tenure/promotion, grants, awards, good graduate students) can result in the extinction of important research domains. Tracking the prevalence of brief online studies, exemplified by studies using Amazon Mechanical Turk, in three top journals (Journal of Personality and Social Psychology, Personality and Social Psychology Bulletin, and Journal of Experimental Social Psychology) reveals a dramatic increase in their frequency and proportion. Implications and suggestions concerning this trend for the field, and questions for its practitioners, are discussed.
42
Comparing Amazon's Mechanical Turk Platform to Conventional Data Collection Methods in the Health and Medical Research Literature. J Gen Intern Med 2018; 33:533-538. [PMID: 29302882] [PMCID: PMC5880761] [DOI: 10.1007/s11606-017-4246-0]
Abstract
BACKGROUND The goal of this article is to assess the peer-reviewed primary literature analyzing Amazon.com's Mechanical Turk (MTurk) as a research tool in a health services research and medical context. METHODS Searches of Google Scholar and PubMed databases were conducted in February 2017. We screened article titles and abstracts to identify relevant articles that compare data from MTurk samples in a health and medical context to another sample, expert opinion, or other gold standard. Full-text manuscript reviews were conducted for the 35 articles that met the study criteria. RESULTS The vast majority of the studies supported the use of MTurk for a variety of academic purposes. DISCUSSION The literature overwhelmingly concludes that MTurk is an efficient, reliable, cost-effective tool for generating sample responses that are largely comparable to those collected via more conventional means. Caveats include that survey responses may not be generalizable to the US population.
43
Are all "research fields" equal? Rethinking practice for the use of data from crowdsourcing market places. Behav Res Methods 2018; 49:1333-1342. [PMID: 27515317 PMCID: PMC5541108 DOI: 10.3758/s13428-016-0789-y] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
New technologies like large-scale social media sites (e.g., Facebook and Twitter) and crowdsourcing services (e.g., Amazon Mechanical Turk, Crowdflower, Clickworker) are impacting social science research and providing many new and interesting avenues for research. The use of these new technologies for research has not been without challenges, and a recently published psychological study on Facebook has led to a widespread discussion of the ethics of conducting large-scale experiments online. Surprisingly little has been said about the ethics of conducting research using commercial crowdsourcing marketplaces. In this article, I focus on the question of which ethical questions are raised by data collection with crowdsourcing tools. I briefly draw on the implications of Internet research more generally, and then focus on the specific challenges that research with crowdsourcing tools faces. I identify fair pay and the related issue of respect for autonomy, as well as problems with the power dynamic between researcher and participant, which has implications for withdrawal without prejudice, as the major ethical challenges of crowdsourced data. Furthermore, I wish to draw attention to how we can develop a "best practice" for researchers using crowdsourcing tools.
44
Harman E, Azzam T. Towards program theory validation: Crowdsourcing the qualitative analysis of participant experiences. Eval Program Plann 2018; 66:183-194. [PMID: 28919291] [DOI: 10.1016/j.evalprogplan.2017.08.008]
Abstract
This exploratory study examines a novel tool for validating program theory through crowdsourced qualitative analysis. It combines a quantitative pattern matching framework traditionally used in theory-driven evaluation with crowdsourcing to analyze qualitative interview data. A sample of crowdsourced participants is asked to read an interview transcript, identify whether program theory components (Activities and Outcomes) are discussed, and highlight the most relevant passage about each component. The findings indicate that using crowdsourcing to analyze qualitative data can differentiate between program theory components that are supported by a participant's experience and those that are not. This approach expands the range of tools available to validate program theory using qualitative data, thus strengthening the theory-driven approach.
45
Winking J. Exploring the Great Schism in the Social Sciences: Confirmation Bias and the Interpretation of Results Relating to Biological Influences on Human Behavior and Psychology. Evol Psychol 2018; 16:1474704917752691. [PMID: 29353493] [PMCID: PMC10506139] [DOI: 10.1177/1474704917752691]
Abstract
The nature-nurture debate is one that biologists often dismiss as a false dichotomy, as all phenotypic traits are the results of complex processes of gene and environment interactions. However, such dismissiveness belies the ongoing debate that is unmistakable throughout the biological and social sciences concerning the role of biological influences in the development of psychological and behavioral traits in humans. Many have proposed that this debate is due to ideologically driven biases in the interpretation of results. Those favoring biological approaches have been accused of a greater willingness to accept biological explanations so as to rationalize or justify the status quo of inequality. Those rejecting biological approaches have been accused of an unwillingness to accept biological explanations so as to attribute inequalities solely to social and institutional factors, ultimately allowing for the possibility of social equality. While it is important to continue to investigate this topic through further research and debate, another approach is to examine the degree to which the allegations of bias are indeed valid. To accomplish this, a convenience sample of individuals with relevant postgraduate degrees was recruited from Mechanical Turk and social media. Participants were asked to rate the inferential power of different research designs and of mock results that varied in the degree to which they supported different ideologies. Results suggested that researchers harbor sincere differences of opinion concerning the inferential value of relevant research, with no indication that ideological confirmation biases drive these differences. However, challenges associated with recruiting a large enough sample of experts, as well as with identifying believable mock scenarios, limit the study's inferential scope.
46
Beymer MR, Holloway IW, Grov C. Comparing Self-Reported Demographic and Sexual Behavioral Factors Among Men Who Have Sex with Men Recruited Through Mechanical Turk, Qualtrics, and a HIV/STI Clinic-Based Sample: Implications for Researchers and Providers. Arch Sex Behav 2018; 47:133-142. [PMID: 28332037] [PMCID: PMC5610054] [DOI: 10.1007/s10508-016-0932-y]
Abstract
Recruitment for HIV research among gay, bisexual, and other men who have sex with men (MSM) has increasingly moved to the online sphere. However, there are limited data comparing the characteristics of clinic-based respondents versus those recruited via online survey platforms. MSM were recruited from three sampling sites (STI clinic, MTurk, and Qualtrics) to participate in a survey from March 2015 to April 2016. Respondents from the three sampling sites were compared on demographics, sexual history, substance use, and attention filter passage. Attention filter passage was high for the online sampling sites (MTurk = 93%; Qualtrics = 86%) but significantly lower for the clinic-based sampling site (72%). Clinic-based respondents were significantly more racially/ethnically diverse, reported lower income, and reported more unemployment than online respondents. Clinic-based respondents also reported significantly more male sexual partners in the previous 3 months (clinic-based: M = 6.0; MTurk: M = 3.6; Qualtrics: M = 4.5), a higher proportion of gonorrhea, chlamydia, and/or syphilis in the last year, a greater proportion of methamphetamine use (clinic-based = 21%; MTurk = 5%), and greater inhaled nitrate use (clinic-based = 41%; MTurk = 11%). The clinic-based sample demonstrated more demographic diversity and a greater proportion of HIV risk behaviors when compared to the online samples, but also a relatively low attention filter passage rate. We recommend the use of attention filters across all modalities to assess response validity and urge caution with online survey engines, as samples may differ demographically and behaviorally when compared to clinic-based respondents.
47
Bartek MA, Truitt AR, Widmer-Rodriguez S, Tuia J, Bauer ZA, Comstock BA, Edwards TC, Lawrence SO, Monsell SE, Patrick DL, Jarvik JG, Lavallee DC. The Promise and Pitfalls of Using Crowdsourcing in Research Prioritization for Back Pain: Cross-Sectional Surveys. J Med Internet Res 2017; 19:e341. [PMID: 28986339] [PMCID: PMC5650676] [DOI: 10.2196/jmir.8821]
Abstract
Background The involvement of patients in research better aligns evidence generation to the gaps that patients themselves face when making decisions about health care. However, obtaining patients’ perspectives is challenging. Amazon’s Mechanical Turk (MTurk) has gained popularity over the past decade as a crowdsourcing platform to reach large numbers of individuals to perform tasks for a small reward for the respondent, at small cost to the investigator. The appropriateness of such crowdsourcing methods in medical research has yet to be clarified. Objective The goals of this study were to (1) understand how those on MTurk who screen positive for back pain prioritize research topics compared with those who screen negative for back pain, and (2) determine the qualitative differences in open-ended comments between groups. Methods We conducted cross-sectional surveys on MTurk to assess participants’ back pain and allow them to prioritize research topics. We paid respondents US $0.10 to complete the 24-point Roland Morris Disability Questionnaire (RMDQ) to categorize participants as those “with back pain” and those “without back pain,” then offered both those with (RMDQ score ≥7) and those without back pain (RMDQ <7) an opportunity to rank their top 5 (of 18) research topics for an additional US $0.75. We compared demographic information and research priorities between the 2 groups and performed qualitative analyses on free-text commentary that participants provided. Results We conducted 2 screening waves. We first screened 2189 individuals for back pain over 33 days and invited 480 (21.93%) who screened positive to complete the prioritization, of whom 350 (72.9% of eligible) did. We later screened 664 individuals over 7 days and invited 474 (71.4%) without back pain to complete the prioritization, of whom 397 (83.7% of eligible) did. Those with back pain who prioritized were comparable with those without in terms of age, education, marital status, and employment. The group with back pain had a higher proportion of women (234, 67.2% vs 229, 57.8%, P=.02). The groups’ rank lists of research priorities were highly correlated: Spearman correlation coefficient was .88 when considering topics ranked in the top 5. The 2 groups agreed on 4 of the top 5 and 9 of the top 10 research priorities. Conclusions Crowdsourcing platforms such as MTurk support efforts to efficiently reach large groups of individuals to obtain input on research activities. In the context of back pain, a prevalent and easily understood condition, the rank list of those with back pain was highly correlated with that of those without back pain. However, subtle differences in the content and quality of free-text comments suggest supplemental efforts may be needed to augment the reach of crowdsourcing in obtaining perspectives from patients, especially from specific populations.
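The headline result here is a Spearman rank correlation between the two groups' topic rankings; a minimal sketch with hypothetical ranks (the study itself reports a coefficient of .88 when considering top-5 topics):

```python
# Sketch: Spearman's rho between two groups' priority rankings of the same topics.
from scipy.stats import spearmanr

ranks_back_pain = [1, 2, 3, 4, 5, 7, 6, 9, 8, 10]  # hypothetical topic ranks, group 1
ranks_no_pain   = [2, 1, 3, 5, 4, 6, 8, 7, 10, 9]  # hypothetical topic ranks, group 2

rho, p = spearmanr(ranks_back_pain, ranks_no_pain)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # high rho = highly similar priorities
```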
48
DePalma MT, Rizzotti MC, Branneman M. Assessing Diabetes-Relevant Data Provided by Undergraduate and Crowdsourced Web-Based Survey Participants for Honesty and Accuracy. JMIR Diabetes 2017; 2:e11. [PMID: 30291072] [PMCID: PMC6238844] [DOI: 10.2196/diabetes.7473]
Abstract
Background To eliminate health disparities, research will depend on our ability to reach select groups of people (eg, samples of a particular racial or ethnic group with a particular disease); unfortunately, researchers often experience difficulty obtaining high-quality data from samples of sufficient size. Objective Past studies utilizing MTurk applaud its diversity, so our initial objective was to capitalize on MTurk’s diversity to investigate psychosocial factors related to diabetes self-care. Methods In Study 1, a “Health Survey” was posted on MTurk to examine diabetes-relevant psychosocial factors. The survey was restricted to individuals who were 18 years of age or older with diabetes. Detection of irregularities in the data, however, prompted an evaluation of the quality of MTurk health-relevant data. This ultimately led to Study 2, which utilized an alert statement to improve conscientious behavior, or the likelihood that participants would be thorough and diligent in their responses. Trap questions were also embedded to assess conscientious behavior. Results In Study 1, of 4165 responses, 1246 were generated from 533 unique IP addresses completing the survey multiple times within close temporal proximity. Ultimately, only 252 responses were found to be acceptable. Further analyses indicated additional quality concerns with this subsample. In Study 2, as compared with the MTurk sample (N=316), the undergraduate sample (N=300) included more females and fewer individuals who were married. The samples did not differ with respect to race. Although the presence of an alert resulted in fewer trap failures (mean=0.07) than when no alert was present (mean=0.11), this difference failed to reach significance: F(1, 604)=2.5, P=.11, η²=.004, power=.35. The modal trap failure response was zero, while the mean was 0.092 (SD=0.32). There were a total of 60 trap failures in a context where the potential could have exceeded 16,000. Conclusions Published studies that utilize MTurk participants are rapidly appearing in the health domain. While MTurk may have the potential to be more diverse than an undergraduate sample, our efforts did not yield what would constitute a diverse sample in and of itself. Because some researchers have experienced successful data collection on MTurk while others report disastrous results, Kees et al. recently identified the types and magnitude of cheating behavior occurring on Web-based platforms as one essential area of research. The present studies contribute to this dialogue, providing evidence of both disaster and success. Moving forward, it is recommended that researchers employ best practices in survey design and deliberately embed trap questions to assess participant behavior. We would strongly suggest that standards be put in place for publishing the results of Web-based surveys: standards that protect against publication unless suitable quality assurance tests are built into the survey design, distribution, and analysis.
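The reported alert effect can be checked directly against the F distribution; a minimal sketch recovering the quoted p-value from F(1, 604) = 2.5:

```python
# Sketch: p-value for the alert-effect test via the F survival function.
from scipy.stats import f

F_stat, df1, df2 = 2.5, 1, 604
p = f.sf(F_stat, df1, df2)
print(f"p = {p:.3f}")  # ≈ 0.114, matching the reported non-significant P = .11
```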
49
Zamboanga BL, Audley S, Olthuis JV, Blumenthal H, Tomaso CC, Bui N, Borsari B. Validation of a Seven-Factor Structure for the Motives for Playing Drinking Games Measure. Assessment 2017; 26:582-603. [PMID: 28412835] [DOI: 10.1177/1073191117701191]
Abstract
Playing drinking games can be characterized as a high-risk drinking activity because games are typically designed to promote heavy alcohol consumption. While research suggests that young adults are motivated to play drinking games for a variety of reasons (e.g., for thrills/fun, for the competition), the Motives for Playing Drinking Games measure has received limited empirical attention. We examined the psychometric properties of this measure with a confirmation sample of young adults recruited from Amazon's MTurk (N = 1,809; ages 18-25 years; 47% men; 41% not currently enrolled in college) and a validation sample of college students (N = 671; ages 18-23 years; 26% men). Contrary to the 8-factor model obtained by Johnson and Sheets in a study published in 2004, examination of the factor structure with our confirmation sample yielded a revised 7-factor model that was invariant across race/ethnicity and college student status. This model was also validated with the college student sample. In the confirmation sample, enhancement/thrills and sexual pursuit motives for playing drinking games were positively associated with gaming frequency/consumption and negative gaming consequences. Furthermore, conformity motives for playing drinking games were positively associated with negative gaming consequences, while competition motives were positively associated with gaming frequency. These findings have significant implications for research and prevention/intervention efforts.
50
Contractor AA, Frankfurt SB, Weiss NH, Elhai JD. Latent-level relations between DSM-5 PTSD symptom clusters and problematic smartphone use. Comput Human Behav 2017; 72:170-177. [PMID: 28993716] [DOI: 10.1016/j.chb.2017.02.051]
Abstract
Common mental health consequences following the experience of potentially traumatic events include Posttraumatic Stress Disorder (PTSD) and addictive behaviors. Problematic smartphone use is a newer manifestation of addictive behaviors. People with heightened anxiety severity (such as those with PTSD) may be at risk for problematic smartphone use as a means of coping with their symptoms. Unique to our knowledge, we assessed relations between PTSD symptom clusters and problematic smartphone use. Participants (N = 347), recruited through Amazon's Mechanical Turk (MTurk), completed measures of PTSD and smartphone addiction. Results of the Wald tests of parameter constraints indicated that problematic smartphone use was more related to PTSD's negative alterations in cognitions and mood (NACM) factor than to PTSD's avoidance factor, Wald χ²(1, N = 347) = 12.51, p = 0.0004, and more related to PTSD's arousal factor than to its avoidance factor, Wald χ²(1, N = 347) = 14.89, p = 0.0001. Results indicate that problematic smartphone use is most associated with negative affect and arousal among trauma-exposed individuals. Implications include the need to clinically assess problematic smartphone use among trauma-exposed individuals presenting with higher NACM and arousal severity, and to target NACM and arousal symptoms to mitigate the effects of problematic smartphone use.
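The quoted Wald statistics are chi-square distributed with one degree of freedom, so the reported p-values can be recovered directly; a minimal sketch:

```python
# Sketch: convert the reported Wald chi-square statistics (df = 1) to p-values.
from scipy.stats import chi2

for label, stat in [("NACM vs avoidance", 12.51), ("arousal vs avoidance", 14.89)]:
    print(f"{label}: p = {chi2.sf(stat, df=1):.4f}")
# Expected output matches the abstract: p ≈ 0.0004 and p ≈ 0.0001.
```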