1
Hasan E, Duhaime E, Trueblood JS. Boosting wisdom of the crowd for medical image annotation using training performance and task features. Cogn Res Princ Implic 2024; 9:31. [PMID: 38763994 PMCID: PMC11102897 DOI: 10.1186/s41235-024-00558-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Accepted: 04/29/2024] [Indexed: 05/21/2024] Open
Abstract
A crucial bottleneck in medical artificial intelligence (AI) is the lack of high-quality labeled medical datasets. In this paper, we test a large variety of wisdom of the crowd algorithms to label medical images that were initially classified by individuals recruited through an app-based platform. Individuals classified skin lesions from the International Skin Lesion Challenge 2018 into 7 different categories. There was a large dispersion in the geographical location, experience, training, and performance of the recruited individuals. We tested several wisdom of the crowd algorithms of varying complexity, from a simple unweighted average to more complex Bayesian models that account for individual patterns of errors. Using a switchboard analysis, we observe that the best-performing algorithms rely on selecting top performers, weighting decisions by training accuracy, and taking into account the task environment. These algorithms far exceed expert performance. We conclude by discussing the implications of these approaches for the development of medical AI.
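The three ingredients named in the abstract (an unweighted vote, weighting by training accuracy, and selecting top performers) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, annotator ids, accuracies, and labels below are all hypothetical, and the Bayesian models and task-feature adjustments mentioned in the abstract are not covered.

```python
from collections import defaultdict


def aggregate_votes(votes, weights=None):
    """Return the label with the largest (optionally weighted) vote total.

    votes:   dict mapping annotator id -> chosen label
    weights: dict mapping annotator id -> non-negative weight; None means
             every annotator counts equally (simple plurality vote).
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += 1.0 if weights is None else weights[annotator]
    # Ties are broken by alphabetical label order so the result is deterministic.
    return max(sorted(totals), key=lambda label: totals[label])


def top_performers(training_accuracy, k):
    """Ids of the k annotators with the highest training accuracy."""
    ranked = sorted(training_accuracy, key=training_accuracy.get, reverse=True)
    return set(ranked[:k])


# Hypothetical data: five annotators label one lesion image.
training_accuracy = {"a": 0.90, "b": 0.55, "c": 0.50, "d": 0.85, "e": 0.45}
votes = {"a": "melanoma", "b": "nevus", "c": "nevus", "d": "melanoma", "e": "nevus"}

plurality = aggregate_votes(votes)                             # equal weights
weighted = aggregate_votes(votes, weights=training_accuracy)   # accuracy-weighted
selected = top_performers(training_accuracy, k=2)
top_k = aggregate_votes({a: v for a, v in votes.items() if a in selected})
```

In this made-up case the plurality vote follows the three weaker annotators, while both the accuracy-weighted vote and the top-2 vote follow the two strongest ones.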
Affiliation(s)
- Eeshan Hasan
  - Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA.
  - Cognitive Science Program, Indiana University, Bloomington, USA.
- Jennifer S Trueblood
  - Department of Psychological and Brain Sciences, Indiana University, 1101 E. 10th St., Bloomington, IN, 47405-7007, USA.
  - Cognitive Science Program, Indiana University, Bloomington, USA.
2
Madirolas G, Zaghi-Lara R, Gomez-Marin A, Pérez-Escudero A. The motor Wisdom of the Crowd. J R Soc Interface 2022; 19:20220480. [PMID: 36195116 PMCID: PMC9532022 DOI: 10.1098/rsif.2022.0480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 09/15/2022] [Indexed: 11/12/2022] Open
Abstract
Wisdom of the Crowd is the aggregation of many individual estimates to obtain a better collective one. Because of its enormous social potential, this effect has been thoroughly investigated, but predominantly on tasks that involve rational thinking (such as estimating a number). Here we tested this effect in the context of drawing geometrical shapes, which still enacts cognitive processes but mainly involves visuomotor control. We asked more than 700 school students to trace five patterns shown on a touchscreen and then aggregated their individual trajectories to improve the match with the original pattern. Our results show the characteristics of the strongest examples of Wisdom of the Crowd. First, the aggregate trajectory can be up to 5 times more accurate than the individual ones. Second, this great improvement requires aggregating trajectories from different individuals (rather than trials from the same individual). Third, the aggregate trajectory outperforms more than 99% of individual trajectories. Fourth, while older individuals outperform younger ones, a crowd of young individuals outperforms the average older one. These results demonstrate for the first time Wisdom of the Crowd in the realm of motor control, opening the door to further studies of human and also animal behavioural trajectories and their mechanistic underpinnings.
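A rough sketch of why aggregating trajectories can help: if every trace is sampled at the same points along the pattern, the point-wise mean of many noisy traces lies closer to the template than a typical individual trace. The Python below is a toy simulation under assumptions that are not from the paper: a circular template, independent Gaussian noise per point, and traces that are already aligned. The authors' actual alignment and aggregation procedure is not described in the abstract.

```python
import math
import random


def mean_error(trace, template):
    """Average point-to-point Euclidean distance between two equal-length traces."""
    return sum(math.dist(p, q) for p, q in zip(trace, template)) / len(template)


def aggregate(traces):
    """Point-wise mean of several traces that share the same sampling."""
    n = len(traces)
    return [
        (sum(t[i][0] for t in traces) / n, sum(t[i][1] for t in traces) / n)
        for i in range(len(traces[0]))
    ]


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical template: a unit circle sampled at 100 points.
template = [
    (math.cos(2 * math.pi * i / 100), math.sin(2 * math.pi * i / 100))
    for i in range(100)
]

# Hypothetical crowd: 50 traces, each the template plus independent noise.
traces = [
    [(x + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1)) for x, y in template]
    for _ in range(50)
]

individual_errors = [mean_error(t, template) for t in traces]
mean_individual_error = sum(individual_errors) / len(individual_errors)
crowd_error = mean_error(aggregate(traces), template)
```

Because Euclidean distance is convex, the error of the averaged trace can never exceed the average of the individual errors. How much smaller it gets depends on how independent the individual errors are, which is consistent with the abstract's point that aggregating across individuals matters more than aggregating repeated trials from one person.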
Affiliation(s)
- Gabriel Madirolas
  - Research Centre on Animal Cognition (CRCA), Centre for Integrative Biology (CBI), Toulouse University, CNRS, UPS, 31062 Toulouse, France
- Regina Zaghi-Lara
  - Behavior of Organisms Laboratory, Instituto de Neurociencias de Alicante (CSIC-UMH), Alicante, Spain
- Alex Gomez-Marin
  - Behavior of Organisms Laboratory, Instituto de Neurociencias de Alicante (CSIC-UMH), Alicante, Spain
  - The Pari Center, via Tozzi 7, 58045 Pari (GR), Italy
- Alfonso Pérez-Escudero
  - Research Centre on Animal Cognition (CRCA), Centre for Integrative Biology (CBI), Toulouse University, CNRS, UPS, 31062 Toulouse, France
3
Abstract
With the increase in artificial intelligence in real-world applications, there is interest in building hybrid systems that take both human and machine predictions into account. Previous work has shown the benefits of separately combining the predictions of diverse machine classifiers or groups of people. Using a Bayesian modeling framework, we extend these results by systematically investigating the factors that influence the performance of hybrid combinations of human and machine classifiers while taking into account the unique ways human and algorithmic confidence is expressed.
Artificial intelligence (AI) and machine learning models are being increasingly deployed in real-world applications. In many of these applications, there is strong motivation to develop hybrid systems in which humans and AI algorithms can work together, leveraging their complementary strengths and weaknesses. We develop a Bayesian framework for combining the predictions and different types of confidence scores from humans and machines. The framework allows us to investigate the factors that influence complementarity, where a hybrid combination of human and machine predictions leads to better performance than combinations of human or machine predictions alone. We apply this framework to a large-scale dataset where humans and a variety of convolutional neural networks perform the same challenging image classification task. We show empirically and theoretically that complementarity can be achieved even if the human and machine classifiers perform at different accuracy levels as long as these accuracy differences fall within a bound determined by the latent correlation between human and machine classifier confidence scores. In addition, we demonstrate that hybrid human–machine performance can be improved by differentiating between the errors that humans and machine classifiers make across different class labels. Finally, our results show that eliciting and including human confidence ratings improve hybrid performance in the Bayesian combination model. Our approach is applicable to a wide variety of classification problems involving human and machine algorithms.
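As a much simpler stand-in for the Bayesian combination described above, the Python sketch below fuses one machine probability vector with one human label through Bayes' rule, using a per-class human confusion table. It assumes that human and machine errors are independent given the true class and that the machine probabilities are calibrated. The paper's framework instead models the latent correlation between human and machine confidence, so this is an illustration of the general idea only. All names and numbers are hypothetical.

```python
def combine(machine_probs, human_label, human_confusion):
    """Posterior over classes from one machine output and one human label.

    machine_probs:   dict class -> machine probability (assumed calibrated)
    human_label:     the class the human chose
    human_confusion: dict true class -> dict reported class -> probability
    Assumption: human and machine errors are independent given the true class.
    """
    unnormalised = {
        c: machine_probs[c] * human_confusion[c][human_label]
        for c in machine_probs
    }
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


# Hypothetical two-class example: the machine leans "cat", the human says "dog".
machine_probs = {"cat": 0.6, "dog": 0.4}
human_confusion = {
    "cat": {"cat": 0.7, "dog": 0.3},  # humans mislabel cats as dogs 30% of the time
    "dog": {"cat": 0.1, "dog": 0.9},  # humans rarely mislabel dogs
}

posterior = combine(machine_probs, "dog", human_confusion)
hybrid_label = max(posterior, key=posterior.get)
```

With these made-up numbers the posterior is 1/3 for "cat" and 2/3 for "dog", so the hybrid decision differs from the machine's own top choice. Using a class-specific confusion table is one way to reflect the abstract's observation that humans and machines make different errors across class labels.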
4
Richardson E, Keil FC. The potential for effective reasoning guides children's preference for small group discussion over crowdsourcing. Sci Rep 2022; 12:1193. [PMID: 35075164 PMCID: PMC8786842 DOI: 10.1038/s41598-021-04680-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Accepted: 12/20/2021] [Indexed: 11/09/2022] Open
Abstract
Communication between social learners can make a group collectively "wiser" than any individual, but conformist tendencies can also distort collective judgment. We asked whether intuitions about when communication is likely to improve or distort collective judgment could allow social learners to take advantage of the benefits of communication while minimizing the risks. In three experiments (n = 360), 7- to 10-year-old children and adults decided whether to refer a question to a small group for discussion or "crowdsource" independent judgments from individual advisors. For problems affording the kind of 'demonstrative' reasoning that allows a group member to reliably correct errors made by even a majority, all ages preferred to consult the discussion group, even compared to a crowd ten times as large, consistent with past research suggesting that discussion groups regularly outperform even their best members for reasoning problems. In contrast, we observed a consistent developmental shift towards crowdsourcing independent judgments when reasoning by itself was insufficient to conclusively answer a question. Results suggest sophisticated intuitions about the nature of social influence and collective intelligence may guide our social learning strategies from early in development.
Affiliation(s)
- Emory Richardson
  - Department of Psychology, Yale University, 2 Hillhouse Ave, New Haven, CT, 06520-8205, USA.
- Frank C Keil
  - Department of Psychology, Yale University, 2 Hillhouse Ave, New Haven, CT, 06520-8205, USA.
5
Lange-Küttner C, Puiu AA. Perceptual Load and Sex-Specific Personality Traits. Exp Psychol 2021; 68:149-164. [PMID: 34711075 PMCID: PMC8691178 DOI: 10.1027/1618-3169/a000520] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/02/2020] [Revised: 07/15/2021] [Accepted: 07/18/2021] [Indexed: 12/04/2022]
Abstract
The impact of sex-specific personality traits has often been investigated for visuospatial tasks such as mental rotation, but less is known about the influence of personality traits on visual search. We investigated whether the Big Five personality traits Extroversion (E), Openness (O), Agreeableness (A), Conscientiousness (C), and Neuroticism (N) and the Autism Quotient (AQ) influence visual search in a sample of N = 65 men and women. In three experiments, we varied stimulus complexity and predictability. As expected, latencies were longer when the target was absent. Pop-out search was faster than conjunction search. A large number of distracters slowed down reaction times (RTs). When stimulus complexity was not predictable in Experiment 3, this reduced search accuracy by about half. As could be predicted based on previous research on long RT tails, conjunction search in target absent trials revealed the impact of personality traits. The RT effect in visual search of the accelerating "less social" AQ score was specific to men, while the effects of the "more social" decelerating Big Five Inventory factors agreeableness and conscientiousness were specific to women. Thus, sex-specific personality traits could explain decision-making thresholds, while visual stimulus complexity yielded an impact of the classic personality traits neuroticism and extroversion.
Affiliation(s)
- Andrei-Alexandru Puiu
  - Department of Psychiatry, Psychotherapy and Psychosomatics, Faculty of Medicine, RWTH Aachen University, Germany
6
Lago MA, Jonnalagadda A, Abbey CK, Barufaldi BB, Bakic PR, Maidment ADA, Leung WK, Weinstein SP, Englander BS, Eckstein MP. Under-exploration of Three-Dimensional Images Leads to Search Errors for Small Salient Targets. Curr Biol 2021; 31:1099-1106.e5. [PMID: 33472051 PMCID: PMC8048135 DOI: 10.1016/j.cub.2020.12.029] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/30/2020] [Revised: 10/09/2020] [Accepted: 12/18/2020] [Indexed: 10/22/2022]
Abstract
Advances in 3D imaging technology are transforming how radiologists search for cancer [1,2] and how security officers scrutinize baggage for dangerous objects [3]. These new 3D technologies often improve search over 2D images [4,5] but vastly increase the image data. Here, we investigate 3D search for targets of various sizes in filtered noise and digital breast phantoms. For a Bayesian ideal observer optimally processing the filtered noise and a convolutional neural network processing the digital breast phantoms, search with 3D image stacks increases target information and improves accuracy over search with 2D images. In contrast, 3D search by humans leads to high miss rates for small targets easily detected in 2D search, but not for larger targets more visible in the visual periphery. Analyses of human eye movements, perceptual judgments, and a computational model with a foveated visual system suggest that human errors can be explained by interaction among a target's peripheral visibility, eye movement under-exploration of the 3D images, and a perceived overestimation of the explored area. Instructing observers to extend the search reduces 75% of the small target misses without increasing false positives. Results with twelve radiologists confirm that even medical professionals reading realistic breast phantoms have high miss rates for small targets in 3D search. Thus, under-exploration represents a fundamental limitation to the efficacy with which humans search in 3D image stacks and miss targets with these prevalent image technologies.
Affiliation(s)
- Miguel A Lago
  - Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
- Aditya Jonnalagadda
  - Department of Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
  - Institute for Collaborative Biotechnologies, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
- Craig K Abbey
  - Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
- Bruno B Barufaldi
  - Department of Radiology, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Predrag R Bakic
  - Department of Radiology, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Andrew D A Maidment
  - Department of Radiology, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Winifred K Leung
  - Ridley-Tree Cancer Center, Sansum Clinic, 540 W. Pueblo Street, Santa Barbara, CA 93105, USA
- Susan P Weinstein
  - Department of Radiology, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Brian S Englander
  - Department of Radiology, University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
- Miguel P Eckstein
  - Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
  - Department of Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
  - Institute for Collaborative Biotechnologies, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
7
Wisdom of crowds benefits perceptual decision making across difficulty levels. Sci Rep 2021; 11:538. [PMID: 33436921 PMCID: PMC7804123 DOI: 10.1038/s41598-020-80500-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/15/2020] [Accepted: 12/11/2020] [Indexed: 11/29/2022] Open
Abstract
Decades of research on collective decision making has claimed that the aggregated judgement of multiple individuals is more accurate than expert individual judgement. A longstanding problem in this regard has been to determine how decisions of individuals can be combined to form intelligent group decisions. Our study consisted of a random target detection task in natural scenes, where human subjects (18 subjects, 7 female) detected the presence or absence of a random target as indicated by the cue word displayed prior to stimulus display. Concurrently, neural activity (EEG signals) was recorded. A separate behavioural experiment was performed by different subjects (20 subjects, 11 female) on the same set of images to categorize the tasks according to their difficulty levels. We demonstrate that the weighted average of individual decision confidence/neural decision variables produces significantly better performance than the frequently used majority pooling algorithm. Further, the classification error rates from individual judgement were found to increase with increasing task difficulty. This error could be significantly reduced by combining the individual decisions using group aggregation rules. Using statistical tests, we show that combining all available participants is unnecessary to achieve the minimum classification error rate. We also explore whether group aggregation benefits depend on the correlation between the individual judgements in the group; our results suggest that reduced inter-subject correlation can improve collective decision making for a fixed difficulty level.
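The contrast between majority pooling and confidence-weighted pooling for a present/absent decision can be sketched in a few lines of Python. This is an illustrative assumption, not the study's pipeline: decisions are coded as +1 (target present) or -1 (target absent), confidences are hypothetical values between 0 and 1, and the group decision is the sign of the pooled score. The neural decision variables derived from EEG are not modelled here.

```python
def majority_pool(decisions):
    """+1 if most decisions are +1, otherwise -1. Ties resolve to -1."""
    return 1 if sum(decisions) > 0 else -1


def confidence_pool(decisions, confidences):
    """Sign of the confidence-weighted sum of +1/-1 decisions. Ties resolve to -1."""
    score = sum(d * c for d, c in zip(decisions, confidences))
    return 1 if score > 0 else -1


# Hypothetical trial: two confident "present" votes, three hesitant "absent" votes.
decisions = [1, -1, -1, 1, -1]
confidences = [0.9, 0.2, 0.3, 0.8, 0.1]

by_majority = majority_pool(decisions)
by_confidence = confidence_pool(decisions, confidences)
```

In this made-up trial the majority rule reports "absent" while the confidence-weighted rule reports "present", because the two confident observers outweigh the three hesitant ones.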
8
Brennan PC, Ganesan A, Eckstein MP, Ekpo EU, Tapia K, Mello-Thoms C, Lewis S, Juni MZ. Benefits of Independent Double Reading in Digital Mammography: A Theoretical Evaluation of All Possible Pairing Methodologies. Acad Radiol 2019; 26:717-723. [PMID: 30064917 DOI: 10.1016/j.acra.2018.06.017] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/25/2018] [Revised: 06/19/2018] [Accepted: 06/19/2018] [Indexed: 10/28/2022]
Abstract
RATIONALE AND OBJECTIVES: To establish the efficacy of pairing readers randomly and evaluate the merits of developing optimal pairing methodologies.
MATERIALS AND METHODS: Sensitivity, specificity, and proportion correct were computed for three different case sets that were independently read by 16 radiologists. Performance of radiologists as single readers was compared to expected double reading performance. We theoretically evaluated all possible pairing methodologies. Bootstrap resampling methods were used for statistical analyses.
RESULTS: Significant improvements in expected performance for double versus single reading (i.e., delta performance) were shown for all performance measures and case-sets (p ≤ .003), with overall delta performance across all theoretically possible pairing schemes (n = 10,395) ranging between .05 and .08. Delta performance for the 20 best pairing schemes was significant (p < .001) and ranged between .07 and .10. Delta performance for 20 random pairing schemes was also significant (p ≤ .003) and ranged between .05 and .08. Delta performance for the 20 worst pairing schemes ranged between .03 and .06, reaching significance in delta proportion correct (p ≤ .021) for all three case-sets and in delta specificity for two case-sets (p ≤ .033) but not for a third case-set (p = .131), and not reaching significance in delta sensitivity for any of the three case-sets (.098 ≥ p ≥ .067).
CONCLUSION: Significant benefits accrue from double reading, and while random reader pairing achieves most double reading benefits, a strategic pairing approach may maximize the benefits of double reading.
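The count of pairing schemes can be checked with a short calculation. Splitting an even number of readers into unordered pairs can be done in (n - 1) × (n - 3) × ... × 3 × 1 ways, and for 12 readers this gives 11 × 9 × 7 × 5 × 3 × 1 = 10,395, which matches the figure quoted above. The abstract does not state how many of the 16 radiologists were paired within each scheme, so reading the figure as 12 readers per scheme is an inference from the arithmetic, not something the source confirms. The function names in this Python sketch are hypothetical.

```python
def count_pairings(n_readers):
    """Number of ways to split an even number of readers into unordered pairs."""
    if n_readers % 2:
        raise ValueError("need an even number of readers")
    count = 1
    for k in range(n_readers - 1, 0, -2):
        count *= k  # the next unpaired reader can choose any of k partners
    return count


def enumerate_pairings(readers):
    """Yield every pairing scheme as a list of 2-tuples."""
    if not readers:
        yield []
        return
    first, rest = readers[0], readers[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for scheme in enumerate_pairings(remaining):
            yield [(first, partner)] + scheme


schemes_for_12 = count_pairings(12)
schemes_for_16 = count_pairings(16)
small_example = list(enumerate_pairings(["A", "B", "C", "D"]))
```

For four readers the enumeration yields three schemes: (A,B)(C,D), (A,C)(B,D), and (A,D)(B,C). The same recursion could in principle list all schemes for a larger panel so that each one can be scored, although the number of schemes grows quickly with the number of readers.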