1. Hammond H, Armstrong M, Thomas GA, Dalmaijer ES, Bull DR, Gilchrist ID. Narrative predicts cardiac synchrony in audiences. Sci Rep 2024; 14:26369. PMID: 39487185; PMCID: PMC11530447; DOI: 10.1038/s41598-024-73066-8.
Abstract
Audio-visual media possesses a remarkable ability to synchronise audiences' neural, behavioural, and physiological responses. This synchronisation is considered to reflect some dimension of collective attention or engagement with the stimulus. But what is it about these stimuli that drives such strong engagement? Several properties of media stimuli may lead to synchronous audience responses, from low-level audio-visual features to the story itself. Here, we present a study which separates low-level features from narrative by presenting participants with the same content but in separate modalities. In this way, the presentations shared no low-level features, but participants experienced the same narrative. We show that synchrony in participants' heart rate can be driven by the narrative information alone. We computed both visual and auditory perceptual saliency for the content and found that narrative was approximately 10 times as predictive of heart rate as low-level saliency, but that low-level audio-visual saliency had a small additive effect on heart rate. Further, heart rate synchrony was related to a separate cohort's continuous ratings of immersion, and synchrony tended to be higher at moments of increased narrative importance. Our findings demonstrate that high-level narrative dominates in the alignment of physiology across viewers.
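For readers who want a concrete handle on the synchrony measure discussed above: inter-subject physiological synchrony is commonly quantified as the mean pairwise correlation between participants' heart-rate time series. The sketch below is illustrative only and is not the authors' pipeline; the whole-series Pearson correlation and the synthetic data are assumptions.

```python
import numpy as np

def pairwise_synchrony(heart_rates: np.ndarray) -> float:
    """Mean pairwise Pearson correlation across participants.

    heart_rates: shape (n_participants, n_samples); each row is a heart-rate
    time series resampled to a common sampling rate.
    """
    n = heart_rates.shape[0]
    corrs = [
        np.corrcoef(heart_rates[i], heart_rates[j])[0, 1]
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(corrs))

# Synthetic example: a shared "narrative-driven" component plus individual noise.
rng = np.random.default_rng(0)
shared = np.sin(np.linspace(0, 20, 600))                  # common driver
hr = 70 + 5 * shared + rng.normal(0, 2, size=(20, 600))   # 20 participants
print(round(pairwise_synchrony(hr), 3))                   # well above 0
```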
Affiliation(s)
- Hugo Hammond
- School of Psychological Science, University of Bristol, Bristol, UK.
- Bristol Vision Institute, University of Bristol, Bristol, UK.
- Michael Armstrong
- British Broadcasting Corporation (BBC) Research and Development, Salford, UK
- Graham A Thomas
- British Broadcasting Corporation (BBC) Research and Development, Salford, UK
- David R Bull
- Bristol Vision Institute, University of Bristol, Bristol, UK
- Department of Electrical and Electronic Engineering, University of Bristol, Bristol, UK
- Iain D Gilchrist
- School of Psychological Science, University of Bristol, Bristol, UK
- Bristol Vision Institute, University of Bristol, Bristol, UK

2. Martinez-Cedillo AP, Foulsham T. Don't look now! Social elements are harder to avoid during scene viewing. Vision Res 2024; 216:108356. PMID: 38184917; DOI: 10.1016/j.visres.2023.108356.
Abstract
Regions of social importance (i.e., other people) attract attention in real world scenes, but it is unclear how automatic this bias is and how it might interact with other guidance factors. To investigate this, we recorded eye movements while participants were explicitly instructed to avoid looking at one of two objects in a scene (either a person or a non-social object). The results showed that, while participants could follow these instructions, they still made errors (especially on the first saccade). Crucially, there were about twice as many erroneous looks towards the person than there were towards the other object. This indicates that it is hard to suppress the prioritization of social information during scene viewing, with implications for how quickly and automatically this information is perceived and attended to.
Affiliation(s)
- A P Martinez-Cedillo
- Department of Psychology, University of York, York YO10 5DD, England; Department of Psychology, University of Essex, Wivenhoe Park, Colchester, Essex CO4 3SQ, England.
- T Foulsham
- Department of Psychology, University of Essex, Wivenhoe Park, Colchester, Essex CO4 3SQ, England

3. Roth N, Rolfs M, Hellwich O, Obermayer K. Objects guide human gaze behavior in dynamic real-world scenes. PLoS Comput Biol 2023; 19:e1011512. PMID: 37883331; PMCID: PMC10602265; DOI: 10.1371/journal.pcbi.1011512.
Abstract
The complexity of natural scenes makes it challenging to experimentally study the mechanisms behind human gaze behavior when viewing dynamic environments. Historically, eye movements were believed to be driven primarily by space-based attention towards locations with salient features. Increasing evidence suggests, however, that visual attention does not select locations with high saliency but operates on attentional units given by the objects in the scene. We present a new computational framework to investigate the importance of objects for attentional guidance. This framework is designed to simulate realistic scanpaths for dynamic real-world scenes, including saccade timing and smooth pursuit behavior. Individual model components are based on psychophysically uncovered mechanisms of visual attention and saccadic decision-making. All mechanisms are implemented in a modular fashion with a small number of well-interpretable parameters. To systematically analyze the importance of objects in guiding gaze behavior, we implemented five different models within this framework: two purely spatial models, where one is based on low-level saliency and one on high-level saliency, two object-based models, with one incorporating low-level saliency for each object and the other one not using any saliency information, and a mixed model with object-based attention and selection but space-based inhibition of return. We optimized each model's parameters to reproduce the saccade amplitude and fixation duration distributions of human scanpaths using evolutionary algorithms. We compared model performance with respect to spatial and temporal fixation behavior, including the proportion of fixations exploring the background, as well as detecting, inspecting, and returning to objects. A model with object-based attention and inhibition, which uses saliency information to prioritize between objects for saccadic selection, leads to scanpath statistics with the highest similarity to the human data. This demonstrates that scanpath models benefit from object-based attention and selection, suggesting that object-level attentional units play an important role in guiding attentional processing.
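As a cartoon of the object-based selection-plus-inhibition mechanism described above (not the authors' framework, which also models saccade timing and smooth pursuit), the toy below picks the next target object with probability proportional to its uninhibited saliency; the decay constants and object saliencies are arbitrary assumptions.

```python
import numpy as np

def next_object(saliencies, inhibition, rng):
    """Pick the next saccade target among objects.

    saliencies: per-object saliency values; inhibition: per-object
    inhibition-of-return weights in [0, 1] (1 = fully inhibited).
    Selection probability is proportional to the uninhibited saliency.
    """
    drive = saliencies * (1.0 - inhibition)
    p = drive / drive.sum()
    return rng.choice(len(saliencies), p=p)

rng = np.random.default_rng(3)
sal = np.array([0.6, 0.3, 0.1])        # three objects in the scene
inhib = np.zeros(3)
scanpath = []
for _ in range(6):
    tgt = next_object(sal, inhib, rng)
    scanpath.append(tgt)
    inhib *= 0.5                        # inhibition decays over time
    inhib[tgt] = 0.9                    # newly fixated object is inhibited
print(scanpath)                         # e.g. alternates between salient objects
```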
Affiliation(s)
- Nicolas Roth
- Cluster of Excellence Science of Intelligence, Technische Universität Berlin, Germany
- Institute of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, Germany
- Martin Rolfs
- Cluster of Excellence Science of Intelligence, Technische Universität Berlin, Germany
- Department of Psychology, Humboldt-Universität zu Berlin, Germany
- Bernstein Center for Computational Neuroscience Berlin, Germany
- Olaf Hellwich
- Cluster of Excellence Science of Intelligence, Technische Universität Berlin, Germany
- Institute of Computer Engineering and Microelectronics, Technische Universität Berlin, Germany
- Klaus Obermayer
- Cluster of Excellence Science of Intelligence, Technische Universität Berlin, Germany
- Institute of Software Engineering and Theoretical Computer Science, Technische Universität Berlin, Germany
- Bernstein Center for Computational Neuroscience Berlin, Germany

4. Kümmerer M, Bethge M. Predicting Visual Fixations. Annu Rev Vis Sci 2023; 9:269-291. PMID: 37419107; DOI: 10.1146/annurev-vision-120822-072528.
Abstract
As we navigate and behave in the world, we are constantly deciding, a few times per second, where to look next. The outcomes of these decisions in response to visual input are comparatively easy to measure as trajectories of eye movements, offering insight into many unconscious and conscious visual and cognitive processes. In this article, we review recent advances in predicting where we look. We focus on evaluating and comparing models: How can we consistently measure how well models predict eye movements, and how can we judge the contribution of different mechanisms? Probabilistic models facilitate a unified approach to fixation prediction that allows us to use the amount of explained information to compare different models across different settings, such as static and video saliency, as well as scanpath prediction. We review how the large variety of saliency maps and scanpath models can be translated into this unifying framework, how much different factors contribute, and how we can select the most informative examples for model comparison. We conclude that the universal scale of information gain offers a powerful tool for the inspection of candidate mechanisms and experimental design that helps us understand the continual decision-making process that determines where we look.
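The "explained information" framing can be made concrete: for a probabilistic fixation model, information gain is the average log-likelihood advantage (in bits per fixation) over a baseline density. The snippet below is a generic sketch of that quantity, not the authors' benchmarking code; the Gaussian toy model and uniform baseline are assumptions.

```python
import numpy as np

def information_gain(model_density, baseline_density, fix_x, fix_y):
    """Average log-likelihood gain (bits/fixation) of a fixation model over a baseline.

    model_density, baseline_density: 2D arrays, each summing to 1 over the image.
    fix_x, fix_y: integer pixel coordinates of observed fixations.
    """
    p_model = model_density[fix_y, fix_x]
    p_base = baseline_density[fix_y, fix_x]
    return float(np.mean(np.log2(p_model) - np.log2(p_base)))

# Toy example: 100x100 image, uniform baseline vs. a Gaussian-blob model.
h, w = 100, 100
yy, xx = np.mgrid[0:h, 0:w]
blob = np.exp(-((xx - 50) ** 2 + (yy - 50) ** 2) / (2 * 15.0 ** 2))
model = blob / blob.sum()
baseline = np.full((h, w), 1.0 / (h * w))
fix_x, fix_y = np.array([50, 52, 48, 60]), np.array([50, 49, 51, 40])
print(information_gain(model, baseline, fix_x, fix_y))  # > 0: model beats baseline
```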
Affiliation(s)
- Matthias Bethge
- Tübingen AI Center, University of Tübingen, Tübingen, Germany

5. Bruckert A, Christie M, Le Meur O. Where to look at the movies: Analyzing visual attention to understand movie editing. Behav Res Methods 2023; 55:2940-2959. PMID: 36002630; DOI: 10.3758/s13428-022-01949-7.
Abstract
In the process of making a movie, directors constantly care about where the spectator will look on the screen. Shot composition, framing, camera movements, and editing are tools commonly used to direct attention. In order to provide a quantitative analysis of the relationship between those tools and gaze patterns, we propose a new eye-tracking database containing gaze-pattern information on movie sequences, as well as editing annotations, and we show how state-of-the-art computational saliency techniques behave on this dataset. In this work, we expose strong links between movie editing and spectators' gaze distributions, and open several leads on how knowledge of editing information could improve human visual attention modeling for cinematic content. The dataset generated and analyzed for this study is available at https://github.com/abruckert/eye_tracking_filmmaking.

6. Pedziwiatr MA, Heer S, Coutrot A, Bex P, Mareschal I. Prior knowledge about events depicted in scenes decreases oculomotor exploration. Cognition 2023; 238:105544. PMID: 37419068; DOI: 10.1016/j.cognition.2023.105544.
Abstract
The visual input that the eyes receive usually contains temporally continuous information about unfolding events. Therefore, humans can accumulate knowledge about their current environment. Typical studies on scene perception, however, involve presenting multiple unrelated images and thereby render this accumulation unnecessary. Our study, instead, facilitated it and explored its effects. Specifically, we investigated how recently accumulated prior knowledge affects gaze behavior. Participants viewed sequences of static film frames that contained several 'context frames' followed by a 'critical frame'. The context frames showed either events from which the situation depicted in the critical frame naturally followed, or events unrelated to this situation. Therefore, participants viewed identical critical frames while possessing prior knowledge that was either relevant or irrelevant to the frames' content. In the latter case, participants' gaze behavior was slightly more exploratory, as revealed by the seven gaze characteristics we analyzed. This result demonstrates that recently gained prior knowledge reduces exploratory eye movements.
Affiliation(s)
- Marek A Pedziwiatr
- School of Biological and Behavioural Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom.
- Sophie Heer
- School of Biological and Behavioural Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom
- Antoine Coutrot
- Univ Lyon, CNRS, INSA Lyon, UCBL, LIRIS, UMR5205, F-69621 Lyon, France
- Peter Bex
- Department of Psychology, Northeastern University, 107 Forsyth Street, Boston, MA 02115, United States of America
- Isabelle Mareschal
- School of Biological and Behavioural Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom

7. Kondyli V, Bhatt M, Levin D, Suchan J. How do drivers mitigate the effects of naturalistic visual complexity? On attentional strategies and their implications under a change blindness protocol. Cogn Res Princ Implic 2023; 8:54. PMID: 37556047; PMCID: PMC10412523; DOI: 10.1186/s41235-023-00501-1.
Abstract
How do the limits of high-level visual processing affect human performance in naturalistic, dynamic settings of (multimodal) interaction, where observers can draw on experience to strategically adapt attention to familiar forms of complexity? Against this backdrop, we investigate change detection in a driving context to study attentional allocation aimed at overcoming environmental complexity and temporal load. Results indicate that visuospatial complexity substantially increases change blindness, but also that participants effectively respond to this load by increasing their focus on safety-relevant events, by adjusting their driving, and by avoiding non-productive forms of attentional elaboration, thereby also controlling "looked-but-failed-to-see" errors. Furthermore, analyses of gaze patterns reveal that drivers occasionally, but effectively, limit attentional monitoring and lingering for irrelevant changes. Overall, the experimental outcomes reveal how drivers exhibit effective attentional compensation in highly complex situations. Our findings have implications for driving education and the development of driving skill-testing methods, as well as for human-factors guided development of AI-based driving assistance systems.
Affiliation(s)
- Vasiliki Kondyli
- CoDesign Lab EU - codesign-lab.org, Örebro University, Örebro, Sweden.
- Mehul Bhatt
- CoDesign Lab EU - codesign-lab.org, Örebro University, Örebro, Sweden
- Jakob Suchan
- German Aerospace Center - DLR, Institute of Systems Engineering for Future Mobility, Oldenburg, Germany

8. Chang Q, Zhu S. Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection. Cognit Comput 2023. DOI: 10.1007/s12559-023-10114-x.

9. Jing M, Kadooka K, Franchak J, Kirkorian HL. The effect of narrative coherence and visual salience on children's and adults' gaze while watching video. J Exp Child Psychol 2023; 226:105562. PMID: 36257254; DOI: 10.1016/j.jecp.2022.105562.
Abstract
Low-level visual features (e.g., motion, contrast) predict eye gaze during video viewing. The current study investigated the effect of narrative coherence on the extent to which low-level visual salience predicts eye gaze. Eye movements were recorded as 4-year-olds (n = 20) and adults (n = 20) watched a cohesive versus random sequence of video shots from a 4.5-min full vignette from Sesame Street. Overall, visual salience was a stronger predictor of gaze in adults than in children, especially when viewing a random shot sequence. The impact of narrative coherence on children's gaze was limited to the short period of time surrounding cuts to new video shots. The discussion considers potential direct effects of visual salience as well as incidental effects due to overlap between salient features and semantic content. The findings are also discussed in the context of developing video comprehension.
Affiliation(s)
- Mengguo Jing
- Department of Human Development and Family Studies, University of Wisconsin-Madison, Madison, WI 53705, USA.
- Kellan Kadooka
- Department of Psychology, University of California, Riverside, Riverside, CA 92521, USA
- John Franchak
- Department of Psychology, University of California, Riverside, Riverside, CA 92521, USA
- Heather L Kirkorian
- Department of Human Development and Family Studies, University of Wisconsin-Madison, Madison, WI 53705, USA

10. Epperlein T, Kovacs G, Oña LS, Amici F, Bräuer J. Context and prediction matter for the interpretation of social interactions across species. PLoS One 2022; 17:e0277783. PMID: 36477294; PMCID: PMC9728876; DOI: 10.1371/journal.pone.0277783.
Abstract
Predictions about others' future actions are crucial during social interactions, in order to react optimally. Another way to assess such interactions is to define the social context of the situations explicitly and categorize them according to their affective content. Here we investigate how humans assess aggressive, playful and neutral interactions between members of three species: human children, dogs and macaques. We presented human participants with short video clips of real-life interactions of dyads of the three species and asked them either to categorize the context of the situation or to predict the outcome of the observed interaction. Participants performed above chance level in assessing social situations in humans, in dogs and in monkeys. How accurately participants predicted and categorized the situations depended both on the species and on the context. Contrary to our hypothesis, participants were not better at assessing aggressive situations than playful or neutral situations. Importantly, participants performed particularly poorly when assessing aggressive behaviour for dogs. Also, participants were not better at assessing social interactions of humans compared to those of other species. We discuss what mechanism humans use to assess social situations and to what extent this skill can also be found in other social species.
Affiliation(s)
- Theresa Epperlein
- DogStudies, Max Planck Institute for Geoanthropology, Jena, Germany
- Department for General Psychology and Cognitive Neuroscience, Friedrich Schiller University of Jena, Jena, Germany
- Gyula Kovacs
- Department of Biological Psychology and Cognitive Neuroscience, Friedrich Schiller University of Jena, Jena, Germany
- Linda S. Oña
- Max Planck Research Group Naturalistic Social Cognition, Max Planck Institute for Human Development, Berlin, Germany
- Federica Amici
- Department of Comparative Cultural Psychology, Max-Planck Institute for Evolutionary Anthropology, Leipzig, Germany
- Behavioral Ecology Research Group, Institute of Biology, Faculty of Life Science, University of Leipzig, Leipzig, Germany
- Juliane Bräuer
- DogStudies, Max Planck Institute for Geoanthropology, Jena, Germany
- Department for General Psychology and Cognitive Neuroscience, Friedrich Schiller University of Jena, Jena, Germany

11. Audio–visual collaborative representation learning for Dynamic Saliency Prediction. Knowl Based Syst 2022. DOI: 10.1016/j.knosys.2022.109675.

12. Broda MD, de Haas B. Individual fixation tendencies in person viewing generalize from images to videos. Iperception 2022; 13:20416695221128844. PMID: 36353505; PMCID: PMC9638695; DOI: 10.1177/20416695221128844.
Abstract
Fixation behavior toward persons in static scenes varies considerably between individuals. However, it is unclear whether these differences generalize to dynamic stimuli. Here, we examined individual differences in the distribution of gaze across seven person features (i.e., body and face parts) in static and dynamic scenes. Forty-four participants freely viewed 700 complex static scenes followed by eight director-cut videos (28,925 frames). We determined the presence of person features using hand-delineated pixel masks (images) and Deep Neural Networks (videos). Results replicated highly consistent individual differences in fixation tendencies for all person features in static scenes and revealed that these tendencies generalize to videos. Individual fixation behavior for both images and videos fell into two anticorrelated clusters representing the tendency to fixate faces versus bodies. These results corroborate a low-dimensional space for individual gaze biases toward persons and show that they generalize from images to videos.
Affiliation(s)
- Maximilian D. Broda
- Department of Experimental Psychology, Justus Liebig University Giessen, Germany; Center for Mind, Brain and Behavior (CMBB), University of Marburg and Justus Liebig University Giessen, Germany
- Benjamin de Haas
- Department of Experimental Psychology, Justus Liebig University Giessen, Germany; Center for Mind, Brain and Behavior (CMBB), University of Marburg and Justus Liebig University Giessen, Germany

13. Lu YJ, Kuo IC, Ho MC. The effects of emotional films and subtitle types on eye movement patterns. Acta Psychol (Amst) 2022; 230:103748. PMID: 36122479; DOI: 10.1016/j.actpsy.2022.103748.
Abstract
BACKGROUND In Taiwan, the use of subtitles is common in TV programs and movies. However, studies on subtitles mostly focus on foreign language learning and film subtitle translation. Few studies address how subtitle types and emotion-laden films affect viewers' eye movement patterns. PURPOSE We aim to examine how the emotion type of film (happy, sad, angry, fear, or neutral) and subtitle type (meaningful subtitle, no subtitle, or meaningless subtitle) affect dwell times and fixation counts in the subtitle area. METHODS This study is a 5 (emotion type of film) × 3 (subtitle type) between-participants design. There were 15 participants per condition, resulting in a total of 225 participants. After watching a film, participants filled out a self-reported questionnaire regarding the film. RESULTS Films with meaningful subtitles elicited more fixation counts and longer dwell times in the subtitle area than films with meaningless or no subtitles. Dwell time on the subtitle area was longer for the sad film than for the neutral and happy films, and longer for the fear film than for the happy film. There were also more fixation counts on the subtitle area for the sad film than for the angry and happy films. CONCLUSIONS The subtitle meaning is critical in directing overt attention. Also, overt attention directed to the subtitle area is affected by the emotion type of the film.
Affiliation(s)
- Yun-Jhen Lu
- Department of Rehabilitation Medicine, Chi Mei Hospital, Chiali, Taiwan
- I-Chun Kuo
- Department of Psychology, Chung-Shan Medical University, Taichung, Taiwan
- Ming-Chou Ho
- Department of Psychology, Chung-Shan Medical University, Taichung, Taiwan; Clinical Psychological Room, Chung-Shan Medical University Hospital, Taichung, Taiwan

14. A Gated Fusion Network for Dynamic Saliency Prediction. IEEE Trans Cogn Dev Syst 2022. DOI: 10.1109/tcds.2021.3094974.

15. Evaluating Eye Movement Event Detection: A Review of the State of the Art. Behav Res Methods 2022. PMID: 35715615; DOI: 10.3758/s13428-021-01763-7.
Abstract
Detecting eye movements in raw eye tracking data is a well-established research area by itself, as well as a common pre-processing step before any subsequent analysis. As in any field, however, progress and successful collaboration can only be achieved provided a shared understanding of the pursued goal. This is often formalised via defining metrics that express the quality of an approach to solving the posed problem. Both the big-picture intuition behind the evaluation strategies and seemingly small implementation details influence the resulting measures, making even studies with outwardly similar procedures essentially incomparable, impeding a common understanding. In this review, we systematically describe and analyse evaluation methods and measures employed in the eye movement event detection field to date. While recently developed evaluation strategies tend to quantify the detector's mistakes at the level of whole eye movement events rather than individual gaze samples, they typically do not separate establishing correspondences between true and predicted events from the quantification of the discovered errors. In our analysis we separate these two steps where possible, enabling their almost arbitrary combinations in an evaluation pipeline. We also present the first large-scale empirical analysis of event matching strategies in the literature, examining these various combinations both in practice and theoretically. We examine the particular benefits and downsides of the evaluation methods, providing recommendations towards more intuitive and informative assessment. We implemented the evaluation strategies on which this work focuses in a single publicly available library: https://github.com/r-zemblys/EM-event-detection-evaluation .
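To illustrate the separation the review draws between establishing correspondences and quantifying errors, here is a minimal event-matching sketch based on temporal intersection-over-union. It is not the evaluation library linked above; the greedy one-to-one matching and the 0.5 threshold are assumptions.

```python
def match_events(true_events, pred_events, iou_threshold=0.5):
    """Greedy one-to-one matching of true and predicted events by temporal IoU.

    Events are (onset, offset) pairs in samples or seconds.
    Returns (hits, misses, false_alarms).
    """
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    unmatched_pred = list(range(len(pred_events)))
    hits = 0
    for t in true_events:
        best_j, best_iou = None, iou_threshold
        for j in unmatched_pred:          # pick the best remaining prediction
            score = iou(t, pred_events[j])
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            hits += 1
            unmatched_pred.remove(best_j)
    misses = len(true_events) - hits
    false_alarms = len(unmatched_pred)
    return hits, misses, false_alarms

# Example: two annotated fixations; the detector finds one plus a spurious event.
true_fix = [(0, 100), (150, 300)]
pred_fix = [(10, 95), (400, 450)]
print(match_events(true_fix, pred_fix))   # (1, 1, 1)
```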

16. Hutson JP, Chandran P, Magliano JP, Smith TJ, Loschky LC. Narrative Comprehension Guides Eye Movements in the Absence of Motion. Cogn Sci 2022; 46:e13131. PMID: 35579883; DOI: 10.1111/cogs.13131.
Abstract
Viewers' attentional selection while looking at scenes is affected by both top-down and bottom-up factors. However, when watching film, viewers typically attend to the movie similarly irrespective of top-down factors, a phenomenon we call the tyranny of film. A key difference between still pictures and film is that film contains motion, which is a strong attractor of attention and highly predictive of gaze during film viewing. The goal of the present study was to test if the tyranny of film is driven by motion. To do this, we created a slideshow presentation of the opening scene of Touch of Evil. Context condition participants watched the full slideshow. No-context condition participants did not see the opening portion of the scene, which showed someone placing a time bomb into the trunk of a car. In prior research, we showed that despite producing very different understandings of the clip, this manipulation did not affect viewers' attention (i.e., the tyranny of film), as both context and no-context participants were equally likely to fixate on the car with the bomb when the scene was presented as a film. The current study found that when the scene was shown as a slideshow, the context manipulation produced differences in attentional selection (i.e., it attenuated attentional synchrony). We discuss these results in the context of the Scene Perception and Event Comprehension Theory, which specifies the relationship between event comprehension and attentional selection in the context of visual narratives.
Affiliation(s)
- John P Hutson
- Department of Learning Sciences, Georgia State University
- Tim J Smith
- Department of Psychological Sciences, Birkbeck, University of London

17. Deep Distributional Sequence Embeddings Based on a Wasserstein Loss. Neural Process Lett 2022. DOI: 10.1007/s11063-022-10784-y.
Abstract
Deep metric learning employs deep neural networks to embed instances into a metric space such that distances between instances of the same class are small and distances between instances from different classes are large. In most existing deep metric learning techniques, the embedding of an instance is given by a feature vector produced by a deep neural network and Euclidean distance or cosine similarity defines distances between these vectors. This paper studies deep distributional embeddings of sequences, where the embedding of a sequence is given by the distribution of learned deep features across the sequence. The motivation for this is to better capture statistical information about the distribution of patterns within the sequence in the embedding. When embeddings are distributions rather than vectors, measuring distances between embeddings involves comparing their respective distributions. The paper therefore proposes a distance metric based on Wasserstein distances between the distributions and a corresponding loss function for metric learning, which leads to a novel end-to-end trainable embedding model. We empirically observe that distributional embeddings outperform standard vector embeddings and that training with the proposed Wasserstein metric outperforms training with other distance functions.
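As a rough illustration of comparing distributional sequence embeddings, the sketch below treats each sequence as the empirical distribution of its per-frame features and compares two sequences with per-dimension 1-D Wasserstein distances; this per-dimension simplification and the averaging across dimensions are assumptions, not the paper's multivariate loss.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sequence_embedding_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Distance between two sequences embedded as distributions of frame features.

    feats_a, feats_b: arrays of shape (n_frames, n_dims) of per-frame features.
    Distributions are compared per dimension with 1-D Wasserstein distances
    and averaged.
    """
    dists = [
        wasserstein_distance(feats_a[:, d], feats_b[:, d])
        for d in range(feats_a.shape[1])
    ]
    return float(np.mean(dists))

# Example: two sequences whose feature distributions differ by a mean shift of 1.
rng = np.random.default_rng(1)
seq_a = rng.normal(0.0, 1.0, (200, 8))
seq_b = rng.normal(1.0, 1.0, (150, 8))
print(round(sequence_embedding_distance(seq_a, seq_b), 2))   # roughly 1.0
```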

18. Nuthmann A, Canas-Bajo T. Visual search in naturalistic scenes from foveal to peripheral vision: A comparison between dynamic and static displays. J Vis 2022; 22:10. PMID: 35044436; PMCID: PMC8802022; DOI: 10.1167/jov.22.1.10.
Abstract
How important foveal, parafoveal, and peripheral vision are depends on the task. For object search and letter search in static images of real-world scenes, peripheral vision is crucial for efficient search guidance, whereas foveal vision is relatively unimportant. Extending this research, we used gaze-contingent Blindspots and Spotlights to investigate visual search in complex dynamic and static naturalistic scenes. In Experiment 1, we used dynamic scenes only, whereas in Experiments 2 and 3, we directly compared dynamic and static scenes. Each scene contained a static, contextually irrelevant target (i.e., a gray annulus). Scene motion was not predictive of target location. For dynamic scenes, the search-time results from all three experiments converge on the novel finding that neither foveal nor central vision was necessary to attain normal search proficiency. Since motion is known to attract attention and gaze, we explored whether guidance to the target was equally efficient in dynamic as compared to static scenes. We found that the very first saccade was guided by motion in the scene. This was not the case for subsequent saccades made during the scanning epoch, representing the actual search process. Thus, effects of task-irrelevant motion were fast-acting and short-lived. Furthermore, when motion was potentially present (Spotlights) or absent (Blindspots) in foveal or central vision only, we observed differences in verification times for dynamic and static scenes (Experiment 2). When using scenes with greater visual complexity and more motion (Experiment 3), however, the differences between dynamic and static scenes were much reduced.
Affiliation(s)
- Antje Nuthmann
- Institute of Psychology, Kiel University, Kiel, Germany
- Psychology Department, School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, UK
- http://orcid.org/0000-0003-3338-3434
- Teresa Canas-Bajo
- Vision Science Graduate Group, University of California, Berkeley, Berkeley, CA, USA
- Psychology Department, School of Philosophy, Psychology and Language Sciences, University of Edinburgh, Edinburgh, UK

19. Chen YT, Ho MC. Eye movement patterns differ while watching captioned videos of second language vs. mathematics lessons. Learn Individ Differ 2022. DOI: 10.1016/j.lindif.2021.102106.

20. Review of Visual Saliency Prediction: Development Process from Neurobiological Basis to Deep Models. Appl Sci (Basel) 2021. DOI: 10.3390/app12010309.
Abstract
The human attention mechanism can be understood and simulated by closely associating the saliency prediction task with neuroscience and psychology. Furthermore, saliency prediction is widely used in computer vision and interdisciplinary subjects. In recent years, with the rapid development of deep learning, deep models have made remarkable achievements in saliency prediction. Deep learning models can automatically learn features, thus overcoming many drawbacks of classic models, such as handcrafted features and task-specific settings. Nevertheless, deep models still have limitations, for example in tasks involving multi-modality and semantic understanding. This study summarizes the relevant achievements in the field of saliency prediction, including the early neurological and psychological mechanisms and the guiding role of classic models, followed by the development process and data comparison of classic and deep saliency prediction models. This study also discusses the relationship between models and human vision, the factors that cause semantic gaps, the influence of attention in cognitive research, the limitations of saliency models, and emerging applications, in order to provide guidance for follow-up work on saliency prediction.

21. Sun H, Roberts AC, Bus A. Bilingual children's visual attention while reading digital picture books and story retelling. J Exp Child Psychol 2021; 215:105327. PMID: 34894472; DOI: 10.1016/j.jecp.2021.105327.
Abstract
This study examined Mandarin-English bilingual children's visual attention over repetitive readings of Mandarin enhanced digital books and static books, as well as the effects visual attention has on story retelling. We assigned 89 4- and 5-year-old preschoolers in Singapore to one of three reading conditions: (a) digital books with visual and auditory enhancements, (b) digital books with only auditory enhancements, and (c) static digital books with neither visual nor auditory enhancements. We presented three stories to the children in four sessions over 2 weeks, traced their visual attention with an eye tracker, and examined their story retelling after the first and fourth readings. The results demonstrated that the digital books with visual and auditory enhancements maintained greater visual attention from children compared with the other two conditions across the four repetitive readings. Moreover, children's bilingual language proficiency significantly modulated the effects of condition on attention. Children with higher bilingual proficiency in the visual and auditory enhancements condition outperformed their peers in the other two conditions in terms of visual attention across most readings. However, for children with lower bilingual proficiency, the digital books with auditory and visual enhancements outperformed only the static condition, not the auditory enhanced condition. Children with lower language proficiency maintained their attention at a relatively high level across the repetitive readings in the enhanced digital book conditions but demonstrated significantly decreased visual attention in the static digital book condition. Because children with better visual attention and higher bilingual proficiency retold the stories significantly better, the results indicate that influencing visual attention helps to improve story comprehension.
Affiliation(s)
- He Sun
- National Institute of Education, Nanyang Technological University, Singapore 637616, Singapore.
- Adriana Bus
- University of Stavanger, 4021 Stavanger, Norway

22. Essex C, Gliga T, Singh M, Smith TJ. Understanding the differential impact of children's TV on executive functions: a narrative-processing analysis. Infant Behav Dev 2021; 66:101661. PMID: 34784571; DOI: 10.1016/j.infbeh.2021.101661.
Abstract
Evidence from multiple empirical studies suggests children's Executive Functions are depleted immediately after viewing some types of TV content but not others. Correlational evidence suggests any such effects may be most problematic during the pre-school years. To establish whether "screen-time" is developmentally appropriate at this age, we believe a nuanced approach must be taken to the analysis of individual pieces of media and their potential demands on viewer cognition. To this end we apply a cognitive theory of visual narrative processing, the Scene Perception and Event Comprehension Theory (SPECT; Loschky, Larson, Smith, & Magliano, 2020), to the analysis of TV shows previously used to investigate short-term effects of TV viewing. A theoretical formalisation of individual content properties, together with a quantitative content-based analysis of previously used children's content (Lillard & Peterson, 2011; Lillard et al., 2015b), is presented. This analysis found a pattern of greater stimulus saliency, increased situational change, and a greater combined presence of cognitively demanding features for videos previously shown to reduce children's EF after viewing. Limitations of this pilot application of SPECT are presented, and proposals for future empirical investigations of the psychological mechanisms activated by specific TV viewing content are considered.
Affiliation(s)
- Claire Essex
- Centre for Brain and Cognitive Development, Birkbeck, University of London, UK.
- Maninda Singh
- School of Psychology, University of Surrey, Guildford, Surrey, UK
- Tim J Smith
- Department of Psychological Sciences, Birkbeck, University of London, UK

23. Liu H, Hu X, Ren Y, Wang L, Guo L, Guo CC, Han J. Neural Correlates of Interobserver Visual Congruency in Free-Viewing Condition. IEEE Trans Cogn Dev Syst 2021. DOI: 10.1109/tcds.2020.3002765.

24. Smith ME, Loschky LC, Bailey HR. Knowledge guides attention to goal-relevant information in older adults. Cogn Res Princ Implic 2021; 6:56. PMID: 34406505; PMCID: PMC8374018; DOI: 10.1186/s41235-021-00321-1.
Abstract
How does viewers’ knowledge guide their attention while they watch everyday events, how does it affect their memory, and does it change with age? Older adults have diminished episodic memory for everyday events, but intact semantic knowledge. Indeed, research suggests that older adults may rely on their semantic memory to offset impairments in episodic memory, and when relevant knowledge is lacking, older adults’ memory can suffer. Yet, the mechanism by which prior knowledge guides attentional selection when watching dynamic activity is unclear. To address this, we studied the influence of knowledge on attention and memory for everyday events in young and older adults by tracking their eyes while they watched videos. The videos depicted activities that older adults perform more frequently than young adults (balancing a checkbook, planting flowers) or activities that young adults perform more frequently than older adults (installing a printer, setting up a video game). Participants completed free recall, recognition, and order memory tests after each video. We found age-related memory deficits when older adults had little knowledge of the activities, but memory did not differ between age groups when older adults had relevant knowledge and experience with the activities. Critically, results showed that knowledge influenced where viewers fixated when watching the videos. Older adults fixated less goal-relevant information compared to young adults when watching young adult activities, but they fixated goal-relevant information similarly to young adults, when watching more older adult activities. Finally, results showed that fixating goal-relevant information predicted free recall of the everyday activities for both age groups. Thus, older adults may use relevant knowledge to more effectively infer the goals of actors, which guides their attention to goal-relevant actions, thus improving their episodic memory for everyday activities.
Affiliation(s)
- Maverick E Smith
- Department of Psychological Sciences, Kansas State University, 471 Bluemont Hall, 1100 Mid-campus Dr., Manhattan, KS, 66506, USA.
- Lester C Loschky
- Department of Psychological Sciences, Kansas State University, 471 Bluemont Hall, 1100 Mid-campus Dr., Manhattan, KS, 66506, USA
- Heather R Bailey
- Department of Psychological Sciences, Kansas State University, 471 Bluemont Hall, 1100 Mid-campus Dr., Manhattan, KS, 66506, USA

25. Gaze Behavior Effect on Gaze Data Visualization at Different Abstraction Levels. Sensors (Basel) 2021; 21:4686. PMID: 34300425; PMCID: PMC8309511; DOI: 10.3390/s21144686.
Abstract
Many gaze data visualization techniques intuitively show eye movement together with visual stimuli. The eye tracker records a large number of eye movements within a short period. Therefore, visualizing raw gaze data with the visual stimulus appears complicated and obscured, making it difficult to gain insight through visualization. To avoid this complication, we often employ fixation identification algorithms for more abstract visualizations. In the past, many scientists have focused on gaze data abstraction with the attention map and analyzed detailed gaze movement patterns with scanpath visualization. Abstract eye movement patterns change dramatically depending on the fixation identification algorithms used in preprocessing. However, it is difficult to find out how fixation identification algorithms affect gaze movement pattern visualizations. Additionally, scientists often spend much time manually adjusting parameters in the fixation identification algorithms. In this paper, we propose a gaze behavior-based data processing method for abstract gaze data visualization. The proposed method classifies raw gaze data using machine learning models for image classification, such as CNN, AlexNet, and LeNet. Additionally, we compare velocity-based identification (I-VT), dispersion-based identification (I-DT), density-based fixation identification, velocity- and dispersion-based identification (I-VDT), and the machine learning-based, behavior-based models on various visualizations at each abstraction level, such as attention map, scanpath, and abstract gaze movement visualization.
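For reference, the velocity-threshold identification (I-VT) step mentioned above can be sketched in a few lines; the 30 deg/s threshold and the simple point-to-point velocity estimate are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def ivt_fixations(x, y, t, velocity_threshold=30.0):
    """Velocity-threshold identification (I-VT) of fixation events.

    x, y: gaze position in degrees of visual angle; t: timestamps in seconds.
    Samples whose point-to-point velocity is below `velocity_threshold` (deg/s)
    are fixation samples; runs of fixation samples become (onset, offset) index pairs.
    """
    vel = np.hypot(np.diff(x), np.diff(y)) / np.diff(t)       # deg/s between samples
    is_fix = np.concatenate([[vel[0] < velocity_threshold],   # label first sample like its first step
                             vel < velocity_threshold])
    events, start = [], None
    for i, fix in enumerate(is_fix):
        if fix and start is None:
            start = i
        elif not fix and start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(is_fix) - 1))
    return events
```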

26. Wang W, Shen J, Lu X, Hoi SCH, Ling H. Paying Attention to Video Object Pattern Understanding. IEEE Trans Pattern Anal Mach Intell 2021; 43:2413-2428. PMID: 31940522; DOI: 10.1109/tpami.2020.2966453.
Abstract
This paper conducts a systematic study on the role of visual attention in video object pattern understanding by elaborately annotating three popular video segmentation datasets (DAVIS 16, Youtube-Objects, and SegTrack V2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting. For the first time, we quantitatively verified the high consistency of visual attention behavior among human observers, and found strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. Such novel observations provide an in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major advantages: 1) modular training without using expensive video segmentation annotations; instead, we use more affordable dynamic fixation data to train the initial video attention module and existing fixation-segmentation paired static/image data to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance compared with state-of-the-art methods and enjoys fast processing speed (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS.

27. Levin DT, Salas JA, Wright AM, Seiffert AE, Carter KE, Little JW. The Incomplete Tyranny of Dynamic Stimuli: Gaze Similarity Predicts Response Similarity in Screen-Captured Instructional Videos. Cogn Sci 2021; 45:e12984. PMID: 34170026; DOI: 10.1111/cogs.12984.
Abstract
Although eye tracking has been used extensively to assess cognitions for static stimuli, recent research suggests that the link between gaze and cognition may be more tenuous for dynamic stimuli such as videos. Part of the difficulty in convincingly linking gaze with cognition is that in dynamic stimuli, gaze position is strongly influenced by exogenous cues such as object motion. However, tests of the gaze-cognition link in dynamic stimuli have been done on only a limited range of stimuli often characterized by highly organized motion. Also, analyses of cognitive contrasts between participants have mostly been limited to categorical contrasts among small numbers of participants, which may have limited the power to observe more subtle influences. We, therefore, tested for cognitive influences on gaze for screen-captured instructional videos, the contents of which participants were tested on. Between-participant scanpath similarity predicted between-participant similarity in responses on test questions, but with imperfect consistency across videos. We also observed that basic gaze parameters and measures of attention to centers of interest only inconsistently predicted learning, and that correlations between gaze and centers of interest defined by other-participant gaze and cursor movement did not predict learning. It, therefore, appears that the search for eye movement indices of cognition during dynamic naturalistic stimuli may be fruitful, but we also agree that the tyranny of dynamic stimuli is real, and that links between eye movements and cognition are highly dependent on task and stimulus properties.
Affiliation(s)
- Daniel T Levin
- Department of Psychology and Human Development, Vanderbilt University
- Jorge A Salas
- Department of Psychology and Human Development, Vanderbilt University
- Anna M Wright
- Department of Psychology and Human Development, Vanderbilt University
- Kelly E Carter
- Department of Psychology and Human Development, Vanderbilt University
- Joshua W Little
- Department of Psychology and Human Development, Vanderbilt University

28. Balzarotti S, Cavaletti F, D'Aloia A, Colombo B, Cardani E, Ciceri MR, Antonietti A, Eugeni R. The Editing Density of Moving Images Influences Viewers' Time Perception: The Mediating Role of Eye Movements. Cogn Sci 2021; 45:e12969. PMID: 33844350; DOI: 10.1111/cogs.12969.
Abstract
The present study examined whether cinematographic editing density affects viewers' perception of time. As a second aim, based on embodied models that conceive of time perception as strictly connected to movement, we tested the hypothesis that the editing density of moving images also affects viewers' eye movements and that these, in turn, mediate the effect of editing density on viewers' temporal judgments. Seventy participants watched nine video clips edited by manipulating the number of cuts (slow- and fast-paced editing against a master-shot, unedited condition). For each editing density, multiple video clips were created, representing three different kinds of routine actions. The participants' eye movements were recorded while watching the videos, and the participants were asked to report duration judgments and subjective passage-of-time judgments after watching each clip. The results showed that participants subjectively perceived that time flew more while watching fast-paced edited videos than slow-paced or unedited videos; by contrast, concerning duration judgments, participants overestimated the duration of fast-paced videos compared to the master-shot videos. Both the slow- and the fast-paced editing generated shorter fixations than the master shot, and the fast-paced editing led to shorter fixations than the slow-paced editing. Finally, compared to the unedited condition, editing led to an overestimation of durations through increased eye mobility. These findings suggest that increasing the editing density of moving images by adding cuts effectively altered viewers' experience of time, and they add further evidence to prior research showing that eye movements are associated with temporal judgments.
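The mediation logic tested here (editing density → eye movements → time judgments) corresponds to a standard product-of-coefficients analysis. The OLS sketch below is a generic illustration under that assumption, not the authors' statistical model, which would typically add bootstrapped confidence intervals and repeated-measures structure; the toy data and effect sizes are invented.

```python
import numpy as np

def simple_mediation(x, m, y):
    """Ordinary least squares mediation for x -> m -> y.

    Returns a (x->m slope), b (m->y slope controlling for x),
    the indirect effect a*b, and the total effect c (x->y slope).
    """
    X1 = np.column_stack([np.ones_like(x), x])
    a = np.linalg.lstsq(X1, m, rcond=None)[0][1]
    c = np.linalg.lstsq(X1, y, rcond=None)[0][1]
    X2 = np.column_stack([np.ones_like(x), x, m])
    b = np.linalg.lstsq(X2, y, rcond=None)[0][2]
    return a, b, a * b, c

# Toy data: denser editing -> shorter fixations -> longer perceived duration.
rng = np.random.default_rng(2)
editing = rng.uniform(0, 1, 200)                                  # cuts per second (arbitrary units)
fix_dur = 0.4 - 0.2 * editing + rng.normal(0, 0.03, 200)          # mean fixation duration (s)
duration_judgment = 10 - 8 * fix_dur + rng.normal(0, 0.5, 200)    # judged duration (s)
print(simple_mediation(editing, fix_dur, duration_judgment))
```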
Affiliation(s)
- Adriano D'Aloia
- Department of Letters, Philosophy, Communication, University of Bergamo
- Elisa Cardani
- Department of Psychology, Università Cattolica del Sacro Cuore
- Ruggero Eugeni
- Department of Communication and Performing Arts, Università Cattolica del Sacro Cuore

29. Ringer RV. Investigating Visual Crowding of Objects in Complex Real-World Scenes. Iperception 2021; 12:2041669521994150. PMID: 35145614; PMCID: PMC8822316; DOI: 10.1177/2041669521994150.
Abstract
Visual crowding, the impairment of object recognition in peripheral vision due to flanking objects, has generally been studied using simple stimuli on blank backgrounds. While crowding is widely assumed to occur in natural scenes, it has not been shown rigorously yet. Given that scene contexts can facilitate object recognition, crowding effects may be dampened in real-world scenes. Therefore, this study investigated crowding using objects in computer-generated real-world scenes. In two experiments, target objects were presented with four flanker objects placed uniformly around the target. Previous research indicates that crowding occurs when the distance between the target and flanker is approximately less than half the retinal eccentricity of the target. In each image, the spacing between the target and flanker objects was varied considerably above or below the standard (0.5) threshold to either suppress or facilitate the crowding effect. Experiment 1 cued the target location and then briefly flashed the scene image before participants could move their eyes. Participants then selected the target object's category from a 15-alternative forced choice response set (including all objects shown in the scene). Experiment 2 used eye tracking to ensure participants were centrally fixating at the beginning of each trial and showed the image for the duration of the participant's fixation. Both experiments found object recognition accuracy decreased with smaller spacing between targets and flanker objects. Thus, this study rigorously shows crowding of objects in semantically consistent real-world scenes.
Affiliation(s)
- Ryan V. Ringer
- Department of Psychology, Wichita State University, Wichita, Kansas, United States

30. Cortical Activity Linked to Clocking in Deaf Adults: fNIRS Insights with Static and Animated Stimuli Presentation. Brain Sci 2021; 11:196. PMID: 33562848; PMCID: PMC7914875; DOI: 10.3390/brainsci11020196.
Abstract
The question of the possible impact of deafness on temporal processing remains unanswered. Different findings, based on behavioral measures, show contradictory results. The goal of the present study is to analyze the brain activity underlying time estimation by using functional near infrared spectroscopy (fNIRS) techniques, which allow examination of the frontal, central and occipital cortical areas. A total of 37 participants (19 deaf) were recruited. The experimental task involved processing a road scene to determine whether the driver had time to safely execute a driving task, such as overtaking. The road scenes were presented in animated format, or in sequences of 3 static images showing the beginning, mid-point, and end of a situation. The latter presentation required a clocking mechanism to estimate the time between the samples to evaluate vehicle speed. The results show greater frontal region activity in deaf people, which suggests that more cognitive effort is needed to process these scenes. The central region, which is involved in clocking according to several studies, is particularly activated by the static presentation in deaf people during the estimation of time lapses. Exploration of the occipital region yielded no conclusive results. Our results on the frontal and central regions encourage further study of the neural basis of time processing and its links with auditory capacity.
Collapse
|
31
|
Borji A. Saliency Prediction in the Deep Learning Era: Successes and Limitations. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:679-700. [PMID: 31425064 DOI: 10.1109/tpami.2019.2935715] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
Visual saliency models have enjoyed a big leap in performance in recent years, thanks to advances in deep learning and large-scale annotated data. Despite enormous effort and huge breakthroughs, however, models still fall short of human-level accuracy. In this work, I explore the landscape of the field, with an emphasis on new deep saliency models, benchmarks, and datasets. A large number of image and video saliency models are reviewed and compared over two image benchmarks and two large-scale video datasets. Further, I identify factors that contribute to the gap between models and humans and discuss the remaining issues that need to be addressed to build the next generation of more powerful saliency models. Some specific questions that are addressed include: in what ways current models fail, how to remedy them, what can be learned from cognitive studies of attention, how explicit saliency judgments relate to fixations, how to conduct fair model comparison, and what the emerging applications of saliency models are.
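Benchmark comparisons of the kind reviewed here typically score a predicted saliency map against recorded human fixations with standard metrics such as the Normalized Scanpath Saliency (NSS). A minimal sketch of NSS on toy data (the metric is standard; none of this is code from the review):

```python
import numpy as np

def nss(saliency_map, fixation_mask):
    """Normalized Scanpath Saliency: mean z-scored saliency value at fixated pixels."""
    z = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return z[fixation_mask.astype(bool)].mean()

# Toy example: a Gaussian blob 'prediction' scored against two fixations
h, w = 64, 64
ys, xs = np.mgrid[0:h, 0:w]
prediction = np.exp(-((ys - 32) ** 2 + (xs - 32) ** 2) / (2 * 8.0 ** 2))
fixations = np.zeros((h, w))
fixations[30, 33] = fixations[50, 10] = 1  # one fixation near the blob, one far from it
print(f"NSS = {nss(prediction, fixations):.2f}")
```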
Collapse
|
32
|
Drivers use active gaze to monitor waypoints during automated driving. Sci Rep 2021; 11:263. [PMID: 33420150 PMCID: PMC7794576 DOI: 10.1038/s41598-020-80126-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2020] [Accepted: 12/14/2020] [Indexed: 11/08/2022] Open
Abstract
Automated vehicles (AVs) will change the role of the driver, from actively controlling the vehicle to primarily monitoring it. Removing the driver from the control loop could fundamentally change the way that drivers sample visual information from the scene, and in particular, alter the gaze patterns generated when under AV control. To better understand how automation affects gaze patterns this experiment used tightly controlled experimental conditions with a series of transitions from 'Manual' control to 'Automated' vehicle control. Automated trials were produced using either a 'Replay' of the driver's own steering trajectories or standard 'Stock' trials that were identical for all participants. Gaze patterns produced during Manual and Automated conditions were recorded and compared. Overall the gaze patterns across conditions were very similar, but detailed analysis shows that drivers looked slightly further ahead (increased gaze time headway) during Automation with only small differences between Stock and Replay trials. A novel mixture modelling method decomposed gaze patterns into two distinct categories and revealed that the gaze time headway increased during Automation. Further analyses revealed that while there was a general shift to look further ahead (and fixate the bend entry earlier) when under automated vehicle control, similar waypoint-tracking gaze patterns were produced during Manual driving and Automation. The consistency of gaze patterns across driving modes suggests that active-gaze models (developed for manual driving) might be useful for monitoring driver engagement during Automated driving, with deviations in gaze behaviour from what would be expected during manual control potentially indicating that a driver is not closely monitoring the automated system.
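The mixture-modelling step described, decomposing the distribution of gaze time headway into two distinct categories, can be sketched with a two-component Gaussian mixture. A minimal illustration on synthetic headway values (not the authors' model, data, or parameterisation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic gaze time headway (seconds): a near-road component and a far-waypoint component
headway = np.concatenate([rng.normal(1.0, 0.2, 500), rng.normal(2.5, 0.4, 300)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(headway.reshape(-1, 1))
for k in np.argsort(gmm.means_.ravel()):
    print(f"component mean = {gmm.means_.ravel()[k]:.2f} s, weight = {gmm.weights_[k]:.2f}")
```

Comparing the fitted component means and weights between Manual and Automated trials would then quantify the shift towards looking further ahead.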
Collapse
|
33
|
Wang W, Shen J, Xie J, Cheng MM, Ling H, Borji A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:220-237. [PMID: 31247542 DOI: 10.1109/tpami.2019.2924417] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, a long-standing need in this field. DHF1K consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye-tracking device. The videos span a wide range of scenes, motions, object types and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
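The supervised attention mechanism described, a static-saliency attention map that reweights per-frame CNN features before the recurrent stage, can be sketched as a small PyTorch module. A simplified illustration of the general idea only, not the ACLNet implementation:

```python
import torch
import torch.nn as nn

class StaticAttention(nn.Module):
    """Predict a single-channel spatial attention map and use it to reweight
    per-frame CNN features before they are passed to a recurrent model."""
    def __init__(self, channels):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, kernel_size=1),
            nn.Sigmoid(),  # attention weights in [0, 1]
        )

    def forward(self, feats):            # feats: (batch, channels, H, W)
        a = self.att(feats)              # (batch, 1, H, W)
        return feats + feats * a, a      # residual reweighting, plus the map itself

feats = torch.randn(2, 64, 28, 28)       # dummy per-frame features
weighted, att_map = StaticAttention(64)(feats)
print(weighted.shape, att_map.shape)     # (2, 64, 28, 28) and (2, 1, 28, 28)
```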
Collapse
|
34
|
Szita K. The Effects of Smartphone Spectatorship on Attention, Arousal, Engagement, and Comprehension. Iperception 2021; 12:2041669521993140. [PMID: 33680420 PMCID: PMC7900791 DOI: 10.1177/2041669521993140] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Accepted: 01/18/2021] [Indexed: 11/15/2022] Open
Abstract
The popularity of watching movies and videos on handheld devices is rising, yet little attention has been paid to its impact on viewer behaviour. Smartphone spectatorship is characterized by the small handheld screen as well as the viewing environment where various unrelated stimuli can occur, providing possible distractions from viewing. Previous research suggests that screen size, handheld control, and external stimuli can affect viewing experience; however, no prior studies have combined these factors or applied them for the specific case of smartphones. In the present study, we compared smartphone and large-screen viewing of feature films in the presence and absence of external distractors. Using a combination of eye tracking, electrodermal activity measures, self-reports, and recollection accuracy tests, we measured smartphone-accustomed viewers' attention, arousal, engagement, and comprehension. The results revealed the impact of viewing conditions on eye movements, gaze dispersion, electrodermal activity, self-reports of engagement, as well as comprehension. These findings show that smartphone viewing is more effective when there are no distractions, and smartphone viewers are more likely to be affected by external stimuli. In addition, watching large stationary screens in designated viewing environments increases engagement with a movie.
Collapse
Affiliation(s)
- Kata Szita
- Kata Szita, University of Gothenburg, Box 200, Goteborg 405 30, Sweden.
| |
Collapse
|
35
|
Zhang K, Chen Z, Liu S. A Spatial-Temporal Recurrent Neural Network for Video Saliency Prediction. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2020; 30:572-587. [PMID: 33206602 DOI: 10.1109/tip.2020.3036749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this paper, a recurrent neural network is designed for video saliency prediction considering spatial-temporal features. In our work, video frames are routed through the static network for spatial features and the dynamic network for temporal features. For the spatial-temporal feature integration, a novel select and re-weight fusion model is proposed which can learn and adjust the fusion weights based on the spatial and temporal features in different scenes automatically. Finally, an attention-aware convolutional long short term memory (ConvLSTM) network is developed to predict salient regions based on the features extracted from consecutive frames and generate the ultimate saliency map for each video frame. The proposed method is compared with state-of-the-art saliency models on five public video saliency benchmark datasets. The experimental results demonstrate that our model can achieve advanced performance on video saliency prediction.
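The attention-aware ConvLSTM at the heart of this model builds on the generic ConvLSTM cell, which replaces the matrix multiplications of a standard LSTM with convolutions so that hidden and cell states keep their spatial layout. A minimal cell sketch (the generic formulation, not the authors' network):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic ConvLSTM cell: all four gates are computed with a single
    convolution over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g                # update cell state
        h = o * torch.tanh(c)            # new hidden state stays (B, C, H, W)
        return h, c

cell = ConvLSTMCell(in_ch=32, hid_ch=16)
h = c = torch.zeros(1, 16, 28, 28)
for _ in range(5):                       # run five dummy 'frames' through the cell
    h, c = cell(torch.randn(1, 32, 28, 28), (h, c))
print(h.shape)                           # torch.Size([1, 16, 28, 28])
```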
Collapse
|
36
|
Völter CJ, Karl S, Huber L. Dogs accurately track a moving object on a screen and anticipate its destination. Sci Rep 2020; 10:19832. [PMID: 33199751 PMCID: PMC7670446 DOI: 10.1038/s41598-020-72506-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2020] [Accepted: 09/02/2020] [Indexed: 11/12/2022] Open
Abstract
The prediction of upcoming events is of importance not only to humans and non-human primates but also to other animals that live in complex environments with lurking threats or moving prey. In this study, we examined motion tracking and anticipatory looking in dogs in two eye-tracking experiments. In Experiment 1, we presented pet dogs (N = 14) with a video depicting how two players threw a Frisbee back and forth multiple times. The horizontal movement of the Frisbee explained a substantial amount of variance of the dogs' horizontal eye movements. With increasing duration of the video, the dogs looked at the catcher before the Frisbee arrived. In Experiment 2, we showed the dogs (N = 12) the same video recording. This time, however, we froze and rewound parts of the video to examine how the dogs would react to surprising events (i.e., the Frisbee hovering in midair and reversing its direction). The Frisbee again captured the dogs' attention, particularly when the video was frozen and rewound for the first time. Additionally, the dogs looked faster at the catcher when the video moved forward compared to when it was rewound. We conclude that motion tracking and anticipatory looking paradigms provide promising tools for future cognitive research with canids.
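The link reported between the Frisbee's horizontal position and the dogs' horizontal gaze, including anticipatory looking, can be quantified by correlating gaze with the object's position at a range of temporal lags. A minimal sketch on synthetic traces (illustrative only; nothing here reflects the authors' analysis pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(0, 30, 1 / 60)                               # 30 s sampled at 60 Hz
obj_x = np.sin(2 * np.pi * t / 4)                          # object sweeping left and right
gaze_x = np.roll(obj_x, -12) + rng.normal(0, 0.1, t.size)  # synthetic gaze leading by ~200 ms

def r2_at_lag(gaze, obj, lag):
    """R^2 between gaze[t] and obj[t + lag]; a positive lag means gaze anticipates."""
    if lag > 0:
        g, o = gaze[:-lag], obj[lag:]
    elif lag < 0:
        g, o = gaze[-lag:], obj[:lag]
    else:
        g, o = gaze, obj
    return np.corrcoef(g, o)[0, 1] ** 2

best = max(range(-30, 31), key=lambda lag: r2_at_lag(gaze_x, obj_x, lag))
print(f"best lag ~ {best / 60 * 1000:.0f} ms ahead, R^2 = {r2_at_lag(gaze_x, obj_x, best):.2f}")
```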
Collapse
Affiliation(s)
- Christoph J Völter
- Messerli Research Institute, University of Veterinary Medicine Vienna, Vienna, Austria.
- Messerli Research Institute, Medical University of Vienna, Vienna, Austria.
- Messerli Research Institute, University of Vienna, Vienna, Austria.
| | - Sabrina Karl
- Messerli Research Institute, University of Veterinary Medicine Vienna, Vienna, Austria
- Messerli Research Institute, Medical University of Vienna, Vienna, Austria
- Messerli Research Institute, University of Vienna, Vienna, Austria
| | - Ludwig Huber
- Messerli Research Institute, University of Veterinary Medicine Vienna, Vienna, Austria
- Messerli Research Institute, Medical University of Vienna, Vienna, Austria
- Messerli Research Institute, University of Vienna, Vienna, Austria
| |
Collapse
|
37
|
Jiang L, Xu M, Wang Z, Sigal L. DeepVS2.0: A Saliency-Structured Deep Learning Method for Predicting Dynamic Visual Attention. Int J Comput Vis 2020. [DOI: 10.1007/s11263-020-01371-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
38
|
Magliano JP, Kurby CA, Ackerman T, Garlitch SM, Stewart JM. Lights, camera, action: the role of editing and framing on the processing of filmed events. JOURNAL OF COGNITIVE PSYCHOLOGY 2020. [DOI: 10.1080/20445911.2020.1796685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Affiliation(s)
- Joseph P. Magliano
- Department of Learning Sciences, Georgia State University, Atlanta, GA, USA
| | | | - Thomas Ackerman
- School of Filmmaking, The University of North Carolina School of the Arts, Winston-Salem, NC, USA
| | - Sydney M. Garlitch
- Department of Psychology, University of North Carolina at Greensboro, Greensboro, NC, USA
| | - J. Mac Stewart
- Department of Psychology, Grand Valley State University, Grand Rapids, MI, USA
| |
Collapse
|
39
|
Recent Advances in Saliency Estimation for Omnidirectional Images, Image Groups, and Video Sequences. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10155143] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Abstract
We present a review of methods for automatic estimation of visual saliency: the perceptual property that makes specific elements in a scene stand out and grab the attention of the viewer. We focus on domains that are especially recent and relevant, as they make saliency estimation particularly useful and/or effective: omnidirectional images, image groups for co-saliency, and video sequences. For each domain, we perform a selection of recent methods, we highlight their commonalities and differences, and describe their unique approaches. We also report and analyze the datasets involved in the development of such methods, in order to reveal additional peculiarities of each domain, such as the representation used for the ground truth saliency information (scanpaths, saliency maps, or salient object regions). We define domain-specific evaluation measures, and provide quantitative comparisons on the basis of common datasets and evaluation criteria, highlighting the different impact of existing approaches on each domain. We conclude by synthesizing the emerging directions for research in the specialized literature, which include novel representations for omnidirectional images, inter- and intra- image saliency decomposition for co-saliency, and saliency shift for video saliency estimation.
Collapse
|
40
|
Rubo M, Gamer M. Stronger reactivity to social gaze in virtual reality compared to a classical laboratory environment. Br J Psychol 2020; 112:301-314. [PMID: 32484935 DOI: 10.1111/bjop.12453] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2019] [Revised: 05/05/2020] [Indexed: 12/31/2022]
Abstract
People show a robust tendency to gaze at other human beings when viewing images or videos, but have also been found to relatively avoid gazing at others in several real-world situations. This discrepancy, along with theoretical considerations, has spawned doubts about the appropriateness of classical laboratory-based experimental paradigms in social attention research. Several researchers have instead suggested the use of immersive virtual scenarios for eliciting and measuring naturalistic attentional patterns, but the field, struggling with methodological challenges, still needs to establish the advantages of this approach. Here, we show, using eye tracking in a complex social scenario displayed in virtual reality, that participants show enhanced attention towards the face of an avatar at near distance and demonstrate an increased reactivity towards her social gaze compared to participants who viewed the same scene on a computer monitor. The present study suggests that reactive virtual agents observed in immersive virtual reality can elicit natural modes of information processing and can help to conduct ecologically more valid experiments while maintaining high experimental control.
Collapse
Affiliation(s)
- Marius Rubo
- Department of Psychology, Freiburg University, Switzerland
- Department of Psychology, Julius Maximilians University of Würzburg, Germany
| | - Matthias Gamer
- Department of Psychology, Julius Maximilians University of Würzburg, Germany
| |
Collapse
|
41
|
Zhang Q, Wang X, Wang S, Sun Z, Kwong S, Jiang J. Learning to Explore Saliency for Stereoscopic Videos via Component-Based Interaction. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2020; 29:5722-5736. [PMID: 32286984 DOI: 10.1109/tip.2020.2985531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this paper, we devise a saliency prediction model for stereoscopic videos that learns to explore saliency inspired by the component-based interactions including spatial, temporal, as well as depth cues. The model first takes advantage of specific structure of 3D residual network (3D-ResNet) to model the saliency driven by spatio-temporal coherence from consecutive frames. Subsequently, the saliency inferred by implicit-depth is automatically derived based on the displacement correlation between left and right views by leveraging a deep convolutional network (ConvNet). Finally, a component-wise refinement network is devised to produce final saliency maps over time by aggregating saliency distributions obtained from multiple components. In order to further facilitate research towards stereoscopic video saliency, we create a new dataset including 175 stereoscopic video sequences with diverse content, as well as their dense eye fixation annotations. Extensive experiments support that our proposed model can achieve superior performance compared to the state-of-the-art methods on all publicly available eye fixation datasets.
Collapse
|
42
|
Haensel JX, Danvers M, Ishikawa M, Itakura S, Tucciarelli R, Smith TJ, Senju A. Culture modulates face scanning during dyadic social interactions. Sci Rep 2020; 10:1958. [PMID: 32029826 PMCID: PMC7005015 DOI: 10.1038/s41598-020-58802-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2019] [Accepted: 12/11/2019] [Indexed: 11/08/2022] Open
Abstract
Recent studies have revealed significant cultural modulations on face scanning strategies, thereby challenging the notion of universality in face perception. Current findings are based on screen-based paradigms, which offer high degrees of experimental control, but lack critical characteristics common to social interactions (e.g., social presence, dynamic visual saliency), and complementary approaches are required. The current study used head-mounted eye tracking techniques to investigate the visual strategies for face scanning in British/Irish (in the UK) and Japanese adults (in Japan) who were engaged in dyadic social interactions with a local research assistant. We developed novel computational data pre-processing tools and data-driven analysis techniques based on Monte Carlo permutation testing. The results revealed significant cultural differences in face scanning during social interactions for the first time, with British/Irish participants showing increased mouth scanning and the Japanese group engaging in greater eye and central face looking. Both cultural groups further showed more face orienting during periods of listening relative to speaking, and during the introduction task compared to a storytelling game, thereby replicating previous studies testing Western populations. Altogether, these findings point to the significant role of postnatal social experience in specialised face perception and highlight the adaptive nature of the face processing system.
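The Monte Carlo permutation testing described for comparing the two cultural groups can be sketched as a simple label-shuffling test on a per-participant gaze measure. A minimal illustration on synthetic data with hypothetical values (not the authors' data-driven pipeline):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical per-participant proportion of face-looking time spent on the mouth
group_a = rng.normal(0.35, 0.08, 20)   # e.g. British/Irish sample (made-up values)
group_b = rng.normal(0.25, 0.08, 20)   # e.g. Japanese sample (made-up values)

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])
n_perm, exceed = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                # break the group labels
    diff = pooled[:group_a.size].mean() - pooled[group_a.size:].mean()
    exceed += abs(diff) >= abs(observed)

print(f"observed difference = {observed:.3f}, two-sided permutation p = {exceed / n_perm:.4f}")
```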
Collapse
Affiliation(s)
- Jennifer X Haensel
- Birkbeck, University of London, Department of Psychological Sciences, London, WC1E 7HX, United Kingdom.
| | - Matthew Danvers
- Birkbeck, University of London, Department of Psychological Sciences, London, WC1E 7HX, United Kingdom
| | | | - Shoji Itakura
- Kyoto University, Department of Psychology, Kyoto, 606-8501, Japan
| | - Raffaele Tucciarelli
- Birkbeck, University of London, Department of Psychological Sciences, London, WC1E 7HX, United Kingdom
| | - Tim J Smith
- Birkbeck, University of London, Department of Psychological Sciences, London, WC1E 7HX, United Kingdom
| | - Atsushi Senju
- Birkbeck, University of London, Department of Psychological Sciences, London, WC1E 7HX, United Kingdom
| |
Collapse
|
43
|
Rösler L, Rubo M, Gamer M. Artificial Faces Predict Gaze Allocation in Complex Dynamic Scenes. Front Psychol 2020; 10:2877. [PMID: 31920893 PMCID: PMC6930810 DOI: 10.3389/fpsyg.2019.02877] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2019] [Accepted: 12/04/2019] [Indexed: 11/13/2022] Open
Abstract
Both low-level physical saliency and social information, as presented by human heads or bodies, are known to drive gaze behavior in free-viewing tasks. Researchers have previously made use of a great variety of face stimuli, ranging from photographs of real humans to schematic faces, frequently without systematically differentiating between the two. In the current study, we used a Generalized Linear Mixed Model (GLMM) approach to investigate to what extent schematic artificial faces can predict gaze when they are presented alone or in competition with real human faces. Relative differences in predictive power became apparent, while GLMMs suggest substantial effects for real and artificial faces in all conditions. Artificial faces were accordingly less predictive than real human faces but still contributed significantly to gaze allocation. These results help to further our understanding of how social information guides gaze in complex naturalistic scenes.
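The GLMM logic described, predicting gaze allocation from face type with participants as a random factor, can be approximated in Python with a linear mixed model on a per-trial dwell measure. A simplified sketch only: the original analysis modelled fixation data with a full GLMM, and every name and value below is a placeholder:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
rows = []
for subject in range(20):
    subj_offset = rng.normal(0, 0.5)               # random intercept per participant
    for _ in range(30):
        face = rng.choice(["artificial", "real"])
        dwell = 2.0 + (0.8 if face == "real" else 0.3) + subj_offset + rng.normal(0, 0.4)
        rows.append({"subject": subject, "face_type": face, "dwell": dwell})
df = pd.DataFrame(rows)

# Dwell time on faces ~ face type, with a random intercept for each subject
fit = smf.mixedlm("dwell ~ face_type", df, groups=df["subject"]).fit()
print(fit.summary())
```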
Collapse
Affiliation(s)
- Lara Rösler
- Department of Psychology, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
| | - Marius Rubo
- Department of Psychology, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
| | - Matthias Gamer
- Department of Psychology, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
| |
Collapse
|
44
|
Franchak JM. Visual exploratory behavior and its development. PSYCHOLOGY OF LEARNING AND MOTIVATION 2020. [DOI: 10.1016/bs.plm.2020.07.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
45
|
Startsev M, Agtzidis I, Dorr M. Characterizing and automatically detecting smooth pursuit in a large-scale ground-truth data set of dynamic natural scenes. J Vis 2019; 19:10. [DOI: 10.1167/19.14.10] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Affiliation(s)
- Mikhail Startsev
- Human-Machine Communication, Technical University of Munich, Munich, Germany
| | - Ioannis Agtzidis
- Human-Machine Communication, Technical University of Munich, Munich, Germany
| | - Michael Dorr
- Human-Machine Communication, Technical University of Munich, Munich, Germany
| |
Collapse
|
46
|
Transfer learning of deep neural network representations for fMRI decoding. J Neurosci Methods 2019; 328:108319. [DOI: 10.1016/j.jneumeth.2019.108319] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2019] [Revised: 06/06/2019] [Accepted: 06/17/2019] [Indexed: 11/22/2022]
|
47
|
Loschky LC, Larson AM, Smith TJ, Magliano JP. The Scene Perception & Event Comprehension Theory (SPECT) Applied to Visual Narratives. Top Cogn Sci 2019; 12:311-351. [PMID: 31486277 PMCID: PMC9328418 DOI: 10.1111/tops.12455] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2018] [Revised: 08/05/2019] [Accepted: 08/05/2019] [Indexed: 11/29/2022]
Abstract
Understanding how people comprehend visual narratives (including picture stories, comics, and film) requires the combination of traditionally separate theories that span the initial sensory and perceptual processing of complex visual scenes, the perception of events over time, and comprehension of narratives. Existing piecemeal approaches fail to capture the interplay between these levels of processing. Here, we propose the Scene Perception & Event Comprehension Theory (SPECT), as applied to visual narratives, which distinguishes between front‐end and back‐end cognitive processes. Front‐end processes occur during single eye fixations and are comprised of attentional selection and information extraction. Back‐end processes occur across multiple fixations and support the construction of event models, which reflect understanding of what is happening now in a narrative (stored in working memory) and over the course of the entire narrative (stored in long‐term episodic memory). We describe relationships between front‐ and back‐end processes, and medium‐specific differences that likely produce variation in front‐end and back‐end processes across media (e.g., picture stories vs. film). We describe several novel research questions derived from SPECT that we have explored. By addressing these questions, we provide greater insight into how attention, information extraction, and event model processes are dynamically coordinated to perceive and understand complex naturalistic visual events in narratives and the real world. Comprehension of visual narratives like comics, picture stories, and films involves both decoding the visual content and construing the meaningful events they represent. The Scene Perception & Event Comprehension Theory (SPECT) proposes a framework for understanding how a comprehender perceptually negotiates the surface of a visual representation and integrates its meaning into a growing mental model.
Collapse
Affiliation(s)
| | | | - Tim J Smith
- Department of Psychological Sciences, Birkbeck, University of London
| | | |
Collapse
|
48
|
Lai Q, Wang W, Sun H, Shen J. Video Saliency Prediction using Spatiotemporal Residual Attentive Networks. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2019; 29:1113-1126. [PMID: 31449021 DOI: 10.1109/tip.2019.2936112] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
This paper proposes a novel residual attentive learning network architecture for predicting dynamic eye-fixation maps. The proposed model emphasizes two essential issues, i.e., effective spatiotemporal feature integration and multi-scale saliency learning. For the first problem, appearance and motion streams are tightly coupled via dense residual cross connections, which integrate appearance information with multi-layer, comprehensive motion features in a residual and dense way. Beyond traditional two-stream models that learn appearance and motion features separately, such a design allows early, multi-path information exchange between different domains, leading to a unified and powerful spatiotemporal learning architecture. For the second, we propose a composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end. It is used to enhance the fused spatiotemporal features by emphasizing important features at multiple scales. A lightweight convolutional Gated Recurrent Unit (convGRU), which is flexible in small-training-data situations, is used for modeling long-term temporal characteristics. Extensive experiments over four benchmark datasets clearly demonstrate the advantage of the proposed video saliency model over other competitors and the effectiveness of each component of our network. Our code and all the results will be available at https://github.com/ashleylqx/STRA-Net.
Collapse
|
49
|
Sun M, Zhou Z, Hu Q, Wang Z, Jiang J. SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:2900-2911. [PMID: 29993731 DOI: 10.1109/tcyb.2018.2832053] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Data-driven saliency detection has attracted strong interest as a result of applying convolutional neural networks to the detection of eye fixations. Although a number of image-based salient object and fixation detection models have been proposed, video fixation detection still requires more exploration. Different from image analysis, motion and temporal information is a crucial factor affecting human attention when viewing video sequences. Although existing models based on local contrast and low-level features have been extensively researched, they failed to simultaneously consider interframe motion and temporal information across neighboring video frames, leading to unsatisfactory performance when handling complex scenes. To this end, we propose a novel and efficient video eye fixation detection model to improve the saliency detection performance. By simulating the memory mechanism and visual attention mechanism of human beings when watching a video, we propose a step-gained fully convolutional network by combining the memory information on the time axis with the motion information on the space axis while storing the saliency information of the current frame. The model is obtained through hierarchical training, which ensures the accuracy of the detection. Extensive experiments in comparison with 11 state-of-the-art methods are carried out, and the results show that our proposed model outperforms all 11 methods across a number of publicly available datasets.
Collapse
|
50
|
Williams LH, Drew T. What do we know about volumetric medical image interpretation?: a review of the basic science and medical image perception literatures. Cogn Res Princ Implic 2019; 4:21. [PMID: 31286283 PMCID: PMC6614227 DOI: 10.1186/s41235-019-0171-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Accepted: 05/19/2019] [Indexed: 11/26/2022] Open
Abstract
Interpretation of volumetric medical images represents a rapidly growing proportion of the workload in radiology. However, relatively little is known about the strategies that best guide search behavior when looking for abnormalities in volumetric images. Although there is extensive literature on two-dimensional medical image perception, it is an open question whether the conclusions drawn from these images can be generalized to volumetric images. Importantly, volumetric images have distinct characteristics (e.g., scrolling through depth, smooth-pursuit eye-movements, motion onset cues, etc.) that should be considered in future research. In this manuscript, we will review the literature on medical image perception and discuss relevant findings from basic science that can be used to generate predictions about expertise in volumetric image interpretation. By better understanding search through volumetric images, we may be able to identify common sources of error, characterize the optimal strategies for searching through depth, or develop new training and assessment techniques for radiology residents.
Collapse
|