1. Wang Y, Bace M, Bulling A. Scanpath Prediction on Information Visualisations. IEEE Transactions on Visualization and Computer Graphics 2024; 30:3902-3914. [PMID: 37022407] [DOI: 10.1109/tvcg.2023.3242293]
Abstract
We propose the Unified Model of Saliency and Scanpaths (UMSS), a model that learns to predict multi-duration saliency and scanpaths (i.e., sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during visual exploration, prior work has been limited to predicting aggregated attention statistics such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g., Title, Label, Data) on the popular MASSVIS dataset. We show that while gaze patterns are, overall, surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps and then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods on several widely used scanpath and saliency evaluation metrics, with a relative improvement in sequence score of 11.5% for scanpath prediction and a relative improvement in Pearson correlation coefficient of up to 23.6% for saliency prediction. These results are promising and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.
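A minimal sketch of the second stage described above: treating each predicted multi-duration saliency map as a probability distribution over image locations and sampling one fixation per map. The inhibition-of-return step and its width are illustrative assumptions, not the paper's exact sampling procedure.

```python
import numpy as np

def sample_scanpath(saliency_maps, rng=None, sigma_ior=10.0):
    """Sample one fixation per saliency map (one map per viewing duration).

    saliency_maps: list of 2D arrays, each a (possibly unnormalised)
    saliency map for one time step. An inhibition-of-return mask
    discourages refixating the same location (illustrative choice).
    """
    rng = rng or np.random.default_rng()
    h, w = saliency_maps[0].shape
    ys, xs = np.mgrid[0:h, 0:w]
    ior = np.ones((h, w))  # inhibition-of-return weights
    scanpath = []
    for smap in saliency_maps:
        p = np.clip(smap, 0, None) * ior
        p = p / p.sum()
        idx = rng.choice(h * w, p=p.ravel())
        fy, fx = divmod(idx, w)
        scanpath.append((fx, fy))
        # suppress a Gaussian neighbourhood around the chosen fixation
        ior *= 1.0 - np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2 * sigma_ior ** 2))
    return scanpath

# usage: three fake 64x64 maps standing in for predicted multi-duration saliency
maps = [np.random.rand(64, 64) for _ in range(3)]
print(sample_scanpath(maps))
```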
2. Wang Y, Li H, Jiang Q. Dynamically attentive viewport sequence for no-reference quality assessment of omnidirectional images. Front Neurosci 2022; 16:1022041. [PMID: 36507332] [PMCID: PMC9727405] [DOI: 10.3389/fnins.2022.1022041]
Abstract
Omnidirectional images (ODIs) have drawn great attention in virtual reality (VR) due to their capability of providing an immersive experience to users. However, ODIs are usually subject to various quality degradations during different processing stages, so quality assessment of ODIs is of critical importance to the VR community. The quality assessment of ODIs is quite different from that of traditional 2D images: existing image quality assessment (IQA) methods focus on extracting features from spherical scenes while ignoring how humans actually view an ODI, continuously browsing it through a head-mounted display (HMD), and they fail to characterize the temporal dynamics of the browsing process in terms of the temporal order of viewports. In this article, we resort to the law of gravity to detect the dynamically attentive regions of humans viewing ODIs, and propose a novel no-reference (NR) ODI quality evaluation method built on two components: the construction of a Dynamically Attentive Viewport Sequence (DAVS) from an ODI, and the extraction of Quality-Aware Features (QAFs) from the DAVS. The DAVS is a sequence of viewports that viewers are likely to explore, based on a prediction of the visual scanpath followed when freely exploring the ODI through an HMD within the exploration time; sampling a series of viewports from the ODI along the predicted scanpath yields a DAVS that contains only global motion, and the subsequent quality evaluation is performed on the DAVS alone. The QAF extraction aims to obtain feature representations that are highly discriminative with respect to perceived distortion and visual quality, and a regression model finally maps the extracted QAFs to a single predicted quality score. Experimental results on two datasets demonstrate that the proposed method delivers state-of-the-art performance.
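A minimal sketch of the DAVS construction, assuming an equirectangular ODI and a predicted scanpath given as (longitude, latitude) fixations in radians; the inverse gnomonic projection below is a standard viewport extraction, not necessarily the authors' exact implementation.

```python
import numpy as np

def extract_viewport(odi, lon0, lat0, fov_deg=90.0, size=224):
    """Cut one perspective viewport from an equirectangular ODI using the
    inverse gnomonic projection (nearest-neighbour sampling)."""
    H, W = odi.shape[:2]
    half = np.tan(np.radians(fov_deg) / 2)
    x, y = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(-half, half, size))
    rho = np.sqrt(x ** 2 + y ** 2) + 1e-12
    c = np.arctan(rho)
    lat = np.arcsin(np.clip(
        np.cos(c) * np.sin(lat0) + y * np.sin(c) * np.cos(lat0) / rho, -1.0, 1.0))
    lon = lon0 + np.arctan2(
        x * np.sin(c),
        rho * np.cos(lat0) * np.cos(c) - y * np.sin(lat0) * np.sin(c))
    # map sphere coordinates back to equirectangular pixel indices
    col = (((lon / (2 * np.pi) + 0.5) % 1.0) * W).astype(int) % W
    row = np.clip(((0.5 - lat / np.pi) * H).astype(int), 0, H - 1)
    return odi[row, col]

def build_davs(odi, scanpath, **kw):
    """One viewport per predicted fixation (lon, lat) along the scanpath."""
    return [extract_viewport(odi, lon, lat, **kw) for lon, lat in scanpath]

# usage with a fake ODI and a three-fixation predicted scanpath (radians)
odi = np.random.rand(512, 1024, 3)
davs = build_davs(odi, [(0.0, 0.0), (0.5, 0.1), (1.0, -0.2)])
```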
Affiliation(s)
- Yuhong Wang
- School of Information Science and Engineering, Ningbo University, Ningbo, China; College of Science and Technology, Ningbo University, Ningbo, China
- Hong Li
- College of Science and Technology, Ningbo University, Ningbo, China (Correspondence: Hong Li)
- Qiuping Jiang
- School of Information Science and Engineering, Ningbo University, Ningbo, China
3. Martin D, Serrano A, Bergman AW, Wetzstein G, Masia B. ScanGAN360: A Generative Model of Realistic Scanpaths for 360° Images. IEEE Transactions on Visualization and Computer Graphics 2022; 28:2003-2013. [PMID: 35167469] [DOI: 10.1109/tvcg.2022.3150502]
Abstract
Understanding and modeling the dynamics of human gaze behavior in 360° environments is crucial for creating, improving, and developing emerging virtual reality applications. However, recruiting human observers and acquiring enough data to analyze their behavior when exploring virtual environments requires complex hardware and software setups and can be time-consuming. Being able to generate virtual observers can help overcome this limitation, and thus stands as an open problem in this medium. In particular, generative adversarial approaches could alleviate this challenge by generating large numbers of scanpaths that reproduce human behavior when observing new scenes, essentially mimicking virtual observers. However, existing methods for scanpath generation do not adequately predict realistic scanpaths for 360° images. We present ScanGAN360, a new generative adversarial approach to address this problem. We propose a novel loss function based on dynamic time warping and tailor our network to the specifics of 360° images. The quality of our generated scanpaths outperforms competing approaches by a large margin and is almost on par with the human baseline. ScanGAN360 allows fast simulation of large numbers of virtual observers whose behavior mimics real users, enabling a better understanding of gaze behavior, facilitating experimentation, and aiding novel applications in virtual reality and beyond.
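The dynamic-time-warping idea behind the loss can be illustrated with a plain DTW distance between two spherical scanpaths. The paper uses a differentiable variant inside the GAN loss; the great-circle point distance here is our assumption for 360° data.

```python
import numpy as np

def sphere_dist(p, q):
    """Great-circle distance between two (lon, lat) fixations in radians."""
    dlon = p[0] - q[0]
    return np.arccos(np.clip(
        np.sin(p[1]) * np.sin(q[1]) + np.cos(p[1]) * np.cos(q[1]) * np.cos(dlon),
        -1.0, 1.0))

def dtw(a, b, dist=sphere_dist):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between scanpaths."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# usage: two short scanpaths on the sphere, (lon, lat) in radians
real = [(0.0, 0.0), (0.4, 0.1), (0.9, -0.1)]
fake = [(0.1, 0.0), (0.5, 0.2), (1.0, -0.2), (1.2, -0.1)]
print(dtw(real, fake))
```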
4.
Abstract
By and large, the remarkable progress in visual object recognition in the last few years has been fueled by the availability of huge amounts of labelled data paired with powerful, bespoke computational resources. This has opened the door to the massive use of deep learning, which has led to remarkable improvements on new challenging benchmarks. While acknowledging this point of view, in this paper I claim that the time has come to begin working towards a deeper understanding of visual computational processes that, instead of being regarded as applications of general-purpose machine learning algorithms, are likely to require tailored learning schemes. A major claim of this paper is that current approaches to object recognition lead to facing a problem that is significantly more difficult than the one offered by nature, because current learning algorithms work on images in isolation and neglect the crucial role of temporal coherence. Starting from this remark, the paper raises ten questions concerning visual computational processes that might contribute to better solutions to a number of challenging computer vision tasks. While this paper is far from being able to provide answers to those questions, it contains insights that might stimulate an in-depth re-thinking of object perception, while suggesting research directions in the control of object-directed action.
5. Betti A, Boccignone G, Faggi L, Gori M, Melacci S. Visual Features and Their Own Optical Flow. Front Artif Intell 2021; 4:768516. [PMID: 34927064] [PMCID: PMC8672218] [DOI: 10.3389/frai.2021.768516]
Abstract
Symmetries, invariances, and conservation equations have always been an invaluable guide in science for modeling natural phenomena through simple yet effective relations. For instance, in computer vision, translation equivariance is typically a built-in property of the neural architectures used to solve visual tasks; networks with computational layers implementing such a property are known as Convolutional Neural Networks (CNNs). This kind of mathematical symmetry, like many others studied recently, is typically generated by some underlying group of transformations (translations in the case of CNNs, rotations, etc.) and is particularly suitable for processing highly structured data, such as molecules or chemical compounds, known to possess those specific symmetries. When dealing with video streams, common built-in equivariances can handle only a small fraction of the broad spectrum of transformations encoded in the visual stimulus, and the corresponding neural architectures therefore have to resort to a huge amount of supervision in order to achieve good generalization capabilities. In this paper we formulate a theory on the development of visual features based on the idea that movement itself provides trajectories on which to impose consistency. We introduce the principle of Material Point Invariance, which states that each visual feature is invariant with respect to the associated optical flow, so that features and corresponding velocities are an indissoluble pair. We then discuss the interaction of features and velocities and show that certain motion invariance traits can be regarded as a generalization of the classical concept of affordance. These analyses of feature-velocity interactions and their invariance properties lead to a visual field theory that expresses the dynamical constraints of motion coherence and might lead to the discovery of the joint evolution of visual features along with the associated optical flows.
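Under one natural reading of the Material Point Invariance principle stated above (the notation below is ours, not the paper's), the feature field is transported by the optical flow, i.e. its material derivative vanishes:

```latex
% Material Point Invariance as a transport (material-derivative) constraint:
% the feature value carried by a moving point does not change along its trajectory.
\frac{D\varphi}{Dt}
  \;=\; \frac{\partial \varphi}{\partial t} \;+\; v \cdot \nabla \varphi \;=\; 0,
\qquad \text{i.e.}\qquad
\varphi\big(x(t), t\big) = \text{const}
\ \text{whenever}\ \dot{x}(t) = v\big(x(t), t\big).
```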
Affiliation(s)
- Alessandro Betti
- Department of Information Engineering and Mathematics, Università degli Studi di Siena, Siena, Italy
- Giuseppe Boccignone
- PHuSe Lab, Department of Computer Science, Università degli Studi di Milano, Milan, Italy
- Lapo Faggi
- Department of Information Engineering and Mathematics, Università degli Studi di Siena, Siena, Italy; Department of Information Engineering, Università degli Studi di Firenze, Firenze, Italy
- Marco Gori
- Department of Information Engineering and Mathematics, Università degli Studi di Siena, Siena, Italy; Université Côte d'Azur, Inria, CNRS, I3S, Maasai, Sophia-Antipolis, France
- Stefano Melacci
- Department of Information Engineering and Mathematics, Università degli Studi di Siena, Siena, Italy
6. Xia C, Han J, Zhang D. Evaluation of Saccadic Scanpath Prediction: Subjective Assessment Database and Recurrent Neural Network Based Metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:4378-4395. [PMID: 32750785] [DOI: 10.1109/tpami.2020.3002168]
Abstract
In recent years, predicting the saccadic scanpaths of humans has become a new trend in the field of visual attention modeling. Given the variety of saccadic algorithms, determining how to evaluate their ability to model a dynamic saccade has become an important yet understudied issue. To the best of our knowledge, existing metrics for evaluating saccadic prediction models are often heuristically designed, which may produce results inconsistent with human subjective assessment. To this end, we first construct a subjective database by collecting assessments of 5,000 pairs of scanpaths from ten subjects. Based on this database, we can compare different metrics according to their consistency with human visual perception. In addition, we propose a data-driven metric that measures scanpath similarity based on human subjective comparison: we employ a long short-term memory (LSTM) network to infer a binary similarity measurement from the relationship between encoded scanpaths. Experimental results demonstrate that the LSTM-based metric outperforms existing metrics. Moreover, we believe the constructed database can serve as a benchmark to inspire further insights for future metric selection.
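A minimal sketch of such a learned metric, assuming PyTorch: a shared LSTM encodes each scanpath and a small head maps the pair of codes to a binary similarity logit. Dimensions, pairing scheme, and training details are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ScanpathSimilarity(nn.Module):
    """Shared LSTM encoder + MLP head scoring how similar two scanpaths are."""
    def __init__(self, in_dim=2, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # logit: probability the pair is judged similar

    def encode(self, path):          # path: (B, T, 2) fixation coordinates
        _, (h, _) = self.encoder(path)
        return h[-1]                 # last-layer hidden state as the scanpath code

    def forward(self, path_a, path_b):
        code = torch.cat([self.encode(path_a), self.encode(path_b)], dim=-1)
        return self.head(code).squeeze(-1)

# usage: a batch of 4 scanpath pairs, 12 fixations each,
# trained against binary human judgments with BCEWithLogitsLoss
model = ScanpathSimilarity()
a, b = torch.rand(4, 12, 2), torch.rand(4, 12, 2)
labels = torch.tensor([1., 0., 1., 0.])
loss = nn.BCEWithLogitsLoss()(model(a, b), labels)
```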
7. D'Amelio A, Boccignone G. Gazing at Social Interactions Between Foraging and Decision Theory. Front Neurorobot 2021; 15:639999. [PMID: 33859558] [PMCID: PMC8042312] [DOI: 10.3389/fnbot.2021.639999]
Abstract
Finding the underlying principles of social attention in humans seems to be essential for the design of interaction between natural and artificial agents. Here, we focus on the computational modeling of gaze dynamics as exhibited by humans when perceiving socially relevant multimodal information. The audio-visual landscape of social interactions is distilled into a number of multimodal patches that convey different social values, and we work within the general framework of foraging as a tradeoff between local patch exploitation and landscape exploration. We show that the spatio-temporal dynamics of gaze shifts can be parsimoniously described by Langevin-type stochastic differential equations triggering a decision equation over time. In particular, value-based patch choice and handling reduce to a simple multi-alternative perceptual decision-making process that relies on a race-to-threshold between independent continuous-time perceptual evidence integrators, one integrator per patch.
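A minimal sketch of the race-to-threshold mechanism described above: independent drift-diffusion integrators, one per patch, accumulate noisy evidence until the first crosses a common bound. Drift rates, noise level, and threshold are illustrative assumptions.

```python
import numpy as np

def race_to_threshold(drifts, threshold=1.0, noise=0.3, dt=1e-3, rng=None):
    """Simulate independent evidence integrators racing to a common bound.

    drifts: per-patch evidence rates (e.g. proportional to social value).
    Returns (winning patch index, decision time in seconds).
    Euler-Maruyama integration of dX_i = drift_i dt + noise dW_i.
    """
    rng = rng or np.random.default_rng()
    drifts = np.asarray(drifts, dtype=float)
    x = np.zeros(len(drifts))
    t = 0.0
    while np.all(x < threshold):
        x += drifts * dt + noise * np.sqrt(dt) * rng.standard_normal(len(drifts))
        t += dt
    return int(np.argmax(x)), t

# usage: three patches, the second carrying the highest social value
winner, rt = race_to_threshold([0.4, 1.2, 0.7])
print(f"patch {winner} chosen after {rt:.3f}s")
```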
Affiliation(s)
- Alessandro D'Amelio
- PHuSe Lab, Department of Computer Science, Università degli Studi di Milano, Milan, Italy
- Giuseppe Boccignone
- PHuSe Lab, Department of Computer Science, Università degli Studi di Milano, Milan, Italy
8. Gravitational models explain shifts on human visual attention. Sci Rep 2020; 10:16335. [PMID: 33005008] [PMCID: PMC7530662] [DOI: 10.1038/s41598-020-73494-2]
Abstract
Visual attention refers to the human brain's ability to select relevant sensory information for preferential processing, improving performance in visual and cognitive tasks. It proceeds in two phases: one in which visual feature maps are acquired and processed in parallel, and another in which the information from these maps is merged in order to select a single location to be attended for further, more complex computations and reasoning. Its computational description is challenging, especially if the temporal dynamics of the process are taken into account. Numerous methods to estimate saliency have been proposed in the last three decades. They achieve almost perfect performance in estimating saliency at the pixel level, but the way they generate shifts in visual attention fully depends on winner-take-all (WTA) circuitry, which biological hardware implements in order to select the location of maximum saliency towards which to direct overt attention. In this paper we propose a gravitational model of attentional shifts: every single feature acts as an attractor, and shifts result from the joint effects of the attractors. In this framework, the assumption of a single, centralized saliency map is no longer necessary, though it remains plausible. Quantitative results on two large image datasets show that this model predicts shifts more accurately than winner-take-all.
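A minimal sketch of the gravitational picture described above: salient feature locations act as point masses attracting the gaze, and the fixation trajectory follows the resulting field. The masses, softening term, and damping are illustrative choices rather than the paper's exact formulation.

```python
import numpy as np

def gravitational_shifts(attractors, masses, x0, steps=2000, dt=1e-2,
                         damping=0.8, eps=1e-3):
    """Integrate gaze dynamics under feature 'gravity'.

    attractors: (N, 2) feature positions; masses: (N,) feature strengths.
    Acceleration sums m_i * (r_i - x) / (|r_i - x|^3 + eps); the damping
    term keeps the trajectory from orbiting forever (illustrative choice).
    """
    attractors = np.asarray(attractors, dtype=float)
    masses = np.asarray(masses, dtype=float)
    x = np.asarray(x0, dtype=float)
    v = np.zeros(2)
    path = [x.copy()]
    for _ in range(steps):
        d = attractors - x                        # vectors toward each attractor
        r3 = np.linalg.norm(d, axis=1) ** 3 + eps
        a = (masses[:, None] * d / r3[:, None]).sum(axis=0) - damping * v
        v += a * dt
        x += v * dt
        path.append(x.copy())
    return np.array(path)

# usage: three feature attractors of unequal strength, gaze starting at centre
attractors = np.array([[0.2, 0.8], [0.7, 0.3], [0.9, 0.9]])
path = gravitational_shifts(attractors, [1.0, 2.0, 0.5], x0=[0.5, 0.5])
print(path[-1])  # final gaze position
```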