1. Ndayikengurukiye D, Mignotte M. CoSOV1Net: A Cone- and Spatial-Opponent Primary Visual Cortex-Inspired Neural Network for Lightweight Salient Object Detection. Sensors (Basel) 2023; 23:6450. [PMID: 37514744] [PMCID: PMC10386563] [DOI: 10.3390/s23146450]
Abstract
Salient object-detection models attempt to mimic the human visual system's ability to select relevant objects in images. To this end, deep neural networks developed on high-end computers have recently achieved high performance. However, developing deep neural network models with the same performance for resource-limited vision sensors or mobile devices remains a challenge. In this work, we propose CoSOV1Net, a novel lightweight salient object-detection neural network model, inspired by the cone- and spatial-opponent processes of the primary visual cortex (V1), which inextricably link color and shape in human color perception. Our proposed model is trained from scratch, without using backbones from image classification or other tasks. Experiments on the most widely used and challenging datasets for salient object detection show that CoSOV1Net achieves competitive performance (i.e., Fβ = 0.931 on the ECSSD dataset) with state-of-the-art salient object-detection models, while having a low number of parameters (1.14 M), low FLOPs (1.4 G) and high FPS (211.2) on GPU (Nvidia GeForce RTX 3090 Ti) compared to state-of-the-art lightweight and non-lightweight salient object-detection models. CoSOV1Net is thus a lightweight salient object-detection model that can be adapted to mobile environments and resource-constrained devices.
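A minimal sketch of the kind of cone-opponent plus center-surround (spatial-opponent) front end the abstract describes, assuming nothing about the actual CoSOV1Net layers; the channel definitions and Gaussian sigmas below are illustrative only:

```python
# Cone-opponent (red-green, blue-yellow) channels followed by a
# difference-of-Gaussians "spatial-opponent" stage; NOT the CoSOV1Net
# architecture, just an illustration of the V1-inspired idea.
import numpy as np
from scipy.ndimage import gaussian_filter

def opponent_channels(rgb):
    """rgb: float array in [0, 1] of shape (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = (r + g) / 2.0          # crude "yellow" channel
    rg = r - g                 # cone-opponent red-green
    by = b - y                 # cone-opponent blue-yellow
    return rg, by

def center_surround(channel, sigma_c=1.0, sigma_s=4.0):
    """Difference of Gaussians approximating a spatial-opponent receptive field."""
    return gaussian_filter(channel, sigma_c) - gaussian_filter(channel, sigma_s)

rgb = np.random.rand(64, 64, 3)  # stand-in for a real image
rg, by = opponent_channels(rgb)
maps = [np.abs(center_surround(c)) for c in (rg, by)]
print(maps[0].shape)             # (64, 64)
```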
Affiliation(s)
- Didier Ndayikengurukiye, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montreal, QC H3C 3J7, Canada
- Max Mignotte, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montreal, QC H3C 3J7, Canada

2. Zhou W, Zhu Y, Lei J, Yang R, Yu L. LSNet: Lightweight Spatial Boosting Network for Detecting Salient Objects in RGB-Thermal Images. IEEE Trans Image Process 2023; PP:1329-1340. [PMID: 37022901] [DOI: 10.1109/tip.2023.3242775]
Abstract
Most recent methods for RGB (red-green-blue)-thermal salient object detection (SOD) involve many floating-point operations and numerous parameters, resulting in slow inference, especially on common processors, which impedes their deployment on mobile devices for practical applications. To address these problems, we propose a lightweight spatial boosting network (LSNet) for efficient RGB-thermal SOD with a lightweight MobileNetV2 backbone that replaces a conventional backbone (e.g., VGG, ResNet). To improve feature extraction with a lightweight backbone, we propose a boundary boosting algorithm that optimizes the predicted saliency maps and reduces information collapse in low-dimensional features. The algorithm generates boundary maps based on the predicted saliency maps without incurring additional calculations or complexity. As multimodality processing is essential for high-performance SOD, we adopt attentive feature distillation and selection, and propose semantic and geometric transfer learning to enhance the backbone without increasing complexity during testing. Experimental results demonstrate that the proposed LSNet achieves state-of-the-art performance compared with 14 RGB-thermal SOD methods on three datasets while reducing floating-point operations (1.025 G) and parameters (5.39 M), shrinking model size (22.1 MB), and improving inference speed (9.95 fps for PyTorch with batch size 1 on an Intel i5-7500 processor; 93.53 fps for PyTorch with batch size 1 on an NVIDIA TITAN V graphics processor; 936.68 fps for PyTorch with batch size 20 on the graphics processor; 538.01 fps for TensorRT with batch size 1; and 903.01 fps for TensorRT/FP16 with batch size 1). The code and results are available at https://github.com/zyrant/LSNet.
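As a rough illustration of how a boundary map can be derived from a predicted saliency map at negligible cost (the paper's boundary boosting algorithm itself is not reproduced here), a morphological gradient does the job:

```python
# Hedged sketch: boundary map as a morphological gradient of the saliency map.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def boundary_map(saliency, size=3):
    """saliency: float array in [0, 1]; returns edges of the salient region."""
    return maximum_filter(saliency, size) - minimum_filter(saliency, size)

sal = np.zeros((32, 32)); sal[8:24, 8:24] = 1.0  # toy saliency prediction
print(boundary_map(sal).max())                    # 1.0 along the square's rim
```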

3. Huang K, Tian C, Su J, Lin JCW. Transformer-based Cross Reference Network for video salient object detection. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.06.006]

4. Liu Y, Zhang D, Zhang Q, Han J. Part-Object Relational Visual Saliency. IEEE Trans Pattern Anal Mach Intell 2022; 44:3688-3704. [PMID: 33481705] [DOI: 10.1109/tpami.2021.3053577]
Abstract
Recent years have witnessed a big leap in automatic visual saliency detection attributed to advances in deep learning, especially Convolutional Neural Networks (CNNs). However, inferring the saliency of each image part separately, as adopted by most CNN-based methods, inevitably leads to an incomplete segmentation of the salient object. In this paper, we describe how to use the part-object relations endowed by the Capsule Network (CapsNet) to solve problems that fundamentally hinge on relational inference for visual saliency detection. Concretely, we put in place a two-stream strategy, termed the Two-Stream Part-Object RelaTional Network (TSPORTNet), to implement CapsNet, aiming to reduce both the network complexity and the possible redundancy during capsule routing. Additionally, taking into account the correlations of capsule types across the preceding training images, a correlation-aware capsule routing algorithm is developed for more accurate capsule assignments at the training stage, which also speeds up training dramatically. By exploring part-object relationships, TSPORTNet produces a capsule wholeness map, which in turn aids multi-level features in generating the final saliency map. Experimental results on five widely used benchmarks show that our framework consistently achieves state-of-the-art performance. The code can be found at https://github.com/liuyi1989/TSPORTNet.

5. DeepRare: Generic Unsupervised Visual Attention Models. Electronics 2022. [DOI: 10.3390/electronics11111696]
Abstract
Visual attention selects data considered as “interesting” by humans, and it is modeled in the field of engineering by feature-engineered methods that find contrasted/surprising/unusual image data. Deep learning drastically improved model efficiency on the main benchmark datasets. However, Deep Neural Network-based (DNN-based) models are counterintuitive here: surprising or unusual data are by definition difficult to learn because of their low occurrence probability. In practice, DNN-based models mainly learn top-down features such as faces, text, people, or animals, which usually attract human attention, but they are inefficient at extracting surprising or unusual data in images. In this article, we propose a new family of visual attention models called DeepRare, and especially DeepRare2021 (DR21), which combines the power of DNN feature extraction with the genericity of feature-engineered algorithms. This algorithm is an evolution of a previous version, DeepRare2019 (DR19), based on the same common framework. DR21 (1) needs no training beyond the default ImageNet training, (2) is fast even on CPU, and (3) is tested on four very different eye-tracking datasets, showing that it is generic and always among the top models on all datasets and metrics, a regularity and genericity no other model exhibits. Finally, DR21 (4) is tested with several network architectures such as VGG16 (V16), VGG19 (V19), and MobileNetV2 (MN2), and (5) it provides explanation and transparency on which parts of the image are the most surprising at different levels, despite the use of a DNN-based feature extractor.

6. Ndayikengurukiye D, Mignotte M. Salient Object Detection by LTP Texture Characterization on Opposing Color Pairs under SLICO Superpixel Constraint. J Imaging 2022; 8:110. [PMID: 35448237] [PMCID: PMC9027508] [DOI: 10.3390/jimaging8040110]
Abstract
The effortless detection of salient objects by humans has been the subject of research in several fields, including computer vision, as it has many applications. However, salient object detection remains a challenge for many computer models dealing with color and textured images. Most of them process color and texture separately and therefore implicitly consider them as independent features, which is not the case in reality. Herein, we propose a novel and efficient strategy, through a simple model almost without internal parameters, which generates a robust saliency map for a natural image. This strategy consists of integrating color information into local textural patterns to characterize a color micro-texture. It is the simple, yet powerful LTP (Local Ternary Patterns) texture descriptor applied to opposing color pairs of a color space that allows us to achieve this end. Each color micro-texture is represented by a vector whose components come from a superpixel obtained by the SLICO (Simple Linear Iterative Clustering with zero parameter) algorithm, which is simple, fast and exhibits state-of-the-art boundary adherence. The degree of dissimilarity between each pair of color micro-textures is computed by the FastMap method, a fast version of MDS (Multi-dimensional Scaling) that considers the color micro-textures' non-linearity while preserving their distances. These degrees of dissimilarity give us an intermediate saliency map for each RGB (Red–Green–Blue), HSL (Hue–Saturation–Luminance), LUV (L for luminance, U and V representing chromaticity values) and CMY (Cyan–Magenta–Yellow) color space. The final saliency map is their combination, to take advantage of the strength of each of them. The MAE (Mean Absolute Error), MSE (Mean Squared Error) and Fβ measures of our saliency maps on the five most-used datasets show that our model outperforms several state-of-the-art models. Being simple and efficient, our model could be combined with classic models using color contrast for better performance.
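A hedged sketch of the LTP descriptor applied to one opposing color pair; the threshold t, the 8-neighborhood, and the red-green pair below are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def ltp_codes(channel, t=5):
    """Local Ternary Patterns for one (opponent-color) channel.
    Returns the upper/lower binary codes per pixel (borders excluded)."""
    h, w = channel.shape
    c = channel[1:-1, 1:-1].astype(np.int32)
    offsets = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    upper = np.zeros_like(c)
    lower = np.zeros_like(c)
    for k, (dy, dx) in enumerate(offsets):
        n = channel[1+dy:h-1+dy, 1+dx:w-1+dx].astype(np.int32)
        upper |= ((n >= c + t).astype(np.int32) << k)  # ternary +1 bucket
        lower |= ((n <= c - t).astype(np.int32) << k)  # ternary -1 bucket
    return upper, lower

img = (np.random.rand(16, 16, 3) * 255).astype(np.int32)
rg = img[..., 0] - img[..., 1]   # one opposing color pair (illustrative)
up, lo = ltp_codes(rg)
print(up.shape, lo.shape)        # (14, 14) (14, 14)
```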

7. RGB-T salient object detection via CNN feature and result saliency map fusion. Appl Intell 2022. [DOI: 10.1007/s10489-021-02984-1]

8. Video saliency aware intelligent HD video compression with the improvement of visual quality and the reduction of coding complexity. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06895-1]

9. Review of Visual Saliency Prediction: Development Process from Neurobiological Basis to Deep Models. Appl Sci (Basel) 2021. [DOI: 10.3390/app12010309]
Abstract
The human attention mechanism can be understood and simulated by closely associating the saliency prediction task with neuroscience and psychology. Furthermore, saliency prediction is widely used in computer vision and interdisciplinary subjects. In recent years, with the rapid development of deep learning, deep models have made remarkable achievements in saliency prediction. Deep learning models can automatically learn features, thus overcoming many drawbacks of the classic models, such as handcrafted features and task settings, among others. Nevertheless, deep models still have some limitations, for example in tasks involving multi-modality and semantic understanding. This study summarizes the relevant achievements in the field of saliency prediction, including the early neurological and psychological mechanisms and the guiding role of classic models, followed by the development process and data comparison of classic and deep saliency prediction models. This study also discusses the relationship between the models and human vision, the factors that cause semantic gaps, the influence of attention in cognitive research, the limitations of saliency models, and emerging applications, to provide guidance and advice for follow-up work.

10. Video Compression for Screen Recorded Sequences Following Eye Movements. J Signal Process Syst 2021; 93:1457-1465. [PMID: 34840642] [PMCID: PMC8610366] [DOI: 10.1007/s11265-021-01719-2]
Abstract
With the advent of smartphones and tablets, video traffic on the Internet has increased enormously. With this in mind, the High Efficiency Video Coding (HEVC) standard was released in 2013 with the aim of reducing the bit rate (at the same quality) by 50% with respect to its predecessor. However, new content with greater resolutions and requirements appears every day, making it necessary to reduce the bit rate further. Perceptual video coding has recently been recognized as a promising approach to achieving high-performance video compression, and eye-tracking data can be used to create and verify these models. In this paper, we present a new algorithm for bit rate reduction of screen-recorded sequences based on the visual perception of videos. An eye-tracking system is used during the recording to locate the fixation point of the viewer. The area around that point is encoded with the base quantization parameter (QP) value, and the QP increases with distance from the fixation point. The results show that up to 31.3% of the bit rate may be saved when compared with the original HEVC-encoded sequence, without a significant impact on perceived quality.
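A toy sketch of the described QP assignment, with the base QP at the fixation point and increasing QP with distance; base_qp, step and max_offset are invented illustrative parameters, not the paper's values:

```python
import numpy as np

def qp_map(h_ctu, w_ctu, fix_rc, base_qp=27, step=2, max_offset=10):
    """Per-block QP: base_qp at the fixation point, growing with distance."""
    rows, cols = np.mgrid[0:h_ctu, 0:w_ctu]
    dist = np.hypot(rows - fix_rc[0], cols - fix_rc[1])
    offset = np.minimum((step * dist).round(), max_offset)
    return (base_qp + offset).astype(int)

print(qp_map(8, 12, fix_rc=(3, 5)))  # small QPs near (3, 5), larger far away
```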

11. First Trimester Gaze Pattern Estimation Using Stochastic Augmentation Policy Search for Single Frame Saliency Prediction. Medical Image Understanding and Analysis: 25th Annual Conference, MIUA 2021, Oxford, United Kingdom, July 12-14, 2021, Proceedings. 2021. [PMID: 34476423] [DOI: 10.1007/978-3-030-80432-9_28]
Abstract
While performing an ultrasound (US) scan, sonographers direct their gaze at regions of interest to verify that the correct plane is acquired and to interpret the acquisition frame. Predicting sonographer gaze on US videos is useful for identifying spatio-temporal patterns that are important for US scanning. This paper investigates utilizing sonographer gaze, in the form of gaze-tracking data, in a multimodal imaging deep learning framework to assist the analysis of the first trimester fetal ultrasound scan. Specifically, we propose an encoder-decoder convolutional neural network with skip connections to predict the visual gaze for each frame, using 115 first trimester ultrasound videos: 29,250 video frames for training, 7,290 for validation and 9,126 for testing. We find that a dataset of our size benefits from automated data augmentation, which in turn alleviates model overfitting and reduces the structural variation imbalance of US anatomical views between the training and test datasets. Specifically, we employ a stochastic augmentation policy search method to improve performance. Using the learnt policies, our models outperform the baseline: KLD, SIM, NSS and CC (2.16, 0.27, 4.34 and 0.39 versus 3.17, 0.21, 2.92 and 0.28).

12. Wu Z, Su L, Huang Q. Decomposition and Completion Network for Salient Object Detection. IEEE Trans Image Process 2021; 30:6226-6239. [PMID: 34242166] [DOI: 10.1109/tip.2021.3093380]
Abstract
Recently, fully convolutional networks (FCNs) have made great progress in the task of salient object detection, and existing state-of-the-art methods mainly focus on how to integrate edge information in deep aggregation models. In this paper, we propose a novel Decomposition and Completion Network (DCN), which integrates edge and skeleton as complementary information and models the integrity of salient objects in two stages. In the decomposition network, we propose a cross multi-branch decoder, which iteratively takes advantage of cross-task aggregation and cross-layer aggregation to integrate multi-level multi-task features and predict saliency, edge, and skeleton maps simultaneously. In the completion network, edge and skeleton maps are further utilized to fill flaws and suppress noise in saliency maps via hierarchical structure-aware feature learning and multi-scale feature completion. Through joint learning with edge and skeleton information for localizing boundaries and interiors of salient objects respectively, the proposed network generates precise saliency maps with uniformly and completely segmented salient objects. Experiments conducted on five benchmark datasets demonstrate that the proposed model outperforms existing networks. Furthermore, we extend the proposed model to the task of RGB-D salient object detection, where it also achieves state-of-the-art performance. The code is available at https://github.com/wuzhe71/DCN.

13. Wang W, Shen J, Lu X, Hoi SCH, Ling H. Paying Attention to Video Object Pattern Understanding. IEEE Trans Pattern Anal Mach Intell 2021; 43:2413-2428. [PMID: 31940522] [DOI: 10.1109/tpami.2020.2966453]
Abstract
This paper conducts a systematic study on the role of visual attention in video object pattern understanding. We elaborately annotate three popular video segmentation datasets (DAVIS 16, Youtube-Objects, and SegTrack V2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting. For the first time, we quantitatively verify the high consistency of visual attention behavior among human observers, and find a strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. These novel observations provide in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution enjoys three major advantages: 1) modular training without using expensive video segmentation annotations; instead, more affordable dynamic fixation data are used to train the initial video attention module, and existing fixation-segmentation paired static/image data are used to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance compared with the state of the art and enjoys fast processing speed (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS.

14.

15. Unsupervised foveal vision neural architecture with top-down attention. Neural Netw 2021; 141:145-159. [PMID: 33901879] [DOI: 10.1016/j.neunet.2021.03.003]
Abstract
Deep learning architectures are an extremely powerful tool for recognizing and classifying images. However, they require supervised learning, normally work on vectors of the size of the image pixels, and produce the best results when trained on millions of object images. To help mitigate these issues, we propose an end-to-end architecture that fuses bottom-up saliency and top-down attention with an object recognition module to focus on relevant data and learn important features that can later be fine-tuned for a specific task, employing only unsupervised learning. In addition, by utilizing a virtual fovea that focuses on relevant portions of the data, the training speed can be greatly improved. We test the performance of the proposed Gamma saliency technique on the Toronto and CAT 2000 databases, and the foveated vision in the large Street View House Numbers (SVHN) database. The results with foveated vision show that Gamma saliency performs at the same level as the best alternative algorithms while being computationally faster. The results on SVHN show that our unsupervised cognitive architecture is comparable to fully supervised methods and that saliency also improves CNN performance if desired. Finally, we develop and test a top-down attention mechanism based on the Gamma saliency applied to the top layer of CNNs to facilitate scene understanding in multi-object cluttered images. We show that the extra information from top-down saliency is capable of speeding up the extraction of digits in the cluttered multi-digit MNIST dataset, corroborating the important role of top-down attention.

16.

17. Wang W, Shen J, Xie J, Cheng MM, Ling H, Borji A. Revisiting Video Saliency Prediction in the Deep Learning Era. IEEE Trans Pattern Anal Mach Intell 2021; 43:220-237. [PMID: 31247542] [DOI: 10.1109/tpami.2019.2924417]
Abstract
Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF sports. An attribute-based analysis of previous saliency models and cross-dataset generalization are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.

18. Bi H, Yang L, Zhu H, Lu D, Jiang J. STEG-Net: Spatio-Temporal Edge Guidance Network for Video Salient Object Detection. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2021.3078824]

19. Visual fixation prediction with incomplete attention map based on brain storm optimization. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106653]

20. Hülemeier AG, Lappe M. Combining biological motion perception with optic flow analysis for self-motion in crowds. J Vis 2020; 20:7. [PMID: 32902593] [PMCID: PMC7488621] [DOI: 10.1167/jov.20.9.7]
Abstract
Heading estimation from optic flow relies on the assumption that the visual world is rigid. This assumption is violated when one moves through a crowd of people, a common and socially important situation. The motion of people in the crowd contains cues to their translation in the form of the articulation of their limbs, known as biological motion. We investigated how translation and articulation of biological motion influence heading estimation from optic flow for self-motion in a crowd. Participants had to estimate their heading during simulated self-motion toward a group of walkers who collectively walked in a single direction. We found that the natural combination of translation and articulation produces surprisingly small heading errors. In contrast, experimental conditions that present either only translation or only articulation produced strong idiosyncratic biases. These individual biases explained the variance in the natural combination well. A second experiment showed that the benefit of articulation and the bias produced by articulation were specific to biological motion. An analysis of the differences in biases between conditions and participants showed that different perceptual mechanisms contribute to heading perception in crowds. We suggest that coherent group motion affects the reference frame of heading perception from optic flow.
Affiliation(s)
- Markus Lappe, Department of Psychology, University of Münster, Münster, Germany

21. Jiang L, Xu M, Wang Z, Sigal L. DeepVS2.0: A Saliency-Structured Deep Learning Method for Predicting Dynamic Visual Attention. Int J Comput Vis 2020. [DOI: 10.1007/s11263-020-01371-6]

22. Moroto Y, Maeda K, Ogawa T, Haseyama M. Few-Shot Personalized Saliency Prediction Based on Adaptive Image Selection Considering Object and Visual Attention. Sensors (Basel) 2020; 20:E2170. [PMID: 32290495] [PMCID: PMC7218730] [DOI: 10.3390/s20082170]
Abstract
A few-shot personalized saliency prediction method based on adaptive image selection considering objects and visual attention is presented in this paper. Since general methods for predicting personalized saliency maps (PSMs) need a large number of training images, an approach that works with a small number of training images is needed. Finding persons whose visual attention is similar to that of a target person is effective for this, but it requires all persons to gaze at many images in common, which is burdensome and unrealistic. Instead, this paper introduces a novel adaptive image selection (AIS) scheme that focuses on the relationship between human visual attention and objects in images. AIS considers both the diversity of objects in images and the variance of PSMs for those objects. Specifically, AIS selects images containing various kinds of objects to maintain diversity. Moreover, AIS guarantees high variance of PSMs across persons, since that variance distinguishes the regions that many persons commonly gaze at from those they do not. The proposed method enables the selection of similar users from a small number of images by choosing images with high diversity and variance. This is the technical contribution of this paper. Experimental results show the effectiveness of our personalized saliency prediction, including the new image selection scheme.
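A greedy toy version of what an adaptive image selection scheme of this kind might look like, assuming per-image object label sets and precomputed PSM variances are available; the scoring rule and weight w are invented for illustration:

```python
import numpy as np

def select_images(obj_labels, psm_var, k=5, w=0.5):
    """Greedy toy AIS: prefer images whose object categories are not yet
    covered (diversity) and whose PSM variance across persons is high.
    obj_labels: list of sets of object categories per image
    psm_var:    per-image variance of personalized saliency maps"""
    chosen, covered = [], set()
    for _ in range(k):
        def score(i):
            gain = len(obj_labels[i] - covered)  # object diversity gain
            return gain + w * psm_var[i]         # plus PSM variance
        cand = [i for i in range(len(obj_labels)) if i not in chosen]
        best = max(cand, key=score)
        chosen.append(best)
        covered |= obj_labels[best]
    return chosen

labels = [{"dog"}, {"dog", "car"}, {"tree"}, {"car", "person"}]
print(select_images(labels, psm_var=[0.2, 0.5, 0.1, 0.4], k=2))
```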
Affiliation(s)
- Yuya Moroto, Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Keisuke Maeda, Office of Institutional Research, Hokkaido University, N-8, W-5, Kita-ku, Sapporo, Hokkaido 060-0808, Japan
- Takahiro Ogawa, Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan
- Miki Haseyama, Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo, Hokkaido 060-0814, Japan

23. Zhang P, Liu J, Wang X, Pu T, Fei C, Guo Z. Stereoscopic video saliency detection based on spatiotemporal correlation and depth confidence optimization. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.10.024]

24. Ma H, Acton ST, Lin Z. SITUP: Scale Invariant Tracking using Average Peak-to-Correlation Energy. IEEE Trans Image Process 2020; 29:3546-3557. [PMID: 31944955] [DOI: 10.1109/tip.2019.2962694]
Abstract
Robust and accurate scale estimation of a target object is a challenging task in visual object tracking. Most existing tracking methods cannot accommodate large scale variation in complex image sequences and thus result in inferior performance. In this paper, we propose to incorporate a novel criterion called the average peak-to-correlation energy into the multi-resolution translation filter framework to obtain robust and accurate scale estimation. The resulting system is named SITUP: Scale Invariant Tracking using Average Peak-to-Correlation Energy. SITUP effectively tackles the problem of fixed template size in standard discriminative correlation filter based trackers. Extensive empirical evaluation on the publicly available tracking benchmark datasets demonstrates that the proposed scale searching framework meets the demands of scale variation challenges effectively while providing superior performance over other scale adaptive variants of standard discriminative correlation filter based trackers. Also, SITUP obtains favorable performance compared to state-of-the-art trackers for various scenarios while operating in real-time on a single CPU.

25. Huang R, Feng W, Wang Z, Xing Y, Zou Y. Exemplar-based image saliency and co-saliency detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.09.011]

26. Chen L, Zheng Z, Bian P, Mo J, Khodja AE. Multi-scale fusion and non-local attention mechanism based salient object detection. MATEC Web Conf 2020; 309:03028. [DOI: 10.1051/matecconf/202030903028]
Abstract
With the development of deep learning, research in the field of computer vision is attracting more attention. As a pre-processing operation for visual tasks, a saliency model may focus on pure architectures. This paper proposes a new multi-scale fusion network that enriches high-level redundant information with an enlarged receptive field. With the guidance of an attention mechanism, the framework can capture more effective correlated spatial and channel information. Short connections between the high-level features and each lower level transmit the contextual features. The model can be used in a variety of complex scenes for end-to-end image detection, with a simple structure and strong versatility. Experimental results obtained on multiple common datasets show that the proposed model achieves better performance both in visual effect and in accuracy for small-object and multi-target detection.

27. Embedding topological features into convolutional neural network salient object detection. Neural Netw 2020; 121:308-318. [DOI: 10.1016/j.neunet.2019.09.009]

28. Xu Y, Gao S, Wu J, Li N, Yu J. Personalized Saliency and Its Prediction. IEEE Trans Pattern Anal Mach Intell 2019; 41:2975-2989. [PMID: 30136932] [DOI: 10.1109/tpami.2018.2866563]
Abstract
Nearly all existing visual saliency models have so far focused on predicting a universal saliency map across all observers. Yet psychology studies suggest that the visual attention of different observers can vary significantly under specific circumstances, especially when a scene is composed of multiple salient objects. To study such heterogeneous visual attention patterns across observers, we first construct a personalized saliency dataset and explore correlations between visual attention, personal preferences, and image contents. Specifically, we propose to decompose a personalized saliency map (referred to as PSM) into a universal saliency map (referred to as USM) predictable by existing saliency detection models and a new discrepancy map across users that characterizes personalized saliency. We then present two solutions for predicting such discrepancy maps, i.e., a multi-task convolutional neural network (CNN) framework and an extended CNN with Person-specific Information Encoded Filters (CNN-PIEF). Extensive experimental results demonstrate the effectiveness of our models for PSM prediction as well as their generalization capability for unseen observers.
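The PSM = USM + discrepancy decomposition is easy to state in code; this is only the bookkeeping behind the formulation, not the CNN-PIEF model:

```python
import numpy as np

# Decompose a personalized saliency map (PSM) into a universal saliency map
# (USM, from any existing model) plus a per-observer discrepancy map; the
# discrepancy is what the paper's CNNs learn to predict. Toy data below.
usm = np.random.rand(48, 48)                  # stand-in for a USM prediction
psm_observer = np.clip(usm + 0.1 * np.random.randn(48, 48), 0, 1)

discrepancy_target = psm_observer - usm       # regression target for the CNN
psm_reconstructed = usm + discrepancy_target  # exact by construction here
assert np.allclose(psm_reconstructed, psm_observer)
```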

29. Assessment of feature fusion strategies in visual attention mechanism for saliency detection. Pattern Recognit Lett 2019. [DOI: 10.1016/j.patrec.2018.08.022]

30. Saliency detection via double nuclear norm maximization and ensemble manifold regularization. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2019.07.021]

31. Xu M, Song Y, Wang J, Qiao M, Huo L, Wang Z. Predicting Head Movement in Panoramic Video: A Deep Reinforcement Learning Approach. IEEE Trans Pattern Anal Mach Intell 2019; 41:2693-2708. [PMID: 30047871] [DOI: 10.1109/tpami.2018.2858783]
Abstract
Panoramic video provides immersive and interactive experience by enabling humans to control the field of view (FoV) through head movement (HM). Thus, HM plays a key role in modeling human attention on panoramic video. This paper establishes a database collecting subjects' HM in panoramic video sequences. From this database, we find that the HM data are highly consistent across subjects. Furthermore, we find that deep reinforcement learning (DRL) can be applied to predict HM positions, via maximizing the reward of imitating human HM scanpaths through the agent's actions. Based on our findings, we propose a DRL-based HM prediction (DHP) approach with offline and online versions, called offline-DHP and online-DHP. In offline-DHP, multiple DRL workflows are run to determine potential HM positions at each panoramic frame. Then, a heat map of the potential HM positions, named the HM map, is generated as the output of offline-DHP. In online-DHP, the next HM position of one subject is estimated given the currently observed HM position, which is achieved by developing a DRL algorithm upon the learned offline-DHP model. Finally, the experiments validate that our approach is effective in both offline and online prediction of HM positions for panoramic video, and that the learned offline-DHP model can improve the performance of online-DHP.

32. Li J, Fu K, Zhao S, Ge S. Spatiotemporal Knowledge Distillation for Efficient Estimation of Aerial Video Saliency. IEEE Trans Image Process 2019; 29:1902-1914. [PMID: 31613767] [DOI: 10.1109/tip.2019.2946102]
Abstract
The performance of video saliency estimation techniques has achieved significant advances along with the rapid development of Convolutional Neural Networks (CNNs). However, devices like cameras and drones may have limited computational capability and storage space so that the direct deployment of complex deep saliency models becomes infeasible. To address this problem, this paper proposes a dynamic saliency estimation approach for aerial videos via spatiotemporal knowledge distillation. In this approach, five components are involved, including two teachers, two students and the desired spatiotemporal model. The knowledge of spatial and temporal saliency is first separately transferred from the two complex and redundant teachers to their simple and compact students, while the input scenes are also degraded from high-resolution to low-resolution to remove the probable data redundancy so as to greatly speed up the feature extraction process. After that, the desired spatiotemporal model is further trained by distilling and encoding the spatial and temporal saliency knowledge of two students into a unified network. In this manner, the inter-model redundancy can be removed for the effective estimation of dynamic saliency on aerial videos. Experimental results show that the proposed approach is comparable to 11 state-of-the-art models in estimating visual saliency on aerial videos, while its speed reaches up to 28,738 FPS and 1,490.5 FPS on the GPU and CPU platforms, respectively.
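A hedged sketch of a two-teacher distillation objective in the spirit described (spatial and temporal teachers supervising one student); the MSE/BCE choice and the weights are assumptions for illustration, not the paper's losses:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_map, spat_teacher_map, temp_teacher_map,
                 gt_map=None, alpha=0.5, beta=0.5):
    """Toy spatiotemporal distillation: the student imitates the spatial and
    temporal teachers' saliency maps, optionally supervised by ground truth."""
    loss = alpha * F.mse_loss(student_map, spat_teacher_map) \
         + beta * F.mse_loss(student_map, temp_teacher_map)
    if gt_map is not None:
        loss = loss + F.binary_cross_entropy(student_map, gt_map)
    return loss

s = torch.rand(2, 1, 56, 56, requires_grad=True)
print(distill_loss(torch.sigmoid(s),
                   torch.rand(2, 1, 56, 56),
                   torch.rand(2, 1, 56, 56)))
```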

33. Sun M, Zhou Z, Hu Q, Wang Z, Jiang J. SG-FCN: A Motion and Memory-Based Deep Learning Model for Video Saliency Detection. IEEE Trans Cybern 2019; 49:2900-2911. [PMID: 29993731] [DOI: 10.1109/tcyb.2018.2832053]
Abstract
Data-driven saliency detection has attracted strong interest as a result of applying convolutional neural networks to the detection of eye fixations. Although a number of image-based salient object and fixation detection models have been proposed, video fixation detection still requires more exploration. Different from image analysis, motion and temporal information is a crucial factor affecting human attention when viewing video sequences. Although existing models based on local contrast and low-level features have been extensively researched, they failed to simultaneously consider interframe motion and temporal information across neighboring video frames, leading to unsatisfactory performance when handling complex scenes. To this end, we propose a novel and efficient video eye fixation detection model to improve the saliency detection performance. By simulating the memory mechanism and visual attention mechanism of human beings when watching a video, we propose a step-gained fully convolutional network by combining the memory information on the time axis with the motion information on the space axis while storing the saliency information of the current frame. The model is obtained through hierarchical training, which ensures the accuracy of the detection. Extensive experiments in comparison with 11 state-of-the-art methods are carried out, and the results show that our proposed model outperforms all 11 methods across a number of publicly available datasets.

34. Xiao Y, Jiang B, Zheng A, Zhou A, Hussain A, Tang J. Saliency detection via multi-view graph based saliency optimization. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.03.066]

35. Khanna MT, Ralekar C, Goel A, Chaudhury S, Lall B. Memorability-based image compression. IET Image Process 2019; 13:1490-1501. [DOI: 10.1049/iet-ipr.2018.6097]
Affiliation(s)
- Meera Thapar Khanna, Multimedia Lab, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
- Chetan Ralekar, Multimedia Lab, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
- Anurika Goel, Multimedia Lab, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; IBD Analyst at Goldman Sachs, Bengaluru, Karnataka, India
- Santanu Chaudhury, Multimedia Lab, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India; CSIR-Central Electronics Engineering Research Institute, Pilani, India
- Brejesh Lall, Multimedia Lab, Department of Electrical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

36. Costela FM, Woods RL. A free database of eye movements watching "Hollywood" videoclips. Data Brief 2019; 25:103991. [PMID: 31428665] [PMCID: PMC6693682] [DOI: 10.1016/j.dib.2019.103991]
Abstract
The provided database of tracked eye movements was collected using an infra-red, video-camera Eyelink 1000 system from 95 participants as they viewed 'Hollywood' video clips. There are 206 clips of 30 s and eleven clips of 30 min, for a total viewing time of about 60 hours. The database also provides the raw 30-s video clip files, a short preview of the 30-min clips, and subjective ratings of the content of each video in the following categories: (1) genre; (2) importance of human faces; (3) importance of human figures; (4) importance of man-made objects; (5) importance of nature; (6) auditory information; (7) lighting; and (8) environment type. Precise timing of the scene cuts within the clips and the democratic gaze scanpath position (center of interest) per frame are provided. At this time, this eye-movement dataset has the widest age range (22-85 years) and is the third largest (in recorded video viewing time) of those that have been made available to the research community. The data-acquisition procedures are described, along with participant demographics, summaries of some common eye-movement statistics, and highlights of research topics in which the database has been used. The dataset is freely available in the Open Science Framework repository (link in the manuscript) and can be used without restriction for educational and research purposes, provided that this paper is cited in any published work.
Affiliation(s)
- Francisco M Costela, Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA, USA
- Russell L Woods, Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA, USA

37. Utility function generated saccade strategies for robot active vision: a probabilistic approach. Auton Robots 2019. [DOI: 10.1007/s10514-018-9752-3]

38. Bylinskii Z, Judd T, Oliva A, Torralba A, Durand F. What Do Different Evaluation Metrics Tell Us About Saliency Models? IEEE Trans Pattern Anal Mach Intell 2019; 41:740-757. [PMID: 29993800] [DOI: 10.1109/tpami.2018.2815601]
Abstract
How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building on the differences in metric properties and behaviors, we make recommendations for metric selection under specific assumptions and for specific applications.
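Three of the commonly analyzed metrics (NSS, CC, and a KL-divergence variant) in compact form, following their standard definitions; note that published implementations differ in normalization and smoothing details across benchmarks:

```python
import numpy as np

def nss(sal, fix):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels.
    sal: saliency map; fix: binary fixation map of the same shape."""
    z = (sal - sal.mean()) / (sal.std() + 1e-8)
    return z[fix.astype(bool)].mean()

def cc(sal, dens):
    """Pearson correlation between a saliency map and a fixation density map."""
    a = (sal - sal.mean()).ravel()
    b = (dens - dens.mean()).ravel()
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def kld(sal, dens, eps=1e-8):
    """KL divergence of the fixation density from the normalized saliency map."""
    p = dens / (dens.sum() + eps)
    q = sal / (sal.sum() + eps)
    return np.sum(p * np.log(eps + p / (q + eps)))

sal = np.random.rand(32, 32)
fix = np.random.rand(32, 32) > 0.98
print(nss(sal, fix), cc(sal, sal), kld(sal, sal))  # cc(x, x) == 1, kld(x, x) ~ 0
```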

39. Divide-and-Conquer Dual-Architecture Convolutional Neural Network for Classification of Hyperspectral Images. Remote Sens 2019. [DOI: 10.3390/rs11050484]
Abstract
Convolutional neural network (CNN) is well known for its powerful capability in image classification. In hyperspectral images (HSIs), a fixed-size spatial window is generally used as the input of a CNN for pixel-wise classification. However, a single fixed-size spatial architecture hinders the excellent performance of CNNs because it neglects the varied land-cover distributions in HSIs. Moreover, insufficient samples in HSIs may cause overfitting. To address these problems, a novel divide-and-conquer dual-architecture CNN (DDCNN) method is proposed for HSI classification. In DDCNN, a novel regional division strategy based on local and non-local decisions is devised to distinguish homogeneous and heterogeneous regions. Then, for homogeneous regions, a multi-scale CNN architecture with larger spatial window inputs is constructed to learn joint spectral-spatial features. For heterogeneous regions, a fine-grained CNN architecture with smaller spatial window inputs is constructed to learn hierarchical spectral features. Moreover, to alleviate the problem of insufficient training samples, unlabeled samples with high confidence are pre-labeled under an adaptive spatial constraint. Experimental results on HSIs demonstrate that the proposed method provides encouraging classification performance, especially in region uniformity and edge preservation, with limited training samples.

40. Quality-Oriented Perceptual HEVC Based on the Spatiotemporal Saliency Detection Model. Entropy 2019; 21:165. [PMID: 33266881] [PMCID: PMC7514647] [DOI: 10.3390/e21020165]
Abstract
Perceptual video coding (PVC) can provide a lower bitrate at the same visual quality compared with traditional H.265/High Efficiency Video Coding (HEVC). In this work, a novel H.265/HEVC-compliant PVC framework is proposed based on a video saliency model. First, an effective and efficient spatiotemporal saliency model is used to generate a video saliency map. Second, a perceptual coding scheme is developed based on the saliency map: a saliency-based quantization control algorithm is proposed to reduce the bitrate. Finally, the simulation results demonstrate that the proposed perceptual coding scheme shows its superiority in objective and subjective tests, achieving up to a 9.46% bitrate reduction with negligible subjective and objective quality loss. The advantage of the proposed method is high quality suited to high-definition video applications.
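A simple saliency-driven quantization control in the spirit of the description: per-CTU mean saliency mapped to a QP delta, clipped to the HEVC range; the CTU size and max_delta are illustrative, not the paper's algorithm:

```python
import numpy as np

def ctu_qp_offsets(saliency, ctu=64, base_qp=32, max_delta=6):
    """Average saliency per CTU, then map high saliency to negative QP deltas
    (finer quantization) and low saliency to positive deltas (coarser)."""
    h, w = saliency.shape
    qp = np.full(((h + ctu - 1) // ctu, (w + ctu - 1) // ctu), base_qp, float)
    for i in range(qp.shape[0]):
        for j in range(qp.shape[1]):
            block = saliency[i*ctu:(i+1)*ctu, j*ctu:(j+1)*ctu]
            qp[i, j] += max_delta * (1 - 2 * block.mean())  # mean in [0, 1]
    return np.clip(np.round(qp), 0, 51).astype(int)          # HEVC QP range

sal = np.random.rand(128, 192)  # stand-in for a spatiotemporal saliency map
print(ctu_qp_offsets(sal))
```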

41.

42. Xiao Y, Jiang B, Tu Z, Ma J, Tang J. A prior regularized multi-layer graph ranking model for image saliency computation. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.06.072]

43.
Abstract
To improve the invisibility and robustness of multiplicative watermarking algorithms, an adaptive image watermarking algorithm is proposed based on a visual saliency model and the Laplacian distribution in the wavelet domain. The algorithm designs an adaptive multiplicative watermark strength factor by utilizing the energy aggregation of the high-frequency wavelet sub-band, texture masking and visual saliency characteristics. Then, image blocks with high energy are selected as the watermark embedding space to achieve imperceptibility of the watermark. For watermark detection, the Laplacian distribution model is used to model the wavelet coefficients, and a blind watermark detection approach is derived based on the maximum-likelihood scheme. Finally, the paper presents a simulation analysis and performance comparison of the proposed algorithm. Experimental results show that the proposed algorithm is robust against additive white Gaussian noise, JPEG compression, median filtering, scaling, rotation and other attacks.
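A minimal sketch of multiplicative embedding on (assumed) high-frequency wavelet coefficients drawn from a Laplacian model; the correlation detector shown is a simple stand-in, not the paper's maximum-likelihood detector:

```python
import numpy as np

def embed(coeffs, watermark, alpha=0.05):
    """Multiplicative spread-spectrum embedding y = x * (1 + alpha * w),
    applied to high-frequency wavelet coefficients of a selected block."""
    return coeffs * (1.0 + alpha * watermark)

rng = np.random.default_rng(0)
x = rng.laplace(scale=4.0, size=1024)      # Laplacian-modeled HF coefficients
w = rng.choice([-1.0, 1.0], size=x.size)   # bipolar watermark sequence
y = embed(x, w)

# A likelihood-ratio detector under the Laplacian model would compare
# p(y | w) against p(y | no watermark); a crude correlation check suffices
# here to show the embedded sequence is recoverable.
print(np.sign(np.corrcoef(np.abs(y) - np.abs(x), w)[0, 1]))  # ~ +1.0
```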

44. Xu M, Liu Y, Hu H, He F. Find who to look at: Turning from action to saliency. IEEE Trans Image Process 2018; 27:4529-4544. [PMID: 29993577] [DOI: 10.1109/tip.2018.2837106]
Abstract
The past decade has witnessed the use of high-level features in saliency prediction for both videos and images. Unfortunately, existing saliency prediction methods only handle high-level static features, such as faces. In fact, high-level dynamic features (also called actions), such as speaking or head turning, are also extremely attractive to visual attention in videos. Thus, in this paper, we propose a data-driven method for learning to predict the saliency of multiple-face videos by leveraging both static and dynamic features at a high level. Specifically, we introduce an eye-tracking database collecting the fixations of 39 subjects viewing 65 multiple-face videos. Through analysis of our database, we find a set of high-level features that cause a face to receive extensive visual attention. These high-level features include the static features of face size, center bias and head pose, as well as the dynamic features of speaking and head turning. We then present techniques for extracting these high-level features. Afterwards, a novel model, namely the multiple hidden Markov model (M-HMM), is developed in our method to enable the transition of saliency among faces. In our M-HMM, the saliency transition takes into account both the state of saliency at previous frames and the observed high-level features at the current frame. The experimental results show that the proposed method is superior to other state-of-the-art methods in predicting visual attention on multiple-face videos. Finally, we shed light on a promising implementation of our saliency prediction method in locating the region-of-interest (ROI) for video conference compression with High Efficiency Video Coding (HEVC).

45. Mahdi A, Su M, Schlesinger M, Qin J. A Comparison Study of Saliency Models for Fixation Prediction on Infants and Adults. IEEE Trans Cogn Dev Syst 2018. [DOI: 10.1109/tcds.2017.2696439]

46.
Abstract
Foveated sight as observed in some raptor eyes is a motivation for artificial imaging systems requiring both wide fields of view as well as specific embedded regions of higher resolution. These foveated optical imaging systems are applicable to many acquisition and tracking tasks and as such are often required to be relatively portable and operate in real-time. Two approaches to achieve foveation have been explored in the past: optical system design and back-end data processing. In this paper, these previous works are compiled and used to build a framework for analyzing and designing practical foveated imaging systems. While each approach (physical control of optical distortion within the lens design process, and post-processing image re-sampling) has its own pros and cons, it is concluded that a combination of both techniques will further spur the development of more versatile, flexible, and adaptable foveated imaging systems in the future.

47. Efficient 3D Objects Recognition Using Multifoveated Point Clouds. Sensors (Basel) 2018; 18:2302. [PMID: 30012990] [PMCID: PMC6068497] [DOI: 10.3390/s18072302]
Abstract
Technological innovations in the hardware of RGB-D sensors have allowed the acquisition of 3D point clouds in real time. Consequently, various applications related to the 3D world have arisen and are receiving increasing attention from researchers. Nevertheless, one of the main remaining problems is the demand for computationally intensive processing, which requires optimized approaches to 3D vision modeling, especially when tasks must be performed in real time. A previously proposed multi-resolution 3D model known as foveated point clouds is a possible solution to this problem. However, that model is limited to a single foveated structure with context-dependent mobility. In this work, we propose a new solution for data reduction and feature detection using multifoveation in the point cloud. Nonetheless, applying several foveated structures considerably increases processing, since intersections between regions of distinct structures are processed multiple times. To solve this problem, the current proposal avoids the processing of redundant regions, which further reduces processing time. This approach can be used to identify objects in 3D point clouds, one of the key tasks for real-time applications such as robotic vision, with efficient synchronization allowing validation of the model and verification of its applicability in the context of computer vision. Experimental results demonstrate a performance gain of at least 27.21% in processing time while retaining the main features of the original and maintaining the recognition quality rate in comparison with state-of-the-art 3D object recognition methods.
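A toy multifoveated subsampling of a point cloud in which each point is tested once against all foveae, so overlapping foveal regions are not processed repeatedly (the redundancy the abstract says it avoids); the radius and keep fraction are invented parameters:

```python
import numpy as np

def multifoveate(points, foveas, r=0.5, keep_far=0.1, rng=None):
    """Keep all points within radius r of ANY fovea; elsewhere keep only a
    random fraction keep_far. Each point is classified exactly once, so
    intersecting foveal regions incur no repeated processing."""
    rng = rng or np.random.default_rng(0)
    # distance from each point to its nearest fovea
    d = np.min(np.linalg.norm(points[:, None, :] - foveas[None, :, :], axis=2),
               axis=1)
    keep = (d <= r) | (rng.random(len(points)) < keep_far)
    return points[keep]

pts = np.random.rand(10000, 3)
fov = np.array([[0.2, 0.2, 0.2], [0.8, 0.8, 0.8]])  # two foveal centers
print(multifoveate(pts, fov).shape)                   # reduced point count
```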

48. Kharma N, Mazhurin A, Saigol K, Sabahi F. Adaptable image segmentation via simple pixel classification. Comput Intell 2018. [DOI: 10.1111/coin.12173]
Affiliation(s)
- Nawwaf Kharma, Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, Canada
- Kamil Saigol, Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA, USA
- Farzad Sabahi, Department of Electrical and Computer Engineering, Concordia University, Montréal, QC, Canada

49. Wang Z, Ren J, Zhang D, Sun M, Jiang J. A deep-learning based feature hybrid framework for spatiotemporal saliency detection inside videos. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.076]

50. Liu N, Han J, Liu T, Li X. Learning to Predict Eye Fixations via Multiresolution Convolutional Neural Networks. IEEE Trans Neural Netw Learn Syst 2018; 29:392-404. [PMID: 27913362] [DOI: 10.1109/tnnls.2016.2628878]
Abstract
Eye movements in the case of freely viewing natural scenes are believed to be guided by local contrast, global contrast, and top-down visual factors. Although many previous works have explored these three saliency cues, much room for improvement remains in how to model and integrate them effectively. This paper proposes a novel computational model to predict eye fixations, which adopts a multiresolution convolutional neural network (Mr-CNN) to infer these three types of saliency cues from raw image data simultaneously. The proposed Mr-CNN is trained directly from fixation and non-fixation pixels with multiresolution input image regions providing different contexts. It utilizes image pixels as inputs and eye fixation points as labels. Both local and global contrasts are then learned by fusing information across multiple contexts, while various top-down factors are learned in higher layers. Finally, an optimal combination of top-down factors and bottom-up contrasts can be learned to predict eye fixations. The proposed approach significantly outperforms state-of-the-art methods on several publicly available benchmark databases, demonstrating the superiority of Mr-CNN. We also apply our method to the RGB-D image saliency detection problem. By jointly learning saliency cues induced by depth and RGB information at the pixel level, along with their interactions, our model achieves better performance in predicting eye fixations in RGB-D images.
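A sketch of the multiresolution input construction that such a model consumes: nested crops of growing context around one pixel, each resized to a common resolution; the window sizes here are illustrative, not the paper's exact settings:

```python
import numpy as np

def multires_patches(img, center, sizes=(42, 84, 168), out=42):
    """Crop nested windows of growing context around one pixel and resize
    each to a common resolution, yielding the multi-stream input style of
    Mr-CNN-like models."""
    h, w = img.shape[:2]
    cy, cx = center
    patches = []
    for s in sizes:
        half = s // 2
        y0, x0 = max(0, cy - half), max(0, cx - half)
        patch = img[y0:min(h, cy + half), x0:min(w, cx + half)]
        # nearest-neighbor resize to (out, out) with pure numpy indexing
        yi = np.linspace(0, patch.shape[0] - 1, out).astype(int)
        xi = np.linspace(0, patch.shape[1] - 1, out).astype(int)
        patches.append(patch[np.ix_(yi, xi)])
    return np.stack(patches)   # (3, out, out): one stream per context size

img = np.random.rand(240, 320)
print(multires_patches(img, center=(120, 160)).shape)   # (3, 42, 42)
```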