1. Bosco A, Sanz Diez P, Filippini M, De Vitis M, Fattori P. A focus on the multiple interfaces between action and perception and their neural correlates. Neuropsychologia 2023; 191:108722. [PMID: 37931747] [DOI: 10.1016/j.neuropsychologia.2023.108722]
Abstract
Successful behaviour relies on the appropriate interplay between action and perception. The well-established dorsal and ventral stream theories depicted two distinct functional pathways for the processes of action and perception, respectively. In physiological conditions, the two pathways closely cooperate in order to produce successful adaptive behaviour. Because perception and action are coupled, an interface responsible for a common reading of the two functions is required. Several studies have proposed different types of perception and action interfaces, suggesting their role in the creation of the shared interaction channel. In the present review, we describe three possible perception and action interfaces: i) the motor code, including common coding approaches, ii) attention, and iii) object affordance; we highlight their potential neural correlates. From this overview, a recurrent neural substrate that underlies all these interface functions appears to be crucial: the parieto-frontal circuit. This network is involved in the mirror mechanism which underlies the perception and action interfaces identified as common coding and motor code theories. The same network is also involved in the spotlight of attention and in the encoding of potential action towards objects; these are manifested in the perception and action interfaces for common attention and object affordance, respectively. Within this framework, most studies were dedicated to the description of the role of the inferior parietal lobule; growing evidence, however, suggests that the superior parietal lobule also plays a crucial role in the interplay between action and perception. The present review proposes a novel model that is inclusive of the superior parietal regions and their relative contribution to the different action and perception interfaces.
Affiliation(s)
- A Bosco: Department of Biomedical and Neuromotor Sciences, University of Bologna, Piazza di Porta San Donato 2, 40126, Bologna, Italy; Alma Mater Research Institute For Human-Centered Artificial Intelligence (Alma Human AI), University of Bologna, Via Galliera 3, 40121, Bologna, Italy
- P Sanz Diez: Carl Zeiss Vision International GmbH, Turnstrasse 27, 73430, Aalen, Germany; Institute for Ophthalmic Research, Eberhard Karls University Tuebingen, Elfriede-Aulhorn-Straße 7, 72076, Tuebingen, Germany
- M Filippini: Department of Biomedical and Neuromotor Sciences, University of Bologna, Piazza di Porta San Donato 2, 40126, Bologna, Italy; Alma Mater Research Institute For Human-Centered Artificial Intelligence (Alma Human AI), University of Bologna, Via Galliera 3, 40121, Bologna, Italy
- M De Vitis: Department of Biomedical and Neuromotor Sciences, University of Bologna, Piazza di Porta San Donato 2, 40126, Bologna, Italy
- P Fattori: Department of Biomedical and Neuromotor Sciences, University of Bologna, Piazza di Porta San Donato 2, 40126, Bologna, Italy; Alma Mater Research Institute For Human-Centered Artificial Intelligence (Alma Human AI), University of Bologna, Via Galliera 3, 40121, Bologna, Italy
2. Farahat A, Effenberger F, Vinck M. A novel feature-scrambling approach reveals the capacity of convolutional neural networks to learn spatial relations. Neural Netw 2023; 167:400-414. [PMID: 37673027] [DOI: 10.1016/j.neunet.2023.08.021]
Abstract
Convolutional neural networks (CNNs) are among the most successful computer vision systems for object recognition. Furthermore, CNNs have major applications in understanding the nature of visual representations in the human brain. Yet it remains poorly understood how CNNs actually make their decisions, what the nature of their internal representations is, and how their recognition strategies differ from humans. Specifically, there is a major debate about the question of whether CNNs primarily rely on surface regularities of objects, or whether they are capable of exploiting the spatial arrangement of features, similar to humans. Here, we develop a novel feature-scrambling approach to explicitly test whether CNNs use the spatial arrangement of features (i.e. object parts) to classify objects. We combine this approach with a systematic manipulation of effective receptive field sizes of CNNs as well as minimal recognizable configurations (MIRCs) analysis. In contrast to much previous literature, we provide evidence that CNNs are in fact capable of using relatively long-range spatial relationships for object classification. Moreover, the extent to which CNNs use spatial relationships depends heavily on the dataset, e.g. texture vs. sketch. In fact, CNNs even use different strategies for different classes within heterogeneous datasets (ImageNet), suggesting CNNs have a continuous spectrum of classification strategies. Finally, we show that CNNs learn the spatial arrangement of features only up to an intermediate level of granularity, which suggests that intermediate rather than global shape features provide the optimal trade-off between sensitivity and specificity in object classification. These results provide novel insights into the nature of CNN representations and the extent to which they rely on the spatial arrangement of features for object classification.
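The basic scrambling manipulation can be illustrated with a toy sketch (an illustrative assumption, not the authors' code): cutting an image into non-overlapping patches and permuting their positions preserves the local features while destroying their spatial arrangement.

```python
import numpy as np

def scramble_patches(image: np.ndarray, patch: int, seed: int = 0) -> np.ndarray:
    """Shuffle non-overlapping square patches of a (H, W) image.

    Local features inside each patch survive; only their spatial
    arrangement is destroyed. Illustrative only, not the paper's code.
    """
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Cut the image into a row-major grid of (patch x patch) tiles.
    tiles = [image[r:r + patch, c:c + patch]
             for r in range(0, h, patch)
             for c in range(0, w, patch)]
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(tiles))
    # Reassemble the tiles in shuffled order.
    cols = w // patch
    out = np.empty_like(image)
    for k, idx in enumerate(order):
        r, c = divmod(k, cols)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = tiles[idx]
    return out
```

A classifier that keeps its accuracy on such scrambled inputs is, by this logic, relying on local features rather than their spatial arrangement.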
Affiliation(s)
- Amr Farahat: Ernst Strüngmann Institute for Neuroscience in Cooperation with Max Planck Society, Frankfurt, Germany; Donders Centre for Neuroscience, Department of Neuroinformatics, Radboud University, Nijmegen, The Netherlands
- Felix Effenberger: Ernst Strüngmann Institute for Neuroscience in Cooperation with Max Planck Society, Frankfurt, Germany; Frankfurt Institute for Advanced Studies, Frankfurt, Germany
- Martin Vinck: Ernst Strüngmann Institute for Neuroscience in Cooperation with Max Planck Society, Frankfurt, Germany; Donders Centre for Neuroscience, Department of Neuroinformatics, Radboud University, Nijmegen, The Netherlands
3. Malik G, Crowder D, Mingolla E. Extreme image transformations affect humans and machines differently. Biol Cybern 2023; 117:331-343. [PMID: 37310489] [PMCID: PMC10600046] [DOI: 10.1007/s00422-023-00968-7]
Abstract
Some recent artificial neural networks (ANNs) claim to model aspects of primate neural and human performance data. Their success in object recognition is, however, dependent on exploiting low-level features for solving visual tasks in a way that humans do not. As a result, out-of-distribution or adversarial input is often challenging for ANNs. Humans instead learn abstract patterns and are mostly unaffected by many extreme image distortions. We introduce a set of novel image transforms inspired by neurophysiological findings and evaluate humans and ANNs on an object recognition task. We show that machines perform better than humans for certain transforms and struggle to perform on par with humans on others that are easy for humans. We quantify the differences in accuracy between humans and machines and derive a ranking of the difficulty of our transforms from the human data. We also suggest how certain characteristics of human visual processing can be adapted to improve the performance of ANNs on our difficult-for-machines transforms.
Affiliation(s)
- Girik Malik: Northeastern University, Boston, MA 02115, USA
4. Zheng S, Huang X, Chen J, Lyu Z, Zheng J, Huang J, Gao H, Liu S, Sun L. UR-Net: An Integrated ResUNet and Attention Based Image Enhancement and Classification Network for Stain-Free White Blood Cells. Sensors (Basel) 2023; 23:7605. [PMID: 37688058] [PMCID: PMC10490639] [DOI: 10.3390/s23177605]
Abstract
The differential count of white blood cells (WBCs) can effectively provide disease information for patients. Existing stained microscopic WBC classification usually requires complex sample-preparation steps, and is easily affected by external conditions such as illumination. At the same time, the inconspicuous nuclei of stain-free WBCs pose their own challenges for classification. As such, image enhancement, as one of the preprocessing steps for image classification, is essential for improving the image quality of stain-free WBCs. However, traditional and existing convolutional neural network (CNN)-based image enhancement techniques are typically designed as standalone modules aimed at improving perceptual quality for humans, without considering their impact on downstream computer vision tasks such as classification. Therefore, this work proposes a novel model, UR-Net, which consists of an image enhancement network framed by ResUNet with an attention mechanism and a ResNet classification network. The enhancement model is integrated into the classification model for joint training to improve the classification performance for stain-free WBCs. The experimental results demonstrate that, compared to models without image enhancement and to previous enhancement-and-classification models, our proposed model achieved the best classification accuracy of 83.34% on our stain-free WBC dataset.
Affiliation(s)
- Sikai Zheng, Xiwei Huang, Jin Chen, Zefei Lyu, Jingwen Zheng, Jiye Huang, Haijun Gao, Lingling Sun: Ministry of Education Key Laboratory of RF Circuits and Systems, Hangzhou Dianzi University, Hangzhou 310018, China
- Shan Liu: Sichuan Provincial Key Laboratory for Human Disease Gene Study, Sichuan Academy of Medical Sciences & Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China, Chengdu 610072, China
5. Liu J, Zhao X, Zhao K, Goncharov VG, Delhommelle J, Lin J, Guo X. A modulated fingerprint assisted machine learning method for retrieving elastic moduli from resonant ultrasound spectroscopy. Sci Rep 2023; 13:5919. [PMID: 37041266] [PMCID: PMC10090122] [DOI: 10.1038/s41598-023-33046-w]
Abstract
We used deep-learning-based models to automatically obtain elastic moduli from resonant ultrasound spectroscopy (RUS) spectra, a task that conventionally requires user intervention with published analysis codes. By strategically converting theoretical RUS spectra into their modulated fingerprints and using them as a dataset to train neural network models, we obtained models that successfully predicted both elastic moduli from theoretical test spectra of an isotropic material and from a measured steel RUS spectrum with up to 9.6% of resonances missing. We further trained modulated fingerprint-based models to resolve RUS spectra from yttrium-aluminum-garnet (YAG) ceramic samples with three elastic moduli. The resulting models were capable of retrieving all three elastic moduli from spectra with up to 26% of frequencies missing. In summary, our modulated fingerprint method is an efficient tool to transform raw spectroscopy data and train neural network models with high accuracy and robustness to spectral distortion.
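The "modulated fingerprint" transform itself is specific to this paper, but the general idea it serves, recasting a variable-length set of resonance frequencies as a fixed-length input for a neural network, can be sketched with a simple binning scheme. The bin layout and parameters below are illustrative assumptions, not the authors' method.

```python
import numpy as np

def frequency_fingerprint(freqs_khz, f_min=0.0, f_max=1000.0, n_bins=256):
    """Encode a variable-length list of resonance frequencies as a
    fixed-length binary vector by binning the frequency axis.

    Illustrative stand-in for a spectrum-to-fingerprint transform; the
    paper's actual modulated-fingerprint construction differs.
    """
    fp = np.zeros(n_bins, dtype=np.float32)
    for f in freqs_khz:
        if f_min <= f < f_max:
            # Map the frequency to its bin index on [f_min, f_max).
            idx = int((f - f_min) / (f_max - f_min) * n_bins)
            fp[idx] = 1.0
    return fp
```

A fixed-length vector like this can be fed to an ordinary dense network, and a missing resonance simply leaves one bin empty rather than changing the input size.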
Affiliation(s)
- Juejing Liu: Department of Chemistry, Washington State University, Pullman, WA, 99164, USA; Alexandra Navrotsky Institute for Experimental Thermodynamics, Washington State University, Pullman, WA, 99164, USA; School of Mechanical and Materials Engineering, Washington State University, Pullman, WA, 99164, USA
- Xiaodong Zhao: Department of Chemistry, Washington State University, Pullman, WA, 99164, USA; Alexandra Navrotsky Institute for Experimental Thermodynamics, Washington State University, Pullman, WA, 99164, USA
- Ke Zhao: Alexandra Navrotsky Institute for Experimental Thermodynamics, Washington State University, Pullman, WA, 99164, USA
- Vitaliy G Goncharov: Department of Chemistry, Washington State University, Pullman, WA, 99164, USA; Alexandra Navrotsky Institute for Experimental Thermodynamics, Washington State University, Pullman, WA, 99164, USA; School of Mechanical and Materials Engineering, Washington State University, Pullman, WA, 99164, USA
- Jerome Delhommelle: Department of Chemistry, University of Massachusetts, Lowell, MA, 01854, USA
- Jian Lin: School of Nuclear Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, Shaanxi, China
- Xiaofeng Guo: Department of Chemistry, Washington State University, Pullman, WA, 99164, USA; Alexandra Navrotsky Institute for Experimental Thermodynamics, Washington State University, Pullman, WA, 99164, USA; School of Mechanical and Materials Engineering, Washington State University, Pullman, WA, 99164, USA
6. McGugin RW, Sunday MA, Gauthier I. The neural correlates of domain-general visual ability. Cereb Cortex 2023; 33:4280-4292. [PMID: 36045003] [DOI: 10.1093/cercor/bhac342]
Abstract
People vary in their general ability to compare, identify, and remember objects. Research using latent variable modeling identifies a domain-general visual recognition ability (called o) that reflects correlations among different visual tasks and categories. We measure associations between a psychometrically sensitive measure of o and a neurometrically sensitive measure of visual sensitivity to shape. We report evidence for distributed neural correlates of o using functional and anatomical regions-of-interest (ROIs) as well as whole brain analyses. Neural selectivity to shape is associated with o in several regions of the ventral pathway, as well as additional foci in parietal and premotor cortex. Multivariate analyses suggest the distributed effects in ventral cortex reflect a common mechanism. The network of brain areas where neural selectivity predicts o is similar to that evoked by the most informative features for object recognition in prior work, showing convergence of two different approaches on identifying areas that support the best object recognition performance. Because o predicts performance across many visual tasks for both novel and familiar objects, we propose that o could predict the magnitude of neural changes in task-relevant areas following experience with specific tasks and object categories.
Affiliation(s)
- Rankin W McGugin: Department of Psychology, Vanderbilt University, 301 Wilson Hall, Nashville, TN 37240, United States
- Mackenzie A Sunday: Department of Psychology, Vanderbilt University, 301 Wilson Hall, Nashville, TN 37240, United States
- Isabel Gauthier: Department of Psychology, Vanderbilt University, 301 Wilson Hall, Nashville, TN 37240, United States
7. Sadagopan S, Kar M, Parida S. Quantitative models of auditory cortical processing. Hear Res 2023; 429:108697. [PMID: 36696724] [PMCID: PMC9928778] [DOI: 10.1016/j.heares.2023.108697]
Abstract
To generate insight from experimental data, it is critical to understand the inter-relationships between individual data points and place them in context within a structured framework. Quantitative modeling can provide the scaffolding for such an endeavor. Our main objective in this review is to provide a primer on the range of quantitative tools available to experimental auditory neuroscientists. Quantitative modeling is advantageous because it can provide a compact summary of observed data, make underlying assumptions explicit, and generate predictions for future experiments. Quantitative models may be developed to characterize or fit observed data, to test theories of how a task may be solved by neural circuits, to determine how observed biophysical details might contribute to measured activity patterns, or to predict how an experimental manipulation would affect neural activity. In complexity, quantitative models can range from those that are highly biophysically realistic and that include detailed simulations at the level of individual synapses, to those that use abstract and simplified neuron models to simulate entire networks. Here, we survey the landscape of recently developed models of auditory cortical processing, highlighting a small selection of models to demonstrate how they help generate insight into the mechanisms of auditory processing. We discuss examples ranging from models that use details of synaptic properties to explain the temporal pattern of cortical responses to those that use modern deep neural networks to gain insight into human fMRI data. We conclude by discussing a biologically realistic and interpretable model that our laboratory has developed to explore aspects of vocalization categorization in the auditory pathway.
Affiliation(s)
- Srivatsun Sadagopan: Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA; Center for the Neural Basis of Cognition, University of Pittsburgh, Pittsburgh, PA, USA; Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA, USA; Department of Communication Science and Disorders, University of Pittsburgh, Pittsburgh, PA, USA
- Manaswini Kar: Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA; Center for the Neural Basis of Cognition, University of Pittsburgh, Pittsburgh, PA, USA
- Satyabrata Parida: Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA, USA; Center for Neuroscience, University of Pittsburgh, Pittsburgh, PA, USA
8. Castellotti S, D'Agostino O, Del Viva MM. Fast discrimination of fragmentary images: the role of local optimal information. Front Hum Neurosci 2023; 17:1049615. [PMID: 36845876] [PMCID: PMC9945129] [DOI: 10.3389/fnhum.2023.1049615]
Abstract
In naturalistic conditions, objects in the scene may be partly occluded and the visual system has to recognize the whole image based on the little information contained in some visible fragments. Previous studies demonstrated that humans can successfully recognize severely occluded images, but the underlying mechanisms occurring in the early stages of visual processing are still poorly understood. The main objective of this work is to investigate the contribution of local information contained in a few visible fragments to image discrimination in fast vision. It has been already shown that a specific set of features, predicted by a constrained maximum-entropy model to be optimal carriers of information (optimal features), are used to build simplified early visual representations (primal sketch) that are sufficient for fast image discrimination. These features are also considered salient by the visual system and can guide visual attention when presented isolated in artificial stimuli. Here, we explore whether these local features also play a significant role in more natural settings, where all existing features are kept, but the overall available information is drastically reduced. Indeed, the task requires discrimination of naturalistic images based on a very brief presentation (25 ms) of a few small visible image fragments. In the main experiment, we reduced the possibility to perform the task based on global-luminance positional cues by presenting randomly inverted-contrast images, and we measured how much observers' performance relies on the local features contained in the fragments or on global information. The size and the number of fragments were determined in two preliminary experiments. Results show that observers are very skilled in fast image discrimination, even when a drastic occlusion is applied. When observers cannot rely on the position of global-luminance information, the probability of correct discrimination increases when the visible fragments contain a high number of optimal features. These results suggest that such optimal local information contributes to the successful reconstruction of naturalistic images even in challenging conditions.
9. Ayzenberg V, Behrmann M. Does the brain's ventral visual pathway compute object shape? Trends Cogn Sci 2022; 26:1119-1132. [PMID: 36272937] [DOI: 10.1016/j.tics.2022.09.019]
Abstract
A rich behavioral literature has shown that human object recognition is supported by a representation of shape that is tolerant to variations in an object's appearance. Such 'global' shape representations are achieved by describing objects via the spatial arrangement of their local features, or structure, rather than by the appearance of the features themselves. However, accumulating evidence suggests that the ventral visual pathway - the primary substrate underlying object recognition - may not represent global shape. Instead, ventral representations may be better described as a basis set of local image features. We suggest that this evidence forces a reevaluation of the role of the ventral pathway in object perception and posits a broader network for shape perception that encompasses contributions from the dorsal pathway.
Affiliation(s)
- Vladislav Ayzenberg: Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Psychology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Marlene Behrmann: Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Psychology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Department of Ophthalmology, University of Pittsburgh, Pittsburgh, PA 15260, USA
10. Xu Y, Vaziri-Pashkam M. Understanding transformation tolerant visual object representations in the human brain and convolutional neural networks. Neuroimage 2022; 263:119635. [PMID: 36116617] [PMCID: PMC11283825] [DOI: 10.1016/j.neuroimage.2022.119635]
Abstract
Forming transformation-tolerant object representations is critical to high-level primate vision. Despite its significance, many details of tolerance in the human brain remain unknown. Likewise, despite the ability of convolutional neural networks (CNNs) to exhibit human-like object categorization performance, whether CNNs form tolerance similar to that of the human brain is unknown. Here we provide the first comprehensive documentation and comparison of three tolerance measures in the human brain and CNNs. We measured fMRI responses from human ventral visual areas to real-world objects across both Euclidean and non-Euclidean feature changes. In single fMRI voxels in higher visual areas, we observed robust object response rank-order preservation across feature changes. This is indicative of functional smoothness in tolerance at the fMRI meso-scale level that has never been reported before. At the voxel population level, we found highly consistent object representational structure across feature changes towards the end of ventral processing. Rank-order preservation, consistency, and a third tolerance measure, cross-decoding success (i.e., a linear classifier's ability to generalize performance across feature changes) showed an overall tight coupling. These tolerance measures were in general lower for Euclidean than non-Euclidean feature changes in lower visual areas, but increased over the course of ventral processing for all feature changes. These characteristics of tolerance, however, were absent in eight CNNs pretrained with ImageNet images with varying network architecture, depth, the presence/absence of recurrent processing, or whether a network was pretrained with the original or stylized ImageNet images that encouraged shape processing. CNNs do not appear to develop the same kind of tolerance as the human brain over the course of visual processing.
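Of the three tolerance measures, rank-order preservation is the easiest to state concretely: a Spearman rank correlation between a unit's responses to the same objects under two feature conditions (e.g., before and after a size change). A minimal sketch of that computation, not the paper's analysis pipeline, might look like:

```python
import numpy as np

def rank_order_preservation(resp_a, resp_b):
    """Spearman rank correlation between one unit's responses to the
    same objects under two feature conditions. Values near 1 indicate
    that the object-preference ranking is tolerant to the change.
    Illustrative sketch; ties are not tie-averaged here.
    """
    # Double argsort converts raw responses to 0..n-1 ranks.
    ra = np.argsort(np.argsort(np.asarray(resp_a))).astype(float)
    rb = np.argsort(np.argsort(np.asarray(resp_b))).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    # Pearson correlation of the ranks = Spearman correlation.
    return float((ra * rb).sum() / np.sqrt((ra**2).sum() * (rb**2).sum()))
```

Cross-decoding, the third measure, generalizes this idea from one unit to a population: a classifier trained on responses in condition A is tested on condition B.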
Affiliation(s)
- Yaoda Xu: Psychology Department, Yale University, New Haven, CT 06520, USA
11. Sp A. Trailblazers in Neuroscience: Using compositionality to understand how parts combine in whole objects. Eur J Neurosci 2022; 56:4378-4392. [PMID: 35760552] [PMCID: PMC10084036] [DOI: 10.1111/ejn.15746]
Abstract
A fundamental question for any visual system is whether its image representation can be understood in terms of its components. Decomposing any image into components is challenging because there are many possible decompositions with no common dictionary, and enumerating them leads to a combinatorial explosion. Even in perception, many objects are readily seen as containing parts, but there are many exceptions. These exceptions include objects that are not perceived as containing parts, properties like symmetry that cannot be localized to any single part, and also special categories like words and faces whose perception is widely believed to be holistic. Here, I describe a novel approach we have used to address these issues and evaluate compositionality at the behavioral and neural levels. The key design principle is to create a large number of objects by combining a small number of pre-defined components in all possible ways. This allows for building component-based models that explain whole objects using a combination of these components. Importantly, any systematic error in model fits can be used to detect the presence of emergent or holistic properties. Using this approach, we have found that whole object representations are surprisingly predictable from their components, that some components are preferred to others in perception, and that emergent properties can be discovered or explained using compositional models. Thus, compositionality is a powerful approach for understanding how whole objects relate to their parts.
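The component-based modeling described here can be sketched, under simplifying assumptions, as an ordinary least-squares fit that predicts whole-object responses from the parts each object contains; systematic residuals would then flag emergent, non-compositional properties. The matrix layout below is a hypothetical illustration, not the author's code.

```python
import numpy as np

def fit_part_sum_model(X_parts, y_whole):
    """Fit responses to whole objects as a weighted sum of their parts.

    X_parts: (n_objects, n_parts) matrix indicating which parts each
    object contains; y_whole: observed response to each whole object.
    Returns per-part weights and residuals; large systematic residuals
    would point to emergent properties the parts cannot explain.
    Illustrative sketch of a component-based model.
    """
    X = np.asarray(X_parts, dtype=float)
    y = np.asarray(y_whole, dtype=float)
    # Least-squares weights: one contribution per part.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, y - X @ w
```

When whole-object responses really are sums of part contributions, the residuals vanish; deviations from zero localize where compositionality breaks down.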
Affiliation(s)
- Arun Sp: Centre for Neuroscience, Indian Institute of Science, Bangalore
12.
Abstract
Vision and learning have long been considered to be two areas of research linked only distantly. However, recent developments in vision research have changed the conceptual definition of vision from a signal-evaluating process to a goal-oriented interpreting process, and this shift binds learning, together with the resulting internal representations, intimately to vision. In this review, we consider various types of learning (perceptual, statistical, and rule/abstract) associated with vision in the past decades and argue that they represent differently specialized versions of the fundamental learning process, which must be captured in its entirety when applied to complex visual processes. We show why the generalized version of statistical learning can provide the appropriate setup for such a unified treatment of learning in vision, what computational framework best accommodates this kind of statistical learning, and what plausible neural scheme could feasibly implement this framework. Finally, we list the challenges that the field of statistical learning faces in fulfilling the promise of being the right vehicle for advancing our understanding of vision in its entirety. Expected final online publication date for the Annual Review of Vision Science, Volume 8 is September 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- József Fiser: Department of Cognitive Science, Center for Cognitive Computation, Central European University, Vienna 1100, Austria
- Gábor Lengyel: Department of Brain and Cognitive Sciences, University of Rochester, Rochester, New York 14627, USA
13. Guerin F. Projection: a mechanism for human-like reasoning in Artificial Intelligence. J Exp Theor Artif Intell 2022. [DOI: 10.1080/0952813x.2022.2078889]
Affiliation(s)
- F. Guerin
- Department of Computer Science, University of Surrey, Guildford, UK
14

15
Ayzenberg V, Kamps FS, Dilks DD, Lourenco SF. Skeletal representations of shape in the human visual cortex. Neuropsychologia 2022; 164:108092. [PMID: 34801519 PMCID: PMC9840386 DOI: 10.1016/j.neuropsychologia.2021.108092] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2021] [Revised: 11/07/2021] [Accepted: 11/17/2021] [Indexed: 01/17/2023]
Abstract
Shape perception is crucial for object recognition. However, it remains unknown exactly how shape information is represented and used by the visual system. Here, we tested the hypothesis that the visual system represents object shape via a skeletal structure. Using functional magnetic resonance imaging (fMRI) and representational similarity analysis (RSA), we found that a model of skeletal similarity explained significant unique variance in the response profiles of V3 and LO. Moreover, the skeletal model remained predictive in these regions even when controlling for other models of visual similarity that approximate low- to high-level visual features (i.e., Gabor-jet, GIST, HMAX, and AlexNet), and across different surface forms, a manipulation that altered object contours while preserving the underlying skeleton. Together, these findings shed light on shape processing in human vision, as well as the computational properties of V3 and LO. We discuss how these regions may support two putative roles of shape skeletons: namely, perceptual organization and object recognition.
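The representational similarity analysis (RSA) at the core of this study has a compact recipe: build a representational dissimilarity matrix (RDM) per system, then rank-correlate their upper triangles. A minimal NumPy sketch of that core (the function names and the random stand-in data are illustrative, not taken from the study, which additionally partials out the competing models):

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between response patterns; `features` is (n_conditions, n_units)."""
    return 1.0 - np.corrcoef(features)

def upper_tri(m):
    """Off-diagonal upper triangle -- the part of the RDM RSA compares."""
    i, j = np.triu_indices(m.shape[0], k=1)
    return m[i, j]

def rsa_score(features_a, features_b):
    """Spearman correlation between two systems' RDMs (rank, then Pearson)."""
    a = upper_tri(rdm(features_a))
    b = upper_tri(rdm(features_b))
    rank = lambda x: np.argsort(np.argsort(x)).astype(float)
    return float(np.corrcoef(rank(a), rank(b))[0, 1])
```

Testing for *unique* variance of the skeletal model, as the abstract describes, would extend this to a regression of the brain RDM on the stacked candidate-model RDMs rather than a single pairwise correlation.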
Affiliation(s)
- Vladislav Ayzenberg
- Department of Psychology, Carnegie Mellon University, USA. Corresponding author: V. Ayzenberg
- Frederik S. Kamps
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, USA
- Stella F. Lourenco
- Department of Psychology, Emory University, USA. Corresponding author: S.F. Lourenco

16
General intelligence disentangled via a generality metric for natural and artificial intelligence. Sci Rep 2021; 11:22822. [PMID: 34819537 PMCID: PMC8613222 DOI: 10.1038/s41598-021-01997-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2020] [Accepted: 11/01/2021] [Indexed: 11/08/2022] Open
Abstract
Success in all sorts of situations is the most classical interpretation of general intelligence. Under limited resources, however, the capability of an agent must necessarily be limited too, and generality needs to be understood as comprehensive performance up to a level of difficulty. The degree of generality then refers to the way an agent's capability is distributed as a function of task difficulty. This dissects the notion of general intelligence into two non-populational measures, generality and capability, which we apply to individuals and groups of humans, other animals and AI systems, on several cognitive and perceptual tests. Our results indicate that generality and capability can decouple at the individual level: very specialised agents can show high capability and vice versa. The metrics also decouple at the population level, and we rarely see diminishing returns in generality for those groups of high capability. We relate the individual measure of generality to traditional notions of general intelligence and cognitive efficiency in humans, collectives, non-human animals and machines. The choice of the difficulty function now plays a prominent role in this new conception of generality, which brings a quantitative tool for shedding light on long-standing questions about the evolution of general intelligence and the evaluation of progress in Artificial General Intelligence.
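The decoupling the abstract describes can be made concrete by summarizing an agent's performance-vs-difficulty curve with two numbers: its total area (capability) and how concentrated the performance drop-off is (a generality proxy). The sketch below is my own toy simplification of that idea, not the paper's exact metrics:

```python
import numpy as np

def capability(perf, difficulty):
    """Toy capability: area under the performance-vs-difficulty curve."""
    perf, difficulty = np.asarray(perf, float), np.asarray(difficulty, float)
    return float(np.sum((perf[1:] + perf[:-1]) / 2 * np.diff(difficulty)))

def generality(perf, difficulty):
    """Toy generality proxy: treat the performance lost between consecutive
    difficulty levels as a distribution over difficulty. A single sharp drop
    (success on everything up to a threshold) has zero spread and maximal
    generality; scattered, specialist-like success spreads the drop out."""
    perf, difficulty = np.asarray(perf, float), np.asarray(difficulty, float)
    mid = (difficulty[1:] + difficulty[:-1]) / 2
    w = np.clip(-np.diff(perf), 0.0, None)
    w = w / w.sum()  # assumes performance drops somewhere on the tested range
    mean = float(np.sum(w * mid))
    spread = float(np.sqrt(np.sum(w * (mid - mean) ** 2)))
    return 1.0 / (1.0 + spread)

# Two hypothetical agents over the same difficulty range: an all-or-nothing
# "general" agent and a gradual one with the same ordering of capabilities.
d = np.arange(1.0, 6.0)
step = [1.0, 1.0, 1.0, 0.0, 0.0]    # solves everything up to difficulty 3
grad = [1.0, 0.75, 0.5, 0.25, 0.0]  # partial success spread over difficulties
```

Under this toy definition the step-shaped agent gets maximal generality while the gradual agent gets lower generality despite a comparable capability, illustrating how the two measures can move independently.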
17
Oculo-retinal dynamics can explain the perception of minimal recognizable configurations. Proc Natl Acad Sci U S A 2021; 118:2022792118. [PMID: 34417308 DOI: 10.1073/pnas.2022792118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Natural vision is a dynamic and continuous process. Under natural conditions, visual object recognition typically involves continuous interactions between ocular motion and visual contrasts, resulting in dynamic retinal activations. In order to identify the dynamic variables that participate in this process and are relevant for image recognition, we used a set of images that are just above and below the human recognition threshold and whose recognition typically requires >2 s of viewing. We recorded eye movements of participants while attempting to recognize these images within trials lasting 3 s. We then assessed the activation dynamics of retinal ganglion cells resulting from ocular dynamics using a computational model. We found that while the saccadic rate was similar between recognized and unrecognized trials, the fixational ocular speed was significantly larger for unrecognized trials. Interestingly, however, retinal activation level was significantly lower during these unrecognized trials. We used retinal activation patterns and oculomotor parameters of each fixation to train a binary classifier, classifying recognized from unrecognized trials. Only retinal activation patterns could predict recognition, reaching 80% correct classifications on the fourth fixation (on average, ∼2.5 s from trial onset). We thus conclude that the information that is relevant for visual perception is embedded in the dynamic interactions between the oculomotor sequence and the image. Hence, our results suggest that ocular dynamics play an important role in recognition and that understanding the dynamics of retinal activation is crucial for understanding natural vision.
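The decoding step described here, predicting recognized vs. unrecognized trials from fixation-level features, amounts to fitting a binary classifier. A self-contained sketch on synthetic stand-in data (the feature names and effect directions mirror the abstract; the numbers, and the choice of plain logistic regression, are my own assumptions, not the study's pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fixation features: recognized trials (y = 1) get higher retinal
# activation and lower fixational ocular speed, matching the direction of
# effects reported in the abstract; all magnitudes are invented.
n = 200
y = rng.integers(0, 2, n)
activation = rng.normal(1.0 + 0.8 * y, 0.5)  # higher when recognized
speed = rng.normal(1.5 - 0.6 * y, 0.5)       # lower when recognized
X = np.column_stack([np.ones(n), activation, speed])

# Plain logistic regression fit by batch gradient descent on the
# negative log-likelihood.
w = np.zeros(3)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * (X.T @ (p - y)) / n

pred = 1.0 / (1.0 + np.exp(-X @ w)) > 0.5
acc = float(np.mean(pred == y))              # training accuracy
```

With separable stand-in features like these, the fitted weights recover the assumed directions (positive for activation, negative for speed); the study's reported ~80% correct classification would correspond to held-out, per-fixation evaluation rather than the training accuracy computed here.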
18
Zhang C, Duan XH, Wang LY, Li YL, Yan B, Hu GE, Zhang RY, Tong L. Dissociable Neural Representations of Adversarially Perturbed Images in Convolutional Neural Networks and the Human Brain. Front Neuroinform 2021; 15:677925. [PMID: 34421567 PMCID: PMC8375771 DOI: 10.3389/fninf.2021.677925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2021] [Accepted: 06/28/2021] [Indexed: 11/28/2022] Open
Abstract
Despite the remarkable similarities between convolutional neural networks (CNN) and the human brain, CNNs still fall behind humans in many visual tasks, indicating that there still exist considerable differences between the two systems. Here, we leverage adversarial noise (AN) and adversarial interference (AI) images to quantify the consistency between neural representations and perceptual outcomes in the two systems. Humans can successfully recognize AI images as the same categories as their corresponding regular images but perceive AN images as meaningless noise. In contrast, CNNs recognize AN images as the same categories as their corresponding regular images but classify AI images into the wrong categories with surprisingly high confidence. We use functional magnetic resonance imaging to measure brain activity evoked by regular and adversarial images in the human brain, and compare it to the activity of artificial neurons in a prototypical CNN, AlexNet. In the human brain, we find that the representational similarity between regular and adversarial images largely echoes their perceptual similarity in all early visual areas. In AlexNet, however, the neural representations of adversarial images are inconsistent with network outputs in all intermediate processing layers, providing no neural foundations for the similarities at the perceptual level. Furthermore, we show that voxel-encoding models trained on regular images can successfully generalize to the neural responses to AI images but not AN images. These remarkable differences between the human brain and AlexNet in representation-perception association suggest that future CNNs should emulate both the behavior and the internal neural representations of the human brain.
Affiliation(s)
- Chi Zhang
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China
- Xiao-Han Duan
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China
- Lin-Yuan Wang
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China
- Yong-Li Li
- People's Hospital of Henan Province, Zhengzhou, China
- Bin Yan
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China
- Guo-En Hu
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China
- Ru-Yuan Zhang
- Institute of Psychology and Behavioral Science, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Li Tong
- Henan Key Laboratory of Imaging and Intelligent Processing, PLA Strategic Support Force Information Engineering University, Zhengzhou, China

19
Sensitivity to geometric shape regularity in humans and baboons: A putative signature of human singularity. Proc Natl Acad Sci U S A 2021; 118:2023123118. [PMID: 33846254 DOI: 10.1073/pnas.2023123118] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
Among primates, humans are special in their ability to create and manipulate highly elaborate structures of language, mathematics, and music. Here we show that this sensitivity to abstract structure is already present in a much simpler domain: the visual perception of regular geometric shapes such as squares, rectangles, and parallelograms. We asked human subjects to detect an intruder shape among six quadrilaterals. Although the intruder was always defined by an identical amount of displacement of a single vertex, the results revealed a geometric regularity effect: detection was considerably easier when either the base shape or the intruder was a regular figure comprising right angles, parallelism, or symmetry rather than a more irregular shape. This effect was replicated in several tasks and in all human populations tested, including uneducated Himba adults and French kindergartners. Baboons, however, showed no such geometric regularity effect, even after extensive training. Baboon behavior was captured by convolutional neural networks (CNNs), but neither CNNs nor a variational autoencoder captured the human geometric regularity effect. However, a symbolic model, based on exact properties of Euclidean geometry, closely fitted human behavior. Our results indicate that the human propensity for symbolic abstraction permeates even elementary shape perception. They suggest a putative signature of human singularity and provide a challenge for nonsymbolic models of human shape perception.
20
Examining the Coding Strength of Object Identity and Nonidentity Features in Human Occipito-Temporal Cortex and Convolutional Neural Networks. J Neurosci 2021; 41:4234-4252. [PMID: 33789916 DOI: 10.1523/jneurosci.1993-20.2021] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2020] [Revised: 03/12/2021] [Accepted: 03/15/2021] [Indexed: 12/17/2022] Open
Abstract
A visual object is characterized by multiple visual features, including its identity, position and size. Despite the usefulness of identity and nonidentity features in vision and their joint coding throughout the primate ventral visual processing pathway, they have so far been studied relatively independently. Here in both female and male human participants, the coding of identity and nonidentity features was examined together across the human ventral visual pathway. The nonidentity features tested included two Euclidean features (position and size) and two non-Euclidean features (image statistics and spatial frequency (SF) content of an image). Overall, identity representation increased and nonidentity feature representation decreased along the ventral visual pathway, with identity outweighing the non-Euclidean but not the Euclidean features at higher levels of visual processing. In 14 convolutional neural networks (CNNs) pretrained for object categorization with varying architecture, depth, and with/without recurrent processing, nonidentity feature representation showed an initial large increase from early to mid-stage of processing, followed by a decrease at later stages of processing, different from brain responses. Additionally, from lower to higher levels of visual processing, position became more underrepresented and image statistics and SF became more overrepresented compared with identity in CNNs than in the human brain. Similar results were obtained in a CNN trained with stylized images that emphasized shape representations. Overall, by measuring the coding strength of object identity and nonidentity features together, our approach provides a new tool for characterizing feature coding in the human brain and the correspondence between the brain and CNNs.

SIGNIFICANCE STATEMENT: This study examined the coding strength of object identity and four types of nonidentity features along the human ventral visual processing pathway and compared brain responses with those of 14 convolutional neural networks (CNNs) pretrained to perform object categorization. Overall, identity representation increased and nonidentity feature representation decreased along the ventral visual pathway, with some notable differences among the different nonidentity features. CNNs differed from the brain in a number of aspects in their representations of identity and nonidentity features over the course of visual processing. Our approach provides a new tool for characterizing feature coding in the human brain and the correspondence between the brain and CNNs.
21
Funke CM, Borowski J, Stosio K, Brendel W, Wallis TSA, Bethge M. Five points to check when comparing visual perception in humans and machines. J Vis 2021; 21:16. [PMID: 33724362 PMCID: PMC7980041 DOI: 10.1167/jov.21.3.16] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2020] [Accepted: 12/02/2020] [Indexed: 11/24/2022] Open
Abstract
With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed toward comparing information processing in humans and machines. These studies are an exciting chance to learn about one system by studying the other. Here, we propose ideas on how to design, conduct, and interpret experiments such that they adequately support the investigation of mechanisms when comparing human and machine perception. We demonstrate and apply these ideas through three case studies. The first case study shows how human bias can affect the interpretation of results and that several analytic tools can help to overcome this human reference point. In the second case study, we highlight the difference between necessary and sufficient mechanisms in visual reasoning tasks. Thereby, we show that contrary to previous suggestions, feedback mechanisms might not be necessary for the tasks in question. The third case study highlights the importance of aligning experimental conditions. We find that a previously observed difference in object recognition does not hold when adapting the experiment to make conditions more equitable between humans and machines. In presenting a checklist for comparative studies of visual reasoning in humans and machines, we hope to highlight how to overcome potential pitfalls in design and inference.
Affiliation(s)
- Karolina Stosio
- University of Tübingen, Tübingen, Germany
- Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany
- Volkswagen Group Machine Learning Research Lab, Munich, Germany
- Wieland Brendel
- University of Tübingen, Tübingen, Germany
- Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany
- Werner Reichardt Centre for Integrative Neuroscience, Tübingen, Germany
- Thomas S A Wallis
- University of Tübingen, Tübingen, Germany
- Present address: Amazon.com, Tübingen
- Matthias Bethge
- University of Tübingen, Tübingen, Germany
- Bernstein Center for Computational Neuroscience, Tübingen and Berlin, Germany
- Werner Reichardt Centre for Integrative Neuroscience, Tübingen, Germany

22
Saxe A, Nelli S, Summerfield C. If deep learning is the answer, what is the question? Nat Rev Neurosci 2020; 22:55-67. [PMID: 33199854 DOI: 10.1038/s41583-020-00395-8] [Citation(s) in RCA: 112] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/02/2020] [Indexed: 11/09/2022]
Abstract
Neuroscience research is undergoing a minor revolution. Recent advances in machine learning and artificial intelligence research have opened up new ways of thinking about neural computation. Many researchers are excited by the possibility that deep neural networks may offer theories of perception, cognition and action for biological brains. This approach has the potential to radically reshape our approach to understanding neural systems, because the computations performed by deep networks are learned from experience, and not endowed by the researcher. If so, how can neuroscientists use deep networks to model and understand biological brains? What is the outlook for neuroscientists who seek to characterize computations or neural codes, or who wish to understand perception, attention, memory and executive functions? In this Perspective, our goal is to offer a road map for systems neuroscience research in the age of deep learning. We discuss the conceptual and methodological challenges of comparing behaviour, learning dynamics and neural representations in artificial and biological systems, and we highlight new research questions that have emerged for neuroscience as a direct consequence of recent advances in machine learning.
Affiliation(s)
- Andrew Saxe
- Department of Experimental Psychology, University of Oxford, Oxford, UK
- Stephanie Nelli
- Department of Experimental Psychology, University of Oxford, Oxford, UK

23
Abstract
Does the human mind resemble the machines that can behave like it? Biologically inspired machine-learning systems approach "human-level" accuracy in an astounding variety of domains, and even predict human brain activity-raising the exciting possibility that such systems represent the world like we do. However, even seemingly intelligent machines fail in strange and "unhumanlike" ways, threatening their status as models of our minds. How can we know when human-machine behavioral differences reflect deep disparities in their underlying capacities, vs. when such failures are only superficial or peripheral? This article draws on a foundational insight from cognitive science-the distinction between performance and competence-to encourage "species-fair" comparisons between humans and machines. The performance/competence distinction urges us to consider whether the failure of a system to behave as ideally hypothesized, or the failure of one creature to behave like another, arises not because the system lacks the relevant knowledge or internal capacities ("competence"), but instead because of superficial constraints on demonstrating that knowledge ("performance"). I argue that this distinction has been neglected by research comparing human and machine behavior, and that it should be essential to any such comparison. Focusing on the domain of image classification, I identify three factors contributing to the species-fairness of human-machine comparisons, extracted from recent work that equates such constraints. Species-fair comparisons level the playing field between natural and artificial intelligence, so that we can separate more superficial differences from those that may be deep and enduring.
Affiliation(s)
- Chaz Firestone
- Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218

24
Costela FM, Woods RL. The Impact of Field of View on Understanding of a Movie Is Reduced by Magnifying Around the Center of Interest. Transl Vis Sci Technol 2020; 9:6. [PMID: 32855853 PMCID: PMC7422781 DOI: 10.1167/tvst.9.8.6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2019] [Accepted: 05/12/2020] [Indexed: 01/26/2023] Open
Abstract
Purpose: Magnification is commonly used to reduce the impact of impaired central vision. However, magnification limits the field of view (FoV), which may make it difficult to follow the story. Most people with normal vision look in about the same place at about the same time, the center of interest (COI), when watching “Hollywood” movies. We hypothesized that if the FoV was centered at the COI, then this view would provide more useful information than either the original image center or an unrelated view location (the COI locations from a different video clip) as the FoV reduced.
Methods: The FoV was varied between 100% (original) and 3%. To measure video comprehension as the FoV reduced, subjects described 30-second video clips in response to two open-ended questions. A computational, natural-language approach was used to provide an information acquisition (IA) score.
Results: The IA scores reduced as the FoV decreased. When the FoV was around the COI, subjects were better able to understand the content of the video clips (higher IA scores) as the FoV decreased than in the other conditions. Thus, magnification around the COI may serve as a better video enhancement approach than simple magnification of the image center.
Conclusions: These results have implications for future image processing and scene viewing, which may help people with central vision loss view directed dynamic visual content (“Hollywood” movies).
Translational Relevance: Our results are promising for the use of magnification around the COI as a vision rehabilitation aid for people with central vision loss.
Affiliation(s)
- Francisco M Costela
- Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA, USA
- Russell L Woods
- Schepens Eye Research Institute, Massachusetts Eye and Ear, Boston, MA, USA; Department of Ophthalmology, Harvard Medical School, Boston, MA, USA

25
Ben-Yosef G, Kreiman G, Ullman S. Minimal videos: Trade-off between spatial and temporal information in human and machine vision. Cognition 2020; 201:104263. [PMID: 32325309 DOI: 10.1016/j.cognition.2020.104263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2019] [Revised: 03/03/2020] [Accepted: 03/05/2020] [Indexed: 11/25/2022]
Abstract
Objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. Human recognition in minimal videos is invariably accompanied by full interpretation of the internal components of the video. State-of-the-art deep convolutional networks for dynamic recognition cannot replicate human behavior in these configurations. The gap between human and machine vision demonstrated here is due to critical mechanisms for full spatiotemporal interpretation that are lacking in current computational models.
Affiliation(s)
- Guy Ben-Yosef
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Gabriel Kreiman
- Children's Hospital, Harvard Medical School, Boston, MA 02115, USA; Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Shimon Ullman
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

26

27
Han Y, Roig G, Geiger G, Poggio T. Scale and translation-invariance for novel objects in human vision. Sci Rep 2020; 10:1411. [PMID: 31996698 PMCID: PMC6989457 DOI: 10.1038/s41598-019-57261-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2019] [Accepted: 12/19/2019] [Indexed: 11/09/2022] Open
Abstract
Though the range of invariance in recognition of novel objects is a basic aspect of human vision, its characterization has remained surprisingly elusive. Here we report tolerance to scale and position changes in one-shot learning by measuring recognition accuracy of Korean letters presented in a flash to non-Korean subjects who had no previous experience with Korean letters. We found that humans have significant scale-invariance after only a single exposure to a novel object. The range of translation-invariance is limited, depending on the size and position of presented objects. To understand the underlying brain computation associated with the invariance properties, we compared experimental data with computational modeling results. Our results suggest that to explain invariant recognition of objects by humans, neural network models should explicitly incorporate built-in scale-invariance, by encoding different scale channels as well as eccentricity-dependent representations captured by neurons' receptive field sizes and sampling density that change with eccentricity. Our psychophysical experiments and related simulations strongly suggest that the human visual system uses a computational strategy that differs in some key aspects from current deep learning architectures, being more data efficient and relying more critically on eye-movements.
Affiliation(s)
- Yena Han
- Center for Brains, Minds and Machines, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA
- Gemma Roig
- Center for Brains, Minds and Machines, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA
- Computer Science Department, Goethe University Frankfurt, Frankfurt am Main, Germany
- Gad Geiger
- Center for Brains, Minds and Machines, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA
- Tomaso Poggio
- Center for Brains, Minds and Machines, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA

28
Sarvadevabhatla RK, Surya S, Mittal T, Babu RV. Pictionary-Style Word Guessing on Hand-Drawn Object Sketches: Dataset, Analysis and Deep Network Models. IEEE Trans Pattern Anal Mach Intell 2020; 42:221-231. [PMID: 30369439 DOI: 10.1109/tpami.2018.2877996] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The ability of intelligent agents to play games in human-like fashion is popularly considered a benchmark of progress in Artificial Intelligence. In our work, we introduce the first computational model aimed at Pictionary, the popular word-guessing social game. We first introduce Sketch-QA, a guessing task. Styled after Pictionary, Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data. Sketch-QA involves asking a fixed question ("What object is being drawn?") and gathering open-ended guess-words from human guessers. We analyze the resulting dataset and present many interesting findings therein. To mimic Pictionary-style guessing, we propose a deep neural model which generates guess-words in response to temporally evolving human-drawn object sketches. Our model even makes human-like mistakes while guessing, thus amplifying the human mimicry factor. We evaluate our model on the large-scale guess-word dataset generated via Sketch-QA task and compare with various baselines. We also conduct a Visual Turing Test to obtain human impressions of the guess-words generated by humans and our model. Experimental results demonstrate the promise of our approach for Pictionary and similarly themed games.
29
Abstract
Artificial vision has often been described as one of the key remaining challenges to be solved before machines can act intelligently. Recent developments in a branch of machine learning known as deep learning have catalyzed impressive gains in machine vision—giving a sense that the problem of vision is getting closer to being solved. The goal of this review is to provide a comprehensive overview of recent deep learning developments and to critically assess actual progress toward achieving human-level visual intelligence. I discuss the implications of the successes and limitations of modern machine vision algorithms for biological vision and the prospect for neuroscience to inform the design of future artificial vision systems.
Affiliation(s)
- Thomas Serre
- Department of Cognitive, Linguistic and Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, Rhode Island 02818, USA

30
Arguin M, Marleau I, Aubin M, Zahabi S, Leek EC. A surface-based code contributes to visual shape perception. J Vis 2019; 19:6. [PMID: 31509602 DOI: 10.1167/19.11.6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Considerable uncertainty remains regarding the types of features human vision uses for shape representation. Visual-search experiments are reported which assessed the hypothesis of a surface-based (i.e., edge-bounded polygons) code for shape representation in human vision. The results indicate slower search rates and/or longer response times when the target shape shares its constituent surfaces with distractors (conjunction condition) than when the target surfaces are unique in the display (nonconjunction condition). This demonstration is made using test conditions that strictly control any potential artifact pertaining to target-distractor similarity. The surface-based code suggested by this surface-conjunction effect is strictly 2-D, since the effect occurs even when the surfaces are shared between the target and distractors in the 2-D image but not in their 3-D instantiation. Congruently, this latter finding is unaltered by manipulations of the richness of the depth information offered by the stimuli. It is proposed that human vision uses a 2-D surface-based code for shape representation which, considering other key findings in the field, probably coexists with an alternative representation mode based on a type of structural description that can integrate information pertaining to the 3-D aspect of shapes.
Affiliation(s)
- Martin Arguin
- Centre de Recherche en Neuropsychologie Expérimentale et Cognition, Département de psychologie, Université de Montréal, Montréal, Canada
- Ian Marleau
- CISSS de la Montérégie-Ouest, Installation Longueuil, Longueuil, Canada
- Mercédès Aubin
- CÉGEP de Jonquière, Département des Sciences humaines, Jonquière, Canada
- Sacha Zahabi
- Centre de Recherche en Neuropsychologie Expérimentale et Cognition, Département de psychologie, Université de Montréal, Montréal, Canada
- E Charles Leek
- School of Psychology, Institute of Life and Human Sciences, University of Liverpool, Liverpool, UK
31
Ayzenberg V, Lourenco SF. Skeletal descriptions of shape provide unique perceptual information for object recognition. Sci Rep 2019; 9:9359. [PMID: 31249321 PMCID: PMC6597715 DOI: 10.1038/s41598-019-45268-y]
Abstract
With seemingly little effort, humans can both identify an object across large changes in orientation and extend category membership to novel exemplars. Although researchers argue that object shape is crucial in these cases, there are open questions as to how shape is represented for object recognition. Here we tested whether the human visual system incorporates a three-dimensional skeletal descriptor of shape to determine an object's identity. Skeletal models not only provide a compact description of an object's global shape structure, but also provide a quantitative metric by which to compare the visual similarity between shapes. Our results showed that a model of skeletal similarity explained the greatest amount of variance in participants' object dissimilarity judgments when compared with other computational models of visual similarity (Experiment 1). Moreover, parametric changes to an object's skeleton led to proportional changes in perceived similarity, even when controlling for another model of structure (Experiment 2). Importantly, participants preferentially categorized objects by their skeletons across changes to local shape contours and non-accidental properties (Experiment 3). Our findings highlight the importance of skeletal structure in vision, not only as a shape descriptor, but also as a diagnostic cue of object identity.
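The skeletal comparison described in this abstract can be illustrated with a toy metric: treat each skeleton as a set of medial-axis points and score similarity by a symmetrized nearest-point distance. This is a hedged sketch; the 2-D point sets and the metric below are illustrative assumptions, not the authors' actual 3-D skeletal models.

```python
# Toy sketch of comparing shapes by their skeletons: each skeleton is a
# set of 2-D medial-axis points, and dissimilarity is the symmetrized
# mean nearest-point distance. Illustrative stand-in only.

def skeleton_distance(s1, s2):
    """Mean distance from each point in one skeleton to its nearest
    point in the other, averaged over both directions."""
    def one_way(a, b):
        total = 0.0
        for (x1, y1) in a:
            total += min(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
                         for (x2, y2) in b)
        return total / len(a)
    return 0.5 * (one_way(s1, s2) + one_way(s2, s1))

# Identical skeletons are at distance 0; a rigidly shifted copy is at
# exactly the shift magnitude, so parametric changes to the skeleton
# produce proportional changes in the score.
bar = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shifted = [(x, y + 1.0) for (x, y) in bar]
```

Under this toy metric, small deformations of a skeleton move the score smoothly, which is the property that lets skeletal similarity be regressed against perceived similarity.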
32
Modelling face memory reveals task-generalizable representations. Nat Hum Behav 2019; 3:817-826. [PMID: 31209368 DOI: 10.1038/s41562-019-0625-3]
Abstract
Current cognitive theories are cast in terms of information-processing mechanisms that use mental representations. For example, people use their mental representations to identify familiar faces under various conditions of pose, illumination and ageing, or to draw resemblance between family members. Yet, the actual information contents of these representations are rarely characterized, which hinders knowledge of the mechanisms that use them. Here, we modelled the three-dimensional representational contents of 4 faces that were familiar to 14 participants as work colleagues. The representational contents were created by reverse-correlating identity information generated on each trial with judgements of the face's similarity to the individual participant's memory of this face. In a second study, testing new participants, we demonstrated the validity of the modelled contents using everyday face tasks that generalize identity judgements to new viewpoints, age and sex. Our work highlights that such models of mental representations are critical to understanding generalization behaviour and its underlying information-processing mechanisms.
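The reverse-correlation procedure summarized above can be sketched in a few lines: the estimated mental template is the judgement-weighted average of the random perturbations shown across trials. The low-dimensional "face" vector, the simulated observer and all parameters below are illustrative assumptions, not the study's actual stimuli or generative model.

```python
# Hedged sketch of reverse correlation: the estimated template is the
# judgement-weighted average of the random perturbations shown on each
# trial. The hidden template and simulated observer are illustrative.
import random

def reverse_correlate(n_trials, n_dims, judge, rng):
    template = [0.0] * n_dims
    total = 0.0
    for _ in range(n_trials):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        rating = judge(noise)          # similarity to the remembered face
        for i, x in enumerate(noise):
            template[i] += rating * x
        total += rating
    return [t / total for t in template]

# Simulated observer whose 'memory' is the hidden vector (1, 0, -1):
# ratings are higher for noise fields resembling it.
hidden = [1.0, 0.0, -1.0]
def judge(noise):
    return max(0.0, sum(h * x for h, x in zip(hidden, noise)))

rng = random.Random(0)
estimate = reverse_correlate(5000, 3, judge, rng)
```

With enough trials the estimate recovers the sign and relative weight of each dimension of the hidden template, which is the sense in which the method reveals representational contents.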
33
Holzinger Y, Ullman S, Harari D, Behrmann M, Avidan G. Minimal Recognizable Configurations Elicit Category-selective Responses in Higher Order Visual Cortex. J Cogn Neurosci 2019; 31:1354-1367. [PMID: 31059350 DOI: 10.1162/jocn_a_01420]
Abstract
Visual object recognition is performed effortlessly by humans notwithstanding the fact that it requires a series of complex computations, which are, as yet, not well understood. Here, we tested a novel account of the representations used for visual recognition and their neural correlates using fMRI. The rationale is based on previous research showing that a set of representations, termed "minimal recognizable configurations" (MIRCs), which are computationally derived and have unique psychophysical characteristics, serve as the building blocks of object recognition. We contrasted the BOLD responses elicited by MIRC images, derived from different categories (faces, objects, and places), sub-MIRCs, which are visually similar to MIRCs, but, instead, result in poor recognition and scrambled, unrecognizable images. Stimuli were presented in blocks, and participants indicated yes/no recognition for each image. We confirmed that MIRCs elicited higher recognition performance compared to sub-MIRCs for all three categories. Whereas fMRI activation in early visual cortex for both MIRCs and sub-MIRCs of each category did not differ from that elicited by scrambled images, high-level visual regions exhibited overall greater activation for MIRCs compared to sub-MIRCs or scrambled images. Moreover, MIRCs and sub-MIRCs from each category elicited enhanced activation in corresponding category-selective regions including fusiform face area and occipital face area (faces), lateral occipital cortex (objects), and parahippocampal place area and transverse occipital sulcus (places). These findings reveal the psychological and neural relevance of MIRCs and enable us to make progress in developing a more complete account of object recognition.
34
Liu ST, Montes-Lourido P, Wang X, Sadagopan S. Optimal features for auditory categorization. Nat Commun 2019; 10:1302. [PMID: 30899018 PMCID: PMC6428858 DOI: 10.1038/s41467-019-09115-y]
Abstract
Humans and vocal animals use vocalizations to communicate with members of their species. A necessary function of auditory perception is to generalize across the high variability inherent in vocalization production and classify them into behaviorally distinct categories ('words' or 'call types'). Here, we demonstrate that detecting mid-level features in calls achieves production-invariant classification. Starting from randomly chosen marmoset call features, we use a greedy search algorithm to determine the most informative and least redundant features necessary for call classification. High classification performance is achieved using only 10-20 features per call type. Predictions of tuning properties of putative feature-selective neurons accurately match some observed auditory cortical responses. This feature-based approach also succeeds for call categorization in other species, and for other complex classification tasks such as caller identification. Our results suggest that high-level neural representations of sounds are based on task-dependent features optimized for specific computational goals.
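The greedy search described in this abstract can be illustrated as forward feature selection: repeatedly add the candidate feature that most improves classification, and stop when no remaining feature helps. The toy call "features", samples and majority-vote classifier below are illustrative assumptions, not the authors' actual feature detectors or classifier.

```python
# Illustrative greedy forward feature selection: starting from a pool of
# candidate features, add the one whose inclusion most improves
# classification of labelled samples. All data here are toys.

def accuracy(selected, samples):
    """Classify a sample as 'target' if it contains at least half of the
    selected features; return the fraction of correct labels."""
    correct = 0
    for feats, label in samples:
        hits = sum(1 for f in selected if f in feats)
        pred = "target" if selected and hits >= len(selected) / 2 else "other"
        correct += (pred == label)
    return correct / len(samples)

def greedy_select(pool, samples, max_features):
    selected, best = [], 0.0
    for _ in range(max_features):
        gains = [(accuracy(selected + [f], samples), f)
                 for f in pool if f not in selected]
        if not gains:
            break
        score, f = max(gains)
        if score <= best:
            break  # no remaining feature improves classification
        selected.append(f)
        best = score
    return selected, best

# Toy 'calls': the feature 'twitter-sweep' perfectly marks targets, so
# the greedy search keeps it and stops.
samples = [
    ({"twitter-sweep", "noise"}, "target"),
    ({"twitter-sweep"}, "target"),
    ({"noise"}, "other"),
    ({"buzz"}, "other"),
]
selected, score = greedy_select(["twitter-sweep", "noise", "buzz"], samples, 3)
```

The stopping rule mirrors the abstract's finding that a small, informative, non-redundant feature set suffices for classification.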
Affiliation(s)
- Shi Tong Liu
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Pilar Montes-Lourido
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Xiaoqin Wang
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21205, USA
- Srivatsun Sadagopan
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Department of Otolaryngology, University of Pittsburgh, Pittsburgh, PA 15213, USA
35
Surface diagnosticity predicts the high-level representation of regular and irregular object shape in human vision. Atten Percept Psychophys 2019; 81:1589-1608. [PMID: 30864108 PMCID: PMC6647524 DOI: 10.3758/s13414-019-01698-4]
Abstract
The human visual system has an extraordinary capacity to compute three-dimensional (3D) shape structure for both geometrically regular and irregular objects. The goal of this study was to shed new light on the underlying representational structures that support this ability. Observers (N = 85) completed two complementary perceptual tasks. Experiment 1 involved whole–part matching of image parts to whole geometrically regular and irregular novel object shapes. Image parts comprised either regions of edge contour, volumetric parts, or surfaces. Performance was better for irregular than for regular objects and interacted with part type: volumes yielded better matching performance than surfaces for regular but not for irregular objects. The basis for this effect was further explored in Experiment 2, which used implicit part–whole repetition priming. Here, we orthogonally manipulated shape regularity and a new factor of surface diagnosticity (how predictive a single surface is of object identity). The results showed that surface diagnosticity, not object shape regularity, determined the differential processing of volumes and surfaces. Regardless of shape regularity, objects with low surface diagnosticity were better primed by volumes than by surfaces. In contrast, objects with high surface diagnosticity showed the opposite pattern. These findings are the first to show that surface diagnosticity plays a fundamental role in object recognition. We propose that surface-based shape primitives—rather than volumetric parts—underlie the derivation of 3D object shape in human vision.
36
Chen L, Singh S, Kailath T, Roychowdhury V. Brain-inspired automated visual object discovery and detection. Proc Natl Acad Sci U S A 2019; 116:96-105. [PMID: 30559207 PMCID: PMC6320548 DOI: 10.1073/pnas.1802103115]
Abstract
Despite significant recent progress, machine vision systems lag considerably behind their biological counterparts in performance, scalability, and robustness. A distinctive hallmark of the brain is its ability to automatically discover and model objects, at multiscale resolutions, from repeated exposures to unlabeled contextual data and then to be able to robustly detect the learned objects under various nonideal circumstances, such as partial occlusion and different view angles. Replication of such capabilities in a machine would require three key ingredients: (i) access to large-scale perceptual data of the kind that humans experience, (ii) flexible representations of objects, and (iii) an efficient unsupervised learning algorithm. The Internet fortunately provides unprecedented access to vast amounts of visual data. This paper leverages the availability of such data to develop a scalable framework for unsupervised learning of object prototypes-brain-inspired flexible, scale, and shift invariant representations of deformable objects (e.g., humans, motorcycles, cars, airplanes) comprised of parts, their different configurations and views, and their spatial relationships. Computationally, the object prototypes are represented as geometric associative networks using probabilistic constructs such as Markov random fields. We apply our framework to various datasets and show that our approach is computationally scalable and can construct accurate and operational part-aware object models much more efficiently than in much of the recent computer vision literature. We also present efficient algorithms for detection and localization in new scenes of objects and their partial views.
Affiliation(s)
- Lichao Chen
- Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095
- Sudhir Singh
- Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095
- Thomas Kailath
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
- Vwani Roychowdhury
- Department of Electrical and Computer Engineering, University of California, Los Angeles, CA 90095
37
What do neurons really want? The role of semantics in cortical representations. Psychology of Learning and Motivation 2019. [DOI: 10.1016/bs.plm.2019.03.005]
38
Majaj NJ, Pelli DG. Deep learning-Using machine learning to study biological vision. J Vis 2018; 18:2. [PMID: 30508427 PMCID: PMC6279369 DOI: 10.1167/18.13.2]
Abstract
Many vision science studies employ machine learning, especially the version called "deep learning." Neuroscientists use machine learning to decode neural responses. Perception scientists try to understand how living organisms recognize objects. To them, deep neural networks offer benchmark accuracies for recognition of learned stimuli. Originally machine learning was inspired by the brain. Today, machine learning is used as a statistical tool to decode brain activity. Tomorrow, deep neural networks might become our best model of brain function. This brief overview of the use of machine learning in biological vision touches on its strengths, weaknesses, milestones, controversies, and current directions. Here, we hope to help vision scientists assess what role machine learning should play in their research.
Affiliation(s)
- Najib J Majaj
- Center for Neural Science, New York University, New York, NY, USA
- Denis G Pelli
- Department of Psychology and Center for Neural Science, New York University, New York, NY, USA
39
Comparing the Visual Representations and Performance of Humans and Deep Neural Networks. Current Directions in Psychological Science 2018. [DOI: 10.1177/0963721418801342]
Abstract
Although deep neural networks (DNNs) are state-of-the-art artificial intelligence systems, it is unclear what insights, if any, they provide about human intelligence. We address this issue in the domain of visual perception. After briefly describing DNNs, we provide an overview of recent results comparing human visual representations and performance with those of DNNs. In many cases, DNNs acquire visual representations and processing strategies that are very different from those used by people. We conjecture that there are at least two factors preventing them from serving as better psychological models. First, DNNs are currently trained with impoverished data, such as data lacking important visual cues to three-dimensional structure, data lacking multisensory statistical regularities, and data in which stimuli are unconnected to an observer’s actions and goals. Second, DNNs typically lack adaptations to capacity limits, such as attentional mechanisms, visual working memory, and compressed mental representations biased toward preserving task-relevant abstractions.
40
1, 2, 3, Many—Perceptual Integration of Motif Repetitions. Symmetry (Basel) 2018. [DOI: 10.3390/sym10110661]
Abstract
It is generally assumed that the initial integration of visual information is limited in its spatial extent. Of particular interest is the extent to which image symmetries are detected and integrated. Here we studied the spatial extent of visual integration in textures constructed from wallpaper symmetry groups. Using tools from statistical physics, we obtained images ranging from symmetric ones to completely random ones, whereas the textural elements were of the same quality. Results show that the psychometric curves for 3 × 3 motif repetitions are similar to those of images having more repetitions, whereas an equivalent physical scaling of the images does not alter the performance.
41
Lindsay GW, Miller KD. How biological attention mechanisms improve task performance in a large-scale visual system model. eLife 2018; 7:e38105. [PMID: 30272560 PMCID: PMC6207429 DOI: 10.7554/elife.38105]
Abstract
How does attentional modulation of neural activity enhance performance? Here we use a deep convolutional neural network as a large-scale model of the visual system to address this question. We model the feature similarity gain model of attention, in which attentional modulation is applied according to neural stimulus tuning. Using a variety of visual tasks, we show that neural modulations of the kind and magnitude observed experimentally lead to performance changes of the kind and magnitude observed experimentally. We find that, at earlier layers, attention applied according to tuning does not successfully propagate through the network, and has a weaker impact on performance than attention applied according to values computed for optimally modulating higher areas. This raises the question of whether biological attention might be applied at least in part to optimize function rather than strictly according to tuning. We suggest a simple experiment to distinguish these alternatives.
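The feature similarity gain model mentioned above can be sketched as a multiplicative modulation: each unit's response is scaled by how similar its tuning is to the attended feature. The tuning values in [-1, 1] and the gain strength below are illustrative assumptions; the study applied such modulations inside a deep convolutional network.

```python
# Minimal sketch of feature-similarity gain: units tuned toward the
# attended feature (tuning near +1) are amplified, units tuned away
# (tuning near -1) are suppressed. Values here are illustrative.

def apply_feature_gain(activations, tuning, beta):
    """Scale each unit's activity by 1 + beta * tuning_similarity."""
    return [a * (1.0 + beta * t) for a, t in zip(activations, tuning)]

# At beta = 0.5, a perfectly tuned unit gains 50% activity, an untuned
# unit is unchanged, and an anti-tuned unit loses 50%.
modulated = apply_feature_gain([1.0, 1.0, 1.0], [1.0, 0.0, -1.0], 0.5)
```

The paper's question of whether attention follows tuning or optimizes performance amounts to asking where these per-unit gain values come from.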
Affiliation(s)
- Grace W Lindsay
- Center for Theoretical Neuroscience, College of Physicians and SurgeonsColumbia UniversityNew YorkUnited States
- Mortimer B. Zuckerman Mind Brain Behaviour InstituteColumbia UniversityNew YorkUnited States
| | - Kenneth D Miller
- Center for Theoretical Neuroscience, College of Physicians and SurgeonsColumbia UniversityNew YorkUnited States
- Mortimer B. Zuckerman Mind Brain Behaviour InstituteColumbia UniversityNew YorkUnited States
- Swartz Program in Theoretical NeuroscienceKavli Institute for Brain ScienceNew YorkUnited States
- Department of NeuroscienceColumbia UniversityNew YorkUnited States
| |
42
Ben-Yosef G, Ullman S. Image interpretation above and below the object level. Interface Focus 2018; 8:20180020. [PMID: 29951197 PMCID: PMC6015807 DOI: 10.1098/rsfs.2018.0020]
Abstract
Computational models of vision have advanced in recent years at a rapid rate, rivalling in some areas human-level performance. Much of the progress to date has focused on analysing the visual scene at the object level: the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper level, both 'below' the object level, such as identifying and localizing object parts and sub-parts, and 'above' the object level, such as identifying object relations and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image and their components, properties and inter-relations, a process referred to here as 'image interpretation'. In this paper, we describe recent directions, based on human and computer vision studies, towards human-like image interpretation beyond the reach of current schemes: both below the object level and at the level of meaningful configurations beyond the recognition of individual objects, in particular interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called 'minimal images', and at both levels recognition depends on combining 'bottom-up' processing, proceeding from low to higher levels of a processing hierarchy, with 'top-down' processing, proceeding from high to lower-level stages of visual analysis.
Affiliation(s)
- Guy Ben-Yosef
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel
- Shimon Ullman
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel
43
Gruber LZ, Haruvi A, Basri R, Irani M. Perceptual Dominance in Brief Presentations of Mixed Images: Human Perception vs. Deep Neural Networks. Front Comput Neurosci 2018; 12:57. [PMID: 30087604 PMCID: PMC6066547 DOI: 10.3389/fncom.2018.00057]
Abstract
Visual perception involves continuously choosing the most prominent inputs while suppressing others. Neuroscientists induce visual competitions in various ways to study why and how the brain makes choices of what to perceive. Recently deep neural networks (DNNs) have been used as models of the ventral stream of the visual system, due to similarities in both accuracy and hierarchy of feature representation. In this study we created non-dynamic visual competitions for humans by briefly presenting mixtures of two images. We then tested feed-forward DNNs with similar mixtures and examined their behavior. We found that both humans and DNNs tend to perceive only one image when presented with a mixture of two. We revealed image parameters which predict this perceptual dominance and compared their predictability for the two visual systems. Our findings can be used to both improve DNNs as models, as well as potentially improve their performance by imitating biological behaviors.
Affiliation(s)
- Liron Z Gruber
- Department of Neurobiology, Weizmann Institute of Science, Rehovot, Israel
- Aia Haruvi
- Department of Neurobiology, Weizmann Institute of Science, Rehovot, Israel
- Ronen Basri
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
- Michal Irani
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
44
Kim J, Ricci M, Serre T. Not-So-CLEVR: learning same-different relations strains feedforward neural networks. Interface Focus 2018; 8:20180011. [PMID: 29951191 DOI: 10.1098/rsfs.2018.0011]
Abstract
The advent of deep learning has recently led to great successes in various engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural network, now approach human accuracy on visual recognition tasks like image classification and face recognition. However, here we will show that feedforward neural networks struggle to learn abstract visual relations that are effortlessly recognized by non-human primates, birds, rodents and even insects. We systematically study the ability of feedforward neural networks to learn to recognize a variety of visual relations and demonstrate that same-different visual relations pose a particular strain on these networks. Networks fail to learn same-different visual relations when stimulus variability makes rote memorization difficult. Further, we show that learning same-different problems becomes trivial for a feedforward network that is fed with perceptually grouped stimuli. This demonstration and the comparative success of biological vision in learning visual relations suggests that feedback mechanisms such as attention, working memory and perceptual grouping may be the key components underlying human-level abstract visual reasoning.
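The contrast drawn in this abstract between rote memorization and an abstract relation can be illustrated with a toy same-different task: a lookup-table classifier succeeds only on pairs it has memorized, while the equality relation generalizes to any unseen pair. The bit-tuple stimuli below are illustrative, not the CLEVR-style images used in the study.

```python
# Illustrative contrast between rote memorization and a relational rule
# on a same-different task. A lookup-table 'model' only answers
# correctly for pairs it has memorized; the abstract relation
# (equality) generalizes to arbitrary unseen pairs.

def rote_classifier(train_pairs):
    """Memorize (a, b) -> label; unseen pairs default to 'different'."""
    memory = {(a, b): label for a, b, label in train_pairs}
    return lambda a, b: memory.get((a, b), False)

def relational_classifier(a, b):
    """The abstract same-different relation itself."""
    return a == b

train = [((0, 1), (0, 1), True), ((0, 1), (1, 0), False)]
rote = rote_classifier(train)
# The rote classifier is correct on its training pairs but misses a
# novel 'same' pair such as ((1, 1), (1, 1)); the relational rule
# handles both.
```

This mirrors the paper's finding that networks fail once stimulus variability makes rote memorization infeasible, whereas a relational strategy is unaffected.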
Affiliation(s)
- Junkyung Kim
- Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
- Matthew Ricci
- Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
- Thomas Serre
- Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
45
Abstract
Existing approaches to describe social interactions consider emotional states or use ad-hoc descriptors for microanalysis of interactions. Such descriptors are different in each context thereby limiting comparisons, and can also mix facets of meaning such as emotional states, short term tactics and long-term goals. To develop a systematic set of concepts for second-by-second social interactions, we suggest a complementary approach based on practices employed in theater. Theater uses the concept of dramatic action, the effort that one makes to change the psychological state of another. Unlike states (e.g. emotions), dramatic actions aim to change states; unlike long-term goals or motivations, dramatic actions can last seconds. We defined a set of 22 basic dramatic action verbs using a lexical approach, such as ‘to threaten’–the effort to incite fear, and ‘to encourage’–the effort to inspire hope or confidence. We developed a set of visual cartoon stimuli for these basic dramatic actions, and find that people can reliably and reproducibly assign dramatic action verbs to these stimuli. We show that each dramatic action can be carried out with different emotions, indicating that the two constructs are distinct. We characterized a principal valence axis of dramatic actions. Finally, we re-analyzed three widely-used interaction coding systems in terms of dramatic actions, to suggest that dramatic actions might serve as a common vocabulary across research contexts. This study thus operationalizes and tests dramatic action as a potentially useful concept for research on social interaction, and in particular on influence tactics.
46
Kass RE, Amari SI, Arai K, Brown EN, Diekman CO, Diesmann M, Doiron B, Eden UT, Fairhall AL, Fiddyment GM, Fukai T, Grün S, Harrison MT, Helias M, Nakahara H, Teramae JN, Thomas PJ, Reimers M, Rodu J, Rotstein HG, Shea-Brown E, Shimazaki H, Shinomoto S, Yu BM, Kramer MA. Computational Neuroscience: Mathematical and Statistical Perspectives. Annu Rev Stat Appl 2018; 5:183-214. [PMID: 30976604 PMCID: PMC6454918 DOI: 10.1146/annurev-statistics-041715-033733]
Abstract
Mathematical and statistical models have played important roles in neuroscience, especially by describing the electrical activity of neurons recorded individually, or collectively across large networks. As the field moves forward rapidly, new challenges are emerging. For maximal effectiveness, those working to advance computational neuroscience will need to appreciate and exploit the complementary strengths of mechanistic theory and the statistical paradigm.
Affiliation(s)
- Robert E Kass
- Carnegie Mellon University, Pittsburgh, PA, USA, 15213
- Shun-Ichi Amari
- RIKEN Brain Science Institute, Wako, Saitama Prefecture, Japan, 351-0198
- Emery N Brown
- Massachusetts Institute of Technology, Cambridge, MA, USA, 02139
- Harvard Medical School, Boston, MA, USA, 02115
- Markus Diesmann
- Jülich Research Centre, Jülich, Germany, 52428
- RWTH Aachen University, Aachen, Germany, 52062
- Brent Doiron
- University of Pittsburgh, Pittsburgh, PA, USA, 15260
- Uri T Eden
- Boston University, Boston, MA, USA, 02215
- Tomoki Fukai
- RIKEN Brain Science Institute, Wako, Saitama Prefecture, Japan, 351-0198
- Sonja Grün
- Jülich Research Centre, Jülich, Germany, 52428
- RWTH Aachen University, Aachen, Germany, 52062
- Moritz Helias
- Jülich Research Centre, Jülich, Germany, 52428
- RWTH Aachen University, Aachen, Germany, 52062
- Hiroyuki Nakahara
- RIKEN Brain Science Institute, Wako, Saitama Prefecture, Japan, 351-0198
- Peter J Thomas
- Case Western Reserve University, Cleveland, OH, USA, 44106
- Mark Reimers
- Michigan State University, East Lansing, MI, USA, 48824
- Jordan Rodu
- Carnegie Mellon University, Pittsburgh, PA, USA, 15213
- Hideaki Shimazaki
- Honda Research Institute Japan, Wako, Saitama Prefecture, Japan, 351-0188
- Kyoto University, Kyoto, Kyoto Prefecture, Japan, 606-8502
- Byron M Yu
- Carnegie Mellon University, Pittsburgh, PA, USA, 15213
47
Ben-Yosef G, Assif L, Ullman S. Full interpretation of minimal images. Cognition 2017; 171:65-84. [PMID: 29107889 DOI: 10.1016/j.cognition.2017.10.006]
Abstract
The goal in this work is to model the process of 'full interpretation' of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object to the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of 'minimal configurations': these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
Affiliation(s)
- Guy Ben-Yosef
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Liav Assif
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel
- Shimon Ullman
- Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot 7610001, Israel; Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
48
Karimi-Rouzbahani H, Bagheri N, Ebrahimpour R. Invariant object recognition is a personalized selection of invariant features in humans, not simply explained by hierarchical feed-forward vision models. Sci Rep 2017; 7:14402. [PMID: 29089520 PMCID: PMC5663844 DOI: 10.1038/s41598-017-13756-8] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2017] [Accepted: 09/26/2017] [Indexed: 11/20/2022] Open
Abstract
One key ability of the human brain is invariant object recognition: the rapid and accurate recognition of objects despite variations such as size, rotation and position. Despite decades of research on the topic, it remains unknown how the brain constructs invariant representations of objects. By providing brain-plausible object representations and reaching human-level recognition accuracy, hierarchical models of human vision have suggested that the brain implements similar feed-forward operations to obtain invariant representations. However, in two psychophysical object recognition experiments with systematically controlled object variations, we observed that humans relied on specific (diagnostic) object regions for accurate recognition that remained relatively consistent (invariant) across variations, whereas feed-forward feature-extraction models selected view-specific (non-invariant) features across variations. This suggests that models can develop strategies different from humans' and still reach human-level recognition performance. Moreover, individuals largely disagreed on their diagnostic features and flexibly shifted their feature-extraction strategy from view-invariant to view-specific when objects became more similar. This implies that, even in rapid object recognition, the bottom-up visual pathways do not simply extract diagnostic features in a hard-wired, feed-forward fashion; rather, they receive task-related information, possibly processed in prefrontal cortex, through top-down connections.
Affiliation(s)
- Hamid Karimi-Rouzbahani
- Department of Electrical Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran
- Cognitive Science Research lab., Department of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran
- School of Cognitive Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- Nasour Bagheri
- Department of Electrical Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran
- Reza Ebrahimpour
- Cognitive Science Research lab., Department of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran.
- School of Cognitive Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
49
Peelen MV, Downing PE. Category selectivity in human visual cortex: Beyond visual object recognition. Neuropsychologia 2017; 105:177-183. [DOI: 10.1016/j.neuropsychologia.2017.03.033] [Citation(s) in RCA: 77] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2016] [Revised: 03/29/2017] [Accepted: 03/31/2017] [Indexed: 11/16/2022]
50
Shvedchenko DO, Suvorova EI. Combination of thresholding and fitting methods for measuring nanoparticle sizes and size distributions in (S)TEM. Microsc Res Tech 2017; 80:1113-1122. [PMID: 28699651 DOI: 10.1002/jemt.22908] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2017] [Revised: 06/22/2017] [Accepted: 06/22/2017] [Indexed: 11/06/2022]
Abstract
The practical need for a simple and reliable tool for routine size analysis of nanoparticles with diameters down to a few nm embedded in a polymer matrix motivated the development of a new approach. The idea underlying the proposed method is to combine intensity thresholding and contrast-fitting procedures in the same software for particle recognition and for measuring sizes and size distributions of nanoparticles in transmission and scanning transmission electron microscopy images. Particle recognition is performed interactively by manually setting a numerical threshold level after image preprocessing. We show that fitting the calculated gray-level distribution to the real image provides maximum accuracy in measuring particle diameters, in contrast to thresholding alone. The fitting procedure is applied in the vicinity of nanoparticle images exhibiting mass-thickness, diffraction, and chemical contrast. The grayscale function associated with the nanoparticle thickness t is described by a polynomial g(t) = g0 + g1·t + g2·t² + g3·t³ + … of degree ≥ 2 with undetermined coefficients. The program for particle detection and size measurement, Analyzer of Nanoparticles (AnNa), has been written and is described here. It was successfully tested on systems containing Ag nanoparticles grown and stabilized in aqueous solutions of different polymers for biomedical use, and is available from the authors.
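The first stage of the two-stage approach this abstract describes (thresholding for detection, then a gray-level fit for refinement) can be sketched in a few lines. The code below shows only the thresholding and equivalent-disc-diameter stage on a synthetic image; the polynomial g(t) refinement and the AnNa program itself are not reproduced, and all function names here are illustrative assumptions, not the authors' software.

```python
import numpy as np
from scipy import ndimage

def equivalent_diameters(image, threshold):
    """Threshold the image and return the equivalent-disc diameter (pixels)
    of each connected bright region: A = pi*(d/2)^2  ->  d = 2*sqrt(A/pi).
    This is only the first-pass estimate; an AnNa-style refinement would
    then fit a gray-level polynomial around each particle."""
    mask = image > threshold                      # bright particles, dark matrix
    labels, n = ndimage.label(mask)               # connected-component labeling
    areas = ndimage.sum(mask, labels, index=range(1, n + 1))
    return 2.0 * np.sqrt(np.asarray(areas, dtype=float) / np.pi)

# Synthetic test image: two flat discs of radius 5 and 8 pixels
yy, xx = np.mgrid[0:64, 0:64]
img = np.zeros((64, 64))
img[(yy - 16) ** 2 + (xx - 16) ** 2 <= 5 ** 2] = 1.0
img[(yy - 44) ** 2 + (xx - 44) ** 2 <= 8 ** 2] = 1.0

d = np.sort(equivalent_diameters(img, 0.5))
print(np.round(d, 1))  # ≈ [10.2 15.8], close to the nominal 10 and 16 px
```

The small bias versus the nominal diameters (pixelation of the disc edge) is precisely the kind of error the paper's contrast-fitting stage is meant to reduce.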
Affiliation(s)
- Dmitry O Shvedchenko
- Shubnikov Institute of Crystallography, Electron Diffraction Laboratory, Russian Academy of Sciences, Leninsky pr., 59, Moscow, 119333, Russia
- Elena I Suvorova
- Shubnikov Institute of Crystallography, Electron Diffraction Laboratory, Russian Academy of Sciences, Leninsky pr., 59, Moscow, 119333, Russia