1
Depeweg S, Rothkopf CA, Jäkel F. Solving Bongard Problems With a Visual Language and Pragmatic Constraints. Cogn Sci 2024;48:e13432. PMID: 38700123. DOI: 10.1111/cogs.13432.
Abstract
More than 50 years ago, Bongard introduced 100 visual concept learning problems as a challenge for artificial vision systems. These problems are now known as Bongard problems. Although they are well known in cognitive science and artificial intelligence, little progress has been made toward building systems that can solve a substantial subset of them. In the system presented here, visual features are extracted through image processing and then translated into a symbolic visual vocabulary. We introduce a formal language for representing compositional visual concepts based on this vocabulary. Using this language and Bayesian inference, concepts can be induced from the examples that are provided in each problem. We find a reasonable agreement between the concepts with high posterior probability and the solutions formulated by Bongard himself for a subset of 35 problems. While this approach is far from solving Bongard problems like humans, it does considerably better than previous approaches. We discuss the issues we encountered while developing this system and their continuing relevance for understanding visual cognition. For instance, contrary to other concept learning problems, the examples are not random in Bongard problems; instead, they are carefully chosen to ensure that the concept can be induced, and we found it helpful to take the resulting pragmatic constraints into account.
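The Bayesian concept induction the abstract describes can be illustrated with a minimal sketch: score each candidate concept by prior times likelihood, where the "size principle" (likelihood 1/|extension| per positive example, under a strong-sampling assumption) favours the most specific concept consistent with the examples. The hypothesis names, extensions, and priors below are toy assumptions for illustration, not the paper's actual visual language:

```python
def posterior(hypotheses, priors, examples):
    """Bayesian concept scoring with the size principle: each positive
    example has likelihood 1/|extension| under a consistent hypothesis,
    so smaller concepts that still cover all examples are favoured."""
    scores = {}
    for name, extension in hypotheses.items():
        if all(e in extension for e in examples):
            scores[name] = priors[name] * (1.0 / len(extension)) ** len(examples)
        else:
            scores[name] = 0.0  # hypothesis ruled out by the data
    z = sum(scores.values())
    return {k: v / z for k, v in scores.items()} if z > 0 else scores

# Toy hypothesis space over six imaginary panel items (names are made up)
hypotheses = {
    "triangle":       {"tri1", "tri2", "tri3"},
    "closed_contour": {"tri1", "tri2", "tri3", "sq1", "sq2", "circ1"},
}
priors = {"triangle": 0.5, "closed_contour": 0.5}
post = posterior(hypotheses, priors, ["tri1", "tri2"])  # "triangle" wins: 0.8
```

With two consistent examples, the more specific "triangle" hypothesis dominates (posterior 0.8), mirroring the point that carefully chosen examples can pin down the intended concept.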
Affiliation(s)
- Constantin A Rothkopf
- Centre for Cognitive Science & Institute of Psychology, Technische Universität Darmstadt
- Frankfurt Institute for Advanced Studies, Frankfurt am Main
- Frank Jäkel
- Centre for Cognitive Science & Institute of Psychology, Technische Universität Darmstadt
2
Yildirim I, Siegel MH, Soltani AA, Ray Chaudhuri S, Tenenbaum JB. Perception of 3D shape integrates intuitive physics and analysis-by-synthesis. Nat Hum Behav 2024;8:320-335. PMID: 37996497. DOI: 10.1038/s41562-023-01759-7.
Abstract
Many surface cues support three-dimensional shape perception, but humans can sometimes still see shape when these features are missing, such as when an object is covered with a draped cloth. Here we propose a framework for three-dimensional shape perception that explains perception in both typical and atypical cases as analysis-by-synthesis, or inference in a generative model of image formation. The model integrates intuitive physics to explain how shape can be inferred from the deformations it causes to other objects, as in cloth draping. Behavioural and computational studies comparing this account with several alternatives show that it best matches human observers (total n = 174) in both accuracy and response times, and is the only model that correlates significantly with human performance on difficult discriminations. We suggest that bottom-up deep neural network models are not fully adequate accounts of human shape perception, and point to how machine vision systems might achieve more human-like robustness.
Affiliation(s)
- Ilker Yildirim
- Department of Psychology, Yale University, New Haven, CT, USA.
- Department of Statistics & Data Science, Yale University, New Haven, CT, USA.
- Wu-Tsai Institute, Yale University, New Haven, CT, USA.
- Max H Siegel
- Department of Brain & Cognitive Sciences, MIT, Cambridge, MA, USA.
- The Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA.
- Amir A Soltani
- Department of Brain & Cognitive Sciences, MIT, Cambridge, MA, USA.
- The Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA.
- Joshua B Tenenbaum
- Department of Brain & Cognitive Sciences, MIT, Cambridge, MA, USA.
- The Center for Brains, Minds, and Machines, MIT, Cambridge, MA, USA.
3
German JS, Jacobs RA. Implications of capacity-limited, generative models for human vision. Behav Brain Sci 2023;46:e391. PMID: 38054373. DOI: 10.1017/s0140525x23001772.
Abstract
Although discriminative deep neural networks are currently dominant in cognitive modeling, we suggest that capacity-limited, generative models are a promising avenue for future work. Generative models tend to learn both local and global features of stimuli and, when properly constrained, can learn componential representations and response biases found in people's behaviors.
Affiliation(s)
- Joseph Scott German
- Department of Cognitive Science, University of California, San Diego, La Jolla, CA, USA
- Robert A Jacobs
- Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA; https://www2.bcs.rochester.edu/sites/jacobslab/people.html
4
Lee MJ, DiCarlo JJ. How well do rudimentary plasticity rules predict adult visual object learning? PLoS Comput Biol 2023;19:e1011713. PMID: 38079444. PMCID: PMC10754461. DOI: 10.1371/journal.pcbi.1011713.
Abstract
A core problem in visual object learning is using a finite number of images of a new object to accurately identify that object in future, novel images. One longstanding, conceptual hypothesis asserts that this core problem is solved by adult brains through two connected mechanisms: 1) the re-representation of incoming retinal images as points in a fixed, multidimensional neural space, and 2) the optimization of linear decision boundaries in that space, via simple plasticity rules applied to a single downstream layer. Though this scheme is biologically plausible, the extent to which it explains learning behavior in humans has been unclear, in part because of a historical lack of image-computable models of the putative neural space, and in part because of a lack of measurements of human learning behaviors in difficult, naturalistic settings. Here, we addressed these gaps by 1) drawing from contemporary, image-computable models of the primate ventral visual stream to create a large set of testable learning models (n = 2,408 models), and 2) using online psychophysics to measure human learning trajectories over a varied set of tasks involving novel 3D objects (n = 371,000 trials), which we then used to develop (and publicly release) empirical benchmarks for comparing learning models to humans. We evaluated each learning model on these benchmarks, and found those based on deep, high-level representations from neural networks were surprisingly aligned with human behavior. While no tested model explained the entirety of replicable human behavior, these results establish that rudimentary plasticity rules, when combined with appropriate visual representations, have high explanatory power in predicting human behavior with respect to this core object learning problem.
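The two-mechanism scheme in the abstract (a fixed re-representation plus a simple plasticity rule on one downstream layer) can be sketched as follows. The frozen random projection is a hypothetical stand-in for the paper's image-computable ventral-stream models, and the perceptron update is just one rudimentary plasticity rule of the kind tested:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mechanism 1 -- fixed "neural space": a frozen random projection plus a
# nonlinearity (a hypothetical stand-in for a pretrained ventral-stream model).
W_fixed = rng.normal(size=(50, 2))

def represent(x):
    """Re-represent a raw input as a point in the fixed neural space."""
    return np.tanh(W_fixed @ x)

def train_readout(X, y, lr=0.1, epochs=50, dim=50):
    """Mechanism 2 -- perceptron-style plasticity on a single downstream
    layer: only the linear readout changes; the representation stays fixed."""
    w, b = np.zeros(dim), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            r = represent(xi)
            pred = 1 if r @ w + b > 0 else 0
            w += lr * (yi - pred) * r  # error-driven update
            b += lr * (yi - pred)
    return w, b

# Two toy "objects": well-separated 2D feature clusters (20 images each)
X = np.vstack([rng.normal([1.0, 1.0], 0.1, size=(20, 2)),
               rng.normal([-1.0, -1.0], 0.1, size=(20, 2))])
y = np.array([1] * 20 + [0] * 20)

w, b = train_readout(X, y)
acc = np.mean([(represent(x) @ w + b > 0) == bool(t) for x, t in zip(X, y)])
```

On this easy toy problem the linear readout separates the two objects; the paper's question is how far such a rule goes on hard, naturalistic learning tasks.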
Affiliation(s)
- Michael J. Lee
- Department of Brain and Cognitive Sciences, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds and Machines, MIT, Cambridge, Massachusetts, United States of America
- James J. DiCarlo
- Department of Brain and Cognitive Sciences, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds and Machines, MIT, Cambridge, Massachusetts, United States of America
- McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
5
Aldegheri G, Gayet S, Peelen MV. Scene context automatically drives predictions of object transformations. Cognition 2023;238:105521. PMID: 37354785. DOI: 10.1016/j.cognition.2023.105521.
Abstract
As our viewpoint changes, the whole scene around us rotates coherently. This allows us to predict how one part of a scene (e.g., an object) will change by observing other parts (e.g., the scene background). While human object perception is known to be strongly context-dependent, previous research has largely focused on how scene context can disambiguate fixed object properties, such as identity (e.g., a car is easier to recognize on a road than on a beach). It remains an open question whether object representations are updated dynamically based on the surrounding scene context, for example across changes in viewpoint. Here, we tested whether human observers dynamically and automatically predict the appearance of objects based on the orientation of the background scene. In three behavioral experiments (N = 152), we temporarily occluded objects within scenes that rotated. Upon the objects' reappearance, participants had to perform a perceptual discrimination task, which did not require taking the scene rotation into account. Performance on this orthogonal task strongly depended on whether objects reappeared rotated coherently with the surrounding scene or not. This effect persisted even when a majority of trials violated this real-world contingency between scene and object, showcasing the automaticity of these scene-based predictions. These findings indicate that contextual information plays an important role in predicting object transformations in structured real-world environments.
Affiliation(s)
- Giacomo Aldegheri
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Thomas van Aquinostraat 4, Nijmegen 6525 GD, the Netherlands; Department of Psychology, Amsterdam Brain & Cognition Center, University of Amsterdam, Nieuwe Achtergracht 129-B, Amsterdam 1018 WS, the Netherlands.
- Surya Gayet
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Thomas van Aquinostraat 4, Nijmegen 6525 GD, the Netherlands; Department of Experimental Psychology, Helmholtz Institute, Utrecht University, Heidelberglaan 1, Utrecht 3584 CS, the Netherlands
- Marius V Peelen
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Thomas van Aquinostraat 4, Nijmegen 6525 GD, the Netherlands
6
Linton P, Morgan MJ, Read JCA, Vishwanath D, Creem-Regehr SH, Domini F. New Approaches to 3D Vision. Philos Trans R Soc Lond B Biol Sci 2023;378:20210443. PMID: 36511413. PMCID: PMC9745878. DOI: 10.1098/rstb.2021.0443.
Abstract
New approaches to 3D vision are enabling new advances in artificial intelligence and autonomous vehicles, a better understanding of how animals navigate the 3D world, and new insights into human perception in virtual and augmented reality. Whilst traditional approaches to 3D vision in computer vision (SLAM: simultaneous localization and mapping), animal navigation (cognitive maps), and human vision (optimal cue integration) start from the assumption that the aim of 3D vision is to provide an accurate 3D model of the world, the new approaches to 3D vision explored in this issue challenge this assumption. Instead, they investigate the possibility that computer vision, animal navigation, and human vision can rely on partial or distorted models or no model at all. This issue also highlights the implications for artificial intelligence, autonomous vehicles, human perception in virtual and augmented reality, and the treatment of visual disorders, all of which are explored by individual articles. This article is part of a discussion meeting issue 'New approaches to 3D vision'.
Affiliation(s)
- Paul Linton
- Presidential Scholars in Society and Neuroscience, Center for Science and Society, Columbia University, New York, NY 10027, USA
- Italian Academy for Advanced Studies in America, Columbia University, New York, NY 10027, USA
- Visual Inference Lab, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY 10027, USA
- Michael J. Morgan
- Department of Optometry and Visual Sciences, City, University of London, Northampton Square, London EC1V 0HB, UK
- Jenny C. A. Read
- Biosciences Institute, Newcastle University, Newcastle upon Tyne, Tyne & Wear NE2 4HH, UK
- Dhanraj Vishwanath
- School of Psychology and Neuroscience, University of St Andrews, St Andrews, Fife KY16 9JP, UK
- Fulvio Domini
- Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI 02912-9067, USA
7
Domini F. The case against probabilistic inference: a new deterministic theory of 3D visual processing. Philos Trans R Soc Lond B Biol Sci 2023;378:20210458. PMID: 36511407. PMCID: PMC9745883. DOI: 10.1098/rstb.2021.0458.
Abstract
How the brain derives 3D information from inherently ambiguous visual input remains the fundamental question of human vision. The past two decades of research have addressed this question as a problem of probabilistic inference, the dominant model being maximum-likelihood estimation (MLE). This model assumes that independent depth-cue modules derive noisy but statistically accurate estimates of 3D scene parameters that are combined through a weighted average. Cue weights are adjusted based on the system's representation of each module's output variability. Here I demonstrate that the MLE model fails to account for important psychophysical findings and, importantly, misinterprets the just noticeable difference, a hallmark measure of stimulus discriminability, as an estimate of perceptual uncertainty. I propose a new theory, termed Intrinsic Constraint, which postulates that the visual system does not derive the most probable interpretation of the visual input, but rather, the most stable interpretation amid variations in viewing conditions. This goal is achieved with the Vector Sum model, which represents individual cue estimates as components of a multi-dimensional vector whose norm determines the combined output. This model accounts for the psychophysical findings cited in support of MLE, while predicting existing and new findings that contradict the MLE model. This article is part of a discussion meeting issue 'New approaches to 3D vision'.
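The MLE combination rule the abstract summarizes, alongside a simplified version of the proposed vector-sum alternative, can be written in a few lines. The depth values and noise levels are made-up numbers for illustration; the full Vector Sum model also specifies how the cue components are constructed, which is omitted here:

```python
import numpy as np

def mle_combined(estimates, sigmas):
    """Maximum-likelihood cue integration: a weighted average with weights
    proportional to each cue's reliability 1/sigma_i**2."""
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    w /= w.sum()
    return float(w @ np.asarray(estimates, dtype=float))

def vector_sum(estimates):
    """Simplified vector-sum combination: treat the cue estimates as
    orthogonal components and take the norm of the resulting vector."""
    return float(np.linalg.norm(estimates))

# Made-up depth estimates from two cues, stereo being the less noisy one
depth_stereo, depth_motion = 3.0, 4.0
mle = mle_combined([depth_stereo, depth_motion], sigmas=[1.0, 2.0])  # 3.2
vs = vector_sum([depth_stereo, depth_motion])                        # 5.0
```

The contrast is visible even in this toy case: the MLE output stays between the two cue estimates and shifts with the assumed noise levels, while the vector-sum output depends only on the component magnitudes.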
Affiliation(s)
- Fulvio Domini
- CLPS, Brown University, 190 Thayer Street, Providence, RI 02912-9067, USA
8
Shvadron S, Snir A, Maimon A, Yizhar O, Harel S, Poradosu K, Amedi A. Shape detection beyond the visual field using a visual-to-auditory sensory augmentation device. Front Hum Neurosci 2023;17:1058617. PMID: 36936618. PMCID: PMC10017858. DOI: 10.3389/fnhum.2023.1058617.
Abstract
Current advancements in technology and science allow us to manipulate our sensory modalities in new and unexpected ways. In the present study, we explore the potential of expanding what we perceive through our natural senses by utilizing a visual-to-auditory sensory substitution device (SSD), the EyeMusic, an algorithm that converts images to sound. The EyeMusic was initially developed to allow blind individuals to create a spatial representation of information arriving from a video feed at a slow sampling rate. Here, we aimed to use the EyeMusic to cover the areas outside the visual field of sighted individuals. In this initial proof-of-concept study, we tested the ability of sighted subjects to combine visual information with surrounding auditory sonification of visual information. Participants were tasked with recognizing and adequately placing stimuli, using sound to represent the areas outside the standard human visual field. They were asked to report shapes' identities as well as their spatial orientation (front/right/back/left), requiring combined visual (90° frontal) and auditory input (the remaining 270°) to perform the task successfully (content in both vision and audition was presented in a sweeping clockwise motion around the participant). Participants performed well above chance after a brief 1-h online training session and one on-site session averaging 20 min, and in some cases could even draw a 2D representation of the perceived image. They could also generalize, recognizing new shapes they were not explicitly trained on. Our findings provide an initial proof of concept that sensory augmentation devices and techniques can be combined with natural sensory information to expand the natural fields of sensory perception.
Affiliation(s)
- Shira Shvadron
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Adi Snir
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Amber Maimon
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Or Yizhar
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Research Group Adaptive Memory and Decision Making, Max Planck Institute for Human Development, Berlin, Germany
- Max Planck Dahlem Campus of Cognition (MPDCC), Max Planck Institute for Human Development, Berlin, Germany
- Sapir Harel
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Keinan Poradosu
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
- Weizmann Institute of Science, Rehovot, Israel
- Amir Amedi
- Baruch Ivcher School of Psychology, The Baruch Ivcher Institute for Brain, Cognition, and Technology, Reichman University, Herzliya, Israel
- The Ruth and Meir Rosenthal Brain Imaging Center, Reichman University, Herzliya, Israel
9
Bowers JS, Malhotra G, Dujmović M, Llera Montero M, Tsvetkov C, Biscione V, Puebla G, Adolfi F, Hummel JE, Heaton RF, Evans BD, Mitchell J, Blything R. Deep problems with neural network models of human vision. Behav Brain Sci 2022;46:e385. PMID: 36453586. DOI: 10.1017/s0140525x22002813.
Abstract
Deep neural networks (DNNs) have had extraordinary successes in classifying photographic images of objects and are often described as the best models of biological vision. This conclusion is largely based on three sets of findings: (1) DNNs are more accurate than any other model in classifying images taken from various datasets, (2) DNNs do the best job in predicting the pattern of human errors in classifying objects taken from various behavioral datasets, and (3) DNNs do the best job in predicting brain signals in response to images taken from various brain datasets (e.g., single cell responses or fMRI data). However, these behavioral and brain datasets do not test hypotheses regarding what features are contributing to good predictions and we show that the predictions may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition. We argue that theorists interested in developing biologically plausible models of human vision need to direct their attention to explaining psychological findings. More generally, theorists need to build models that explain the results of experiments that manipulate independent variables designed to test hypotheses rather than compete on making the best predictions. We conclude by briefly summarizing various promising modeling approaches that focus on psychological data.
Affiliation(s)
- Jeffrey S Bowers
- School of Psychological Science, University of Bristol, Bristol, UK; https://jeffbowers.blogs.bristol.ac.uk/
- Gaurav Malhotra
- School of Psychological Science, University of Bristol, Bristol, UK
- Marin Dujmović
- School of Psychological Science, University of Bristol, Bristol, UK
- Milton Llera Montero
- School of Psychological Science, University of Bristol, Bristol, UK
- Christian Tsvetkov
- School of Psychological Science, University of Bristol, Bristol, UK
- Valerio Biscione
- School of Psychological Science, University of Bristol, Bristol, UK
- Guillermo Puebla
- School of Psychological Science, University of Bristol, Bristol, UK
- Federico Adolfi
- School of Psychological Science, University of Bristol, Bristol, UK
- Ernst Strüngmann Institute (ESI) for Neuroscience in Cooperation with Max Planck Society, Frankfurt am Main, Germany
- John E Hummel
- Department of Psychology, University of Illinois Urbana-Champaign, Champaign, IL, USA
- Rachel F Heaton
- Department of Psychology, University of Illinois Urbana-Champaign, Champaign, IL, USA
- Benjamin D Evans
- Department of Informatics, School of Engineering and Informatics, University of Sussex, Brighton, UK
- Jeffrey Mitchell
- Department of Informatics, School of Engineering and Informatics, University of Sussex, Brighton, UK
- Ryan Blything
- School of Psychology, Aston University, Birmingham, UK
10
Ayzenberg V, Behrmann M. Does the brain's ventral visual pathway compute object shape? Trends Cogn Sci 2022;26:1119-1132. PMID: 36272937. DOI: 10.1016/j.tics.2022.09.019.
Abstract
A rich behavioral literature has shown that human object recognition is supported by a representation of shape that is tolerant to variations in an object's appearance. Such 'global' shape representations are achieved by describing objects via the spatial arrangement of their local features, or structure, rather than by the appearance of the features themselves. However, accumulating evidence suggests that the ventral visual pathway - the primary substrate underlying object recognition - may not represent global shape. Instead, ventral representations may be better described as a basis set of local image features. We suggest that this evidence forces a reevaluation of the role of the ventral pathway in object perception and posits a broader network for shape perception that encompasses contributions from the dorsal pathway.
Affiliation(s)
- Vladislav Ayzenberg
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Psychology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
- Marlene Behrmann
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; Psychology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA; The Department of Ophthalmology, University of Pittsburgh, Pittsburgh, PA 15260, USA.
11
Abstract
Shape is an interesting property of objects because it is used in ordinary discourse in ways that seem to have little connection to how it is typically defined in mathematics. The present article describes how the concept of shape can be grounded in Euclidean and non-Euclidean geometry and related to human perception. It considers the formal methods that have been proposed for measuring the differences among shapes and how the performance of those methods compares with shape difference thresholds of human observers. It discusses how different types of shape change can be perceptually categorized. It also evaluates the specific data structures that have been used to represent shape in models of both human and machine vision, and it reviews the psychophysical evidence about the extent to which those models are consistent with human perception. Based on this review of the literature, we argue that shape is not one thing but rather a collection of many object attributes, some of which are more perceptually salient than others. Because the relative importance of these attributes can be context dependent, there is no obvious single definition of shape that is universally applicable in all situations.
Affiliation(s)
- James T Todd
- Department of Psychology, The Ohio State University, Columbus, OH, USA
12
Biological convolutions improve DNN robustness to noise and generalisation. Neural Netw 2021;148:96-110. PMID: 35114495. DOI: 10.1016/j.neunet.2021.12.005.
Abstract
Deep Convolutional Neural Networks (DNNs) have achieved superhuman accuracy on standard image classification benchmarks. Their success has reignited significant interest in their use as models of the primate visual system, bolstered by claims of their architectural and representational similarities. However, closer scrutiny of these models suggests that they rely on various forms of shortcut learning to achieve their impressive performance, such as using texture rather than shape information. Such superficial solutions to image recognition have been shown to make DNNs brittle in the face of more challenging tests such as noise-perturbed or out-of-distribution images, casting doubt on their similarity to their biological counterparts. In the present work, we demonstrate that adding fixed biological filter banks, in particular banks of Gabor filters, helps to constrain the networks to avoid reliance on shortcuts, making them develop more structured internal representations and more tolerance to noise. Importantly, they also gained around 20-35% improved accuracy when generalising to our novel out-of-distribution test image sets over standard end-to-end trained architectures. We take these findings to suggest that these properties of the primate visual system should be incorporated into DNNs to make them more able to cope with real-world vision and better capture some of the more impressive aspects of human visual perception such as generalisation.
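The fixed Gabor filter banks the abstract describes, a sinusoidal carrier under a Gaussian envelope, can be generated as follows. Parameter values here are illustrative, not those used in the paper:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma, phase=0.0):
    """2D Gabor filter: a cosine grating at orientation theta under an
    isotropic Gaussian envelope (the classic V1 simple-cell model)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the grating
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / wavelength + phase)
    return envelope * carrier

# A small fixed bank at four orientations, of the kind that might replace
# a network's learned first-layer filters.
bank = [gabor_kernel(15, wavelength=6.0, theta=t, sigma=3.0)
        for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Because the bank is fixed rather than learned, the first layer cannot adapt toward texture shortcuts; later layers must build on these oriented, band-pass responses.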
13
Morgenstern Y, Hartmann F, Schmidt F, Tiedemann H, Prokott E, Maiello G, Fleming RW. An image-computable model of human visual shape similarity. PLoS Comput Biol 2021;17:e1008981. PMID: 34061825. PMCID: PMC8195351. DOI: 10.1371/journal.pcbi.1008981.
Abstract
Shape is a defining feature of objects, and human observers can effortlessly compare shapes to determine how similar they are. Yet, to date, no image-computable model can predict how visually similar or different shapes appear. Such a model would be an invaluable tool for neuroscientists and could provide insights into computations underlying human shape perception. To address this need, we developed a model (‘ShapeComp’), based on over 100 shape features (e.g., area, compactness, Fourier descriptors). When trained to capture the variance in a database of >25,000 animal silhouettes, ShapeComp accurately predicts human shape similarity judgments between pairs of shapes without fitting any parameters to human data. To test the model, we created carefully selected arrays of complex novel shapes using a Generative Adversarial Network trained on the animal silhouettes, which we presented to observers in a wide range of tasks. Our findings show that incorporating multiple ShapeComp dimensions facilitates the prediction of human shape similarity across a small number of shapes, and also captures much of the variance in the multiple arrangements of many shapes. ShapeComp outperforms both conventional pixel-based metrics and state-of-the-art convolutional neural networks, and can also be used to generate perceptually uniform stimulus sets, making it a powerful tool for investigating shape and object representations in the human brain. The ability to describe and compare shapes is crucial in many scientific domains from visual object recognition to computational morphology and computer graphics. Across disciplines, considerable effort has been devoted to the study of shape and its influence on object recognition, yet an important stumbling block is the quantitative characterization of shape similarity. 
Here we develop a psychophysically validated model that takes as input an object's shape boundary and provides a high-dimensional output that can be used for predicting visual shape similarity. With this precise control of shape similarity, the model's description of shape is a powerful tool that can be used across the neurosciences and artificial intelligence to test the role of shape in perception and the brain.
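One of the named shape features is easy to make concrete: compactness compares a shape's area to its perimeter via the isoperimetric ratio 4πA/P². A sketch for polygonal boundaries (an illustration of the feature, not ShapeComp's actual implementation):

```python
import numpy as np

def polygon_area_perimeter(pts):
    """Shoelace area and perimeter of a closed polygon (vertices in order)."""
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    perimeter = np.sum(np.linalg.norm(pts - np.roll(pts, -1, axis=0), axis=1))
    return area, perimeter

def compactness(pts):
    """Isoperimetric compactness 4*pi*A / P**2: equals 1 for a circle and
    decreases for elongated or irregular shapes."""
    area, perimeter = polygon_area_perimeter(pts)
    return 4 * np.pi * area / perimeter**2

# Unit square: area 1, perimeter 4 -> compactness pi/4
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
c = compactness(square)
```

Stacking on the order of a hundred such boundary descriptors, normalized over a large shape database, gives the flavor of the feature space in which ShapeComp measures similarity.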
Affiliation(s)
- Yaniv Morgenstern, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Frieder Hartmann, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Filipp Schmidt, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Henning Tiedemann, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Eugen Prokott, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Guido Maiello, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany
- Roland W. Fleming, Department of Experimental Psychology, Justus-Liebig University Giessen, Giessen, Germany; Center for Mind, Brain and Behavior (CMBB), University of Marburg and Justus Liebig University Giessen, Giessen, Germany
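To make the feature-based approach in this entry concrete, here is a minimal sketch of computing a small descriptor vector from a shape boundary and comparing two shapes by descriptor distance. This is an illustration only, not the authors' ShapeComp implementation; the three named features and the `shape_features`/`shape_distance` helpers are choices made for this example, and a real system would use many more descriptors.

```python
import numpy as np

def shape_features(boundary, n_fourier=8):
    """Compute a small descriptor vector from a closed 2D boundary (N x 2 array).

    Toy sketch of a feature-based shape model: area, perimeter, compactness,
    and low-order Fourier descriptor magnitudes.
    """
    x, y = boundary[:, 0], boundary[:, 1]
    # Shoelace formula for polygon area.
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    # Perimeter as the sum of edge lengths, including the closing edge.
    edges = np.diff(boundary, axis=0, append=boundary[:1])
    perimeter = np.sum(np.linalg.norm(edges, axis=1))
    # Compactness: 1.0 for a circle, smaller for elongated/irregular shapes.
    compactness = 4 * np.pi * area / perimeter**2
    # Magnitudes of low-order Fourier descriptors of the complex contour,
    # normalized by the first harmonic for scale invariance.
    z = (x - x.mean()) + 1j * (y - y.mean())
    coeffs = np.fft.fft(z)
    mags = np.abs(coeffs[1:n_fourier + 1])
    fourier = mags / (mags[0] + 1e-12)
    return np.concatenate([[area, perimeter, compactness], fourier])

def shape_distance(f1, f2):
    """Euclidean distance between feature vectors as a crude similarity proxy."""
    return float(np.linalg.norm(f1 - f2))
```

For a densely sampled circle, `shape_features` returns a compactness very close to 1, and the distance of a shape to itself is zero.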
14
Martolini C, Cappagli G, Campus C, Gori M. Shape Recognition With Sounds: Improvement in Sighted Individuals After Audio-Motor Training. Multisens Res 2020; 33:417-431. [PMID: 31751938 DOI: 10.1163/22134808-20191460] [Received: 03/13/2019] [Accepted: 09/23/2019] [Indexed: 11/19/2022]
Abstract
Recent studies have demonstrated that audition used to complement or substitute visual feedback is effective in conveying spatial information; e.g., sighted individuals can understand the curvature of a shape when only auditory input is provided. Recently we also demonstrated that, in the absence of vision, auditory feedback of body movements can enhance spatial perception in visually impaired adults and children. In the present study, we assessed whether sighted adults can also improve their spatial abilities related to shape recognition through audio-motor training, based on the idea that the coupling of auditory and motor information can further refine the representation of space when vision is missing. Auditory shape recognition was assessed in 22 blindfolded sighted adults with an auditory task requiring participants to identify four shapes by means of the sound conveyed through a set of consecutive loudspeakers embedded in a fixed two-dimensional vertical array. We divided participants into two groups of 11 adults each, performing a training session in two different modalities: active audio-motor training (experimental group) and passive auditory training (control group). The audio-motor training consisted of reproducing specific arm movements by relying on the sound produced by a source positioned on each participant's wrist. Results showed that sighted individuals improved the recognition of auditory shapes only after active training, suggesting that audio-motor feedback can be an effective tool to enhance spatial representation when visual information is lacking.
Affiliation(s)
- Chiara Martolini, Unit for Visually Impaired People, Istituto Italiano di Tecnologia, Genoa, Italy; DIBRIS, University of Genoa, Genoa, Italy
- Giulia Cappagli, Unit for Visually Impaired People, Istituto Italiano di Tecnologia, Genoa, Italy; IRCCS Fondazione Istituto Neurologico Nazionale C. Mondino, Pavia, Italy
- Claudio Campus, Unit for Visually Impaired People, Istituto Italiano di Tecnologia, Genoa, Italy
- Monica Gori, Unit for Visually Impaired People, Istituto Italiano di Tecnologia, Genoa, Italy
15
Yildirim I, Belledonne M, Freiwald W, Tenenbaum J. Efficient inverse graphics in biological face processing. Sci Adv 2020; 6:eaax5979. [PMID: 32181338 PMCID: PMC7056304 DOI: 10.1126/sciadv.aax5979] [Received: 04/05/2019] [Accepted: 12/11/2019] [Indexed: 05/25/2023]
Abstract
Vision not only detects and recognizes objects, but also performs rich inferences about the underlying scene structure that causes the patterns of light we see. Inverting generative models, or "analysis-by-synthesis", presents a possible solution, but its mechanistic implementations have typically been too slow for online perception, and their mapping to neural circuits remains unclear. Here we present a neurally plausible, efficient inverse graphics model and test it in the domain of face recognition. The model is based on a deep neural network that learns to invert a three-dimensional face graphics program in a single fast feedforward pass. It explains human behavior qualitatively and quantitatively, including the classic "hollow face" illusion, and it maps directly onto a specialized face-processing circuit in the primate brain. The model fits both behavioral and neural data better than state-of-the-art computer vision models, and suggests an interpretable reverse-engineering account of how the brain transforms images into percepts.
Affiliation(s)
- Ilker Yildirim, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA; Department of Psychology, Yale University, New Haven, CT, USA; Department of Statistics and Data Science, Yale University, New Haven, CT, USA; The Center for Brains, Minds and Machines, MIT, Cambridge, MA, USA
- Mario Belledonne, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA; Department of Psychology, Yale University, New Haven, CT, USA; The Center for Brains, Minds and Machines, MIT, Cambridge, MA, USA
- Winrich Freiwald, The Center for Brains, Minds and Machines, MIT, Cambridge, MA, USA; Laboratory of Neural Systems, The Rockefeller University, New York, NY, USA
- Josh Tenenbaum, Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, USA; The Center for Brains, Minds and Machines, MIT, Cambridge, MA, USA
16
German JS, Jacobs RA. Can machine learning account for human visual object shape similarity judgments? Vision Res 2020; 167:87-99. [PMID: 31972448 DOI: 10.1016/j.visres.2019.12.001] [Received: 05/28/2019] [Revised: 10/22/2019] [Accepted: 12/12/2019] [Indexed: 11/27/2022]
Abstract
We describe and analyze the performance of metric learning systems, including deep neural networks (DNNs), on a new dataset of human visual object shape similarity judgments of naturalistic, part-based objects known as "Fribbles". In contrast to previous studies which asked participants to judge similarity when objects or scenes were rendered from a single viewpoint, we rendered Fribbles from multiple viewpoints and asked participants to judge shape similarity in a viewpoint-invariant manner. Metrics trained using pixel-based or DNN-based representations fail to explain our experimental data, but a metric trained with a viewpoint-invariant, part-based representation produces a good fit. We also find that although neural networks can learn to extract the part-based representation (and therefore should be capable of learning to model our data), networks trained with a "triplet loss" function based on similarity judgments do not perform well. We analyze this failure, providing a mathematical description of the relationship between the metric learning objective function and the triplet loss function. The poor performance of neural networks appears to be due to the nonconvexity of the optimization problem in network weight space. We conclude that viewpoint insensitivity is a critical aspect of human visual shape perception, and that neural network and other machine learning methods will need to learn viewpoint-insensitive representations in order to account for people's visual object shape similarity judgments.
Affiliation(s)
- Joseph Scott German, Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, United States
- Robert A Jacobs, Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, United States
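The "triplet loss" objective analyzed in this paper can be stated compactly. The sketch below is a generic formulation over embedding vectors, not the authors' training code; the `margin` value and the squared-distance choice are common defaults assumed here.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss over embedding vectors.

    Pulls the anchor toward the positive (the item judged more similar)
    and pushes it away from the negative, up to a margin. Generic sketch,
    not the training code from the paper.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Loss is zero once the negative is at least `margin` farther than the positive.
    return max(0.0, d_pos - d_neg + margin)
```

During training, triplets are built from human similarity judgments (anchor, more-similar item, less-similar item) and the loss is minimized over network weights; the paper's analysis concerns why this objective fits similarity data poorly in practice.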
17
Abstract
Artificial vision has often been described as one of the key remaining challenges to be solved before machines can act intelligently. Recent developments in a branch of machine learning known as deep learning have catalyzed impressive gains in machine vision—giving a sense that the problem of vision is getting closer to being solved. The goal of this review is to provide a comprehensive overview of recent deep learning developments and to critically assess actual progress toward achieving human-level visual intelligence. I discuss the implications of the successes and limitations of modern machine vision algorithms for biological vision and the prospect for neuroscience to inform the design of future artificial vision systems.
Affiliation(s)
- Thomas Serre, Department of Cognitive Linguistic and Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, Rhode Island 02818, USA
18
Ayzenberg V, Lourenco SF. Skeletal descriptions of shape provide unique perceptual information for object recognition. Sci Rep 2019; 9:9359. [PMID: 31249321 PMCID: PMC6597715 DOI: 10.1038/s41598-019-45268-y] [Received: 01/23/2019] [Accepted: 05/29/2019] [Indexed: 11/17/2022]
Abstract
With seemingly little effort, humans can both identify an object across large changes in orientation and extend category membership to novel exemplars. Although researchers argue that object shape is crucial in these cases, there are open questions as to how shape is represented for object recognition. Here we tested whether the human visual system incorporates a three-dimensional skeletal descriptor of shape to determine an object's identity. Skeletal models not only provide a compact description of an object's global shape structure, but also provide a quantitative metric by which to compare the visual similarity between shapes. Our results showed that a model of skeletal similarity explained the greatest amount of variance in participants' object dissimilarity judgments when compared with other computational models of visual similarity (Experiment 1). Moreover, parametric changes to an object's skeleton led to proportional changes in perceived similarity, even when controlling for another model of structure (Experiment 2). Importantly, participants preferentially categorized objects by their skeletons across changes to local shape contours and non-accidental properties (Experiment 3). Our findings highlight the importance of skeletal structure in vision, not only as a shape descriptor, but also as a diagnostic cue of object identity.
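One way to see how a skeletal description yields a quantitative comparison metric, as this entry describes, is to reduce each skeleton to a set of sample points and compare the point sets. The sketch below uses the symmetric Hausdorff distance as a simplified stand-in; the paper's skeletal-similarity model is considerably richer than this.

```python
import numpy as np

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two 2D point sets (N x 2, M x 2).

    Simplified stand-in for skeleton-based shape comparison: two skeletons,
    each sampled as a point set, count as similar when every point of one
    lies near some point of the other.
    """
    # Pairwise distances between all points of A and all points of B.
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    # Directed distances: worst-case nearest-neighbor distance in each direction.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Identical skeletons give a distance of zero, and parametric deformations of a skeleton produce graded increases in distance, which is the kind of proportionality the paper's Experiment 2 probes perceptually.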
19
Abstract
The human visual system reliably extracts shape information from complex natural scenes in spite of noise and fragmentation caused by clutter and occlusions. A fast, feedforward sweep through ventral stream involving mechanisms tuned for orientation, curvature, and local Gestalt principles produces partial shape representations sufficient for simpler discriminative tasks. More complete shape representations may involve recurrent processes that integrate local and global cues. While feedforward discriminative deep neural network models currently produce the best predictions of object selectivity in higher areas of the object pathway, a generative model may be required to account for all aspects of shape perception. Research suggests that a successful model will account for our acute sensitivity to four key perceptual dimensions of shape: topology, symmetry, composition, and deformation.
Affiliation(s)
- James H Elder, Centre for Vision Research, York University, Toronto, Ontario M3J 1P3, Canada
20
Destler N, Singh M, Feldman J. Shape discrimination along morph-spaces. Vision Res 2019; 158:189-199. [PMID: 30878276 DOI: 10.1016/j.visres.2019.03.002] [Received: 09/14/2018] [Revised: 02/27/2019] [Accepted: 03/08/2019] [Indexed: 11/16/2022]
Abstract
We investigated the dimensions defining mental shape space, by measuring shape discrimination thresholds along "morph-spaces" defined by pairs of shapes. Given any two shapes, one can construct a morph-space by taking weighted averages of their boundary vertices (after normalization), creating a continuum of shapes ranging from the first shape to the second. Previous studies of morphs between highly familiar shape categories (e.g. truck and turkey) have shown elevated discrimination at category boundaries, reflecting a kind of "categorical perception" in shape space. Here, we use this technique to explore the underlying representation of unfamiliar shapes. Subjects were shown two shapes at nearby points along a morph-space, and asked to judge whether they were the same or different, with an adaptive procedure used to estimate discrimination thresholds at each point along the morph-space. We targeted several potentially important categorical distinctions, such as one- vs. two-part shapes, two- vs. three-part shapes, and changes in symmetry structure. Observed discrimination thresholds showed substantial and systematic deviations from uniformity at different points along each shape continuum, meaning that subjects were consistently better at discriminating at certain points along each morph-space than at others. We introduce a shape similarity measure, based on Bayesian skeletal shape representations, which gives a good account of the observed variations in shape sensitivity.
Affiliation(s)
- Nathan Destler, Department of Psychology, Center for Cognitive Science, Rutgers University, New Brunswick, NJ, United States
- Manish Singh, Department of Psychology, Center for Cognitive Science, Rutgers University, New Brunswick, NJ, United States
- Jacob Feldman, Department of Psychology, Center for Cognitive Science, Rutgers University, New Brunswick, NJ, United States
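The morph-space construction described in this abstract (weighted averages of normalized boundary vertices) can be sketched directly. This assumes the two boundaries already have the same number of vertices in corresponding order; the correspondence and resampling steps of a full pipeline are omitted, and the normalization scheme below (centering plus unit RMS radius) is one illustrative choice.

```python
import numpy as np

def normalize(boundary):
    """Center a closed boundary (N x 2) and scale it to unit RMS radius."""
    b = boundary - boundary.mean(axis=0)
    return b / np.sqrt((b ** 2).sum(axis=1).mean())

def morph(shape_a, shape_b, w):
    """Point along the morph-space between two shapes.

    w = 0 returns the normalized first shape, w = 1 the normalized second;
    intermediate weights give the continuum the study's thresholds were
    measured along. Assumes corresponding, equal-length vertex lists.
    """
    a, b = normalize(shape_a), normalize(shape_b)
    return (1 - w) * a + w * b
```

Sampling `w` on a fine grid then yields the stimulus continuum on which same-different discrimination thresholds can be measured at each point.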
21
Comparing the Visual Representations and Performance of Humans and Deep Neural Networks. Curr Dir Psychol Sci 2018. [DOI: 10.1177/0963721418801342] [Indexed: 11/15/2022]
Abstract
Although deep neural networks (DNNs) are state-of-the-art artificial intelligence systems, it is unclear what insights, if any, they provide about human intelligence. We address this issue in the domain of visual perception. After briefly describing DNNs, we provide an overview of recent results comparing human visual representations and performance with those of DNNs. In many cases, DNNs acquire visual representations and processing strategies that are very different from those used by people. We conjecture that there are at least two factors preventing them from serving as better psychological models. First, DNNs are currently trained with impoverished data, such as data lacking important visual cues to three-dimensional structure, data lacking multisensory statistical regularities, and data in which stimuli are unconnected to an observer’s actions and goals. Second, DNNs typically lack adaptations to capacity limits, such as attentional mechanisms, visual working memory, and compressed mental representations biased toward preserving task-relevant abstractions.
22
Peterson JC, Abbott JT, Griffiths TL. Evaluating (and Improving) the Correspondence Between Deep Neural Networks and Human Representations. Cogn Sci 2018; 42:2648-2669. [DOI: 10.1111/cogs.12670] [Received: 03/08/2018] [Revised: 07/11/2018] [Accepted: 07/16/2018] [Indexed: 12/20/2022]
23
Kim J, Ricci M, Serre T. Not-So-CLEVR: learning same-different relations strains feedforward neural networks. Interface Focus 2018; 8:20180011. [PMID: 29951191 DOI: 10.1098/rsfs.2018.0011] [Accepted: 05/08/2018] [Indexed: 11/12/2022]
Abstract
The advent of deep learning has recently led to great successes in various engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural network, now approach human accuracy on visual recognition tasks like image classification and face recognition. However, here we will show that feedforward neural networks struggle to learn abstract visual relations that are effortlessly recognized by non-human primates, birds, rodents and even insects. We systematically study the ability of feedforward neural networks to learn to recognize a variety of visual relations and demonstrate that same-different visual relations pose a particular strain on these networks. Networks fail to learn same-different visual relations when stimulus variability makes rote memorization difficult. Further, we show that learning same-different problems becomes trivial for a feedforward network that is fed with perceptually grouped stimuli. This demonstration and the comparative success of biological vision in learning visual relations suggests that feedback mechanisms such as attention, working memory and perceptual grouping may be the key components underlying human-level abstract visual reasoning.
Affiliation(s)
- Junkyung Kim, Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
- Matthew Ricci, Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
- Thomas Serre, Department of Cognitive, Linguistic & Psychological Sciences, Carney Institute for Brain Science, Brown University, Providence, RI 02912, USA
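A minimal generator for same-different trials illustrates the point this entry makes about stimulus variability defeating rote memorization: each trial resamples fresh binary patches, so no fixed template identifies "same". The canvas and patch sizes below are arbitrary illustrative choices, not the parameters of the tasks used in the paper.

```python
import numpy as np

def make_trial(rng, canvas=20, patch=4, same=True):
    """Generate one same-different stimulus.

    Places two binary patches on a blank canvas; the patches are identical
    when `same` is True and independently sampled otherwise. Returns
    (image, label) with label 1 for 'same'. Toy generator only, in the
    spirit of the parameterized same-different tasks discussed above.
    """
    img = np.zeros((canvas, canvas), dtype=np.uint8)
    p1 = rng.integers(0, 2, size=(patch, patch), dtype=np.uint8)
    p2 = p1.copy() if same else rng.integers(0, 2, size=(patch, patch), dtype=np.uint8)
    # Place the two patches in the left and right halves so they never overlap.
    r1, c1 = rng.integers(0, canvas - patch), rng.integers(0, canvas // 2 - patch)
    r2, c2 = rng.integers(0, canvas - patch), rng.integers(canvas // 2, canvas - patch)
    img[r1:r1 + patch, c1:c1 + patch] = p1
    img[r2:r2 + patch, c2:c2 + patch] = p2
    return img, int(same)
```

Because patch content and position vary from trial to trial, a feedforward classifier must compare the two regions rather than memorize exemplars, which is exactly the regime the paper reports as straining such networks.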