1. Schüz S, Gatt A, Zarrieß S. Rethinking symbolic and visual context in Referring Expression Generation. Front Artif Intell 2023; 6:1067125. PMID: 37026020; PMCID: PMC10072327; DOI: 10.3389/frai.2023.1067125.
Abstract
Situational context is crucial for linguistic reference to visible objects, since the same description can refer unambiguously to an object in one context but be ambiguous or misleading in others. This also applies to Referring Expression Generation (REG), where the production of identifying descriptions is always dependent on a given context. Research in REG has long represented visual domains through symbolic information about objects and their properties, to determine identifying sets of target features during content determination. In recent years, research in visual REG has turned to neural modeling and recast the REG task as an inherently multimodal problem, looking at more natural settings such as generating descriptions for objects in photographs. Characterizing the precise ways in which context influences generation is challenging in both paradigms, as context notoriously lacks precise definitions and categorization. In multimodal settings, however, these problems are further exacerbated by the increased complexity and low-level representation of perceptual inputs. The main goal of this article is to provide a systematic review of the types and functions of visual context across various approaches to REG so far, and to argue for integrating and extending the different perspectives on visual context that currently co-exist in research on REG. By analyzing the ways in which symbolic REG integrates context in rule-based approaches, we derive a set of categories of contextual integration, including the distinction between positive and negative semantic forces exerted by context during reference generation. Using this as a framework, we show that existing work in visual REG has so far considered only some of the ways in which visual context can facilitate end-to-end reference generation. Connecting with preceding research in related areas, we highlight, as possible directions for future research, additional ways in which contextual integration can be incorporated into REG and other multimodal generation tasks.
Affiliation(s)
- Simeon Schüz (correspondence)
- Faculty of Linguistics and Literary Studies, Bielefeld University, Bielefeld, Germany
- Albert Gatt
- Natural Language Processing Group, Department of Information and Computing Sciences, Utrecht University, Utrecht, Netherlands
- Sina Zarrieß
- Faculty of Linguistics and Literary Studies, Bielefeld University, Bielefeld, Germany
2. Stewart EEM, Ludwig CJH, Schütz AC. Humans represent the precision and utility of information acquired across fixations. Sci Rep 2022; 12:2411. PMID: 35165336; PMCID: PMC8844410; DOI: 10.1038/s41598-022-06357-7.
Abstract
Our environment contains an abundance of objects that humans interact with daily, gathering visual information through sequences of eye movements to choose which object is best suited for a particular task. This process is not trivial: it requires a complex strategy in which task affordance defines the search strategy, and the estimated precision of the visual information gathered from each object may be used to track perceptual confidence for object selection. This study addresses the fundamental problem of how such visual information is metacognitively represented and used for subsequent behaviour, and reveals a complex interplay between task affordance, visual information gathering, and metacognitive decision making. People fixate higher-utility objects and, most importantly, retain metaknowledge about how much information they have gathered about these objects, which is used to guide perceptual report choices. These findings suggest that such metacognitive knowledge is important in situations where decisions are based on information acquired in a temporal sequence.
Affiliation(s)
- Emma E M Stewart
- Department of Experimental Psychology, Justus-Liebig University Giessen, Otto-Behaghel-Str. 10F, 35394 Giessen, Germany
- Alexander C Schütz
- Allgemeine und Biologische Psychologie, Philipps-Universität Marburg, Marburg, Germany
- Center for Mind, Brain and Behaviour, Philipps-Universität Marburg, Marburg, Germany
3. Wang FS, Wolf J, Farshad M, Meboldt M, Lohmeyer Q. Object-Gaze Distance: Quantifying Near-Peripheral Gaze Behavior in Real-World Applications. J Eye Mov Res 2021; 14. PMID: 34122747; PMCID: PMC8189527; DOI: 10.16910/jemr.14.1.5.
Abstract
Eye tracking (ET) has been shown to reveal the wearer's cognitive processes through measurement of the central point of foveal vision. However, traditional ET evaluation methods have not been able to take into account the wearer's use of the peripheral field of vision. We propose an algorithmic enhancement to a state-of-the-art ET analysis method, the Object-Gaze Distance (OGD), which additionally allows the quantification of near-peripheral gaze behavior in complex real-world environments. The algorithm uses machine learning for area of interest (AOI) detection and computes the minimal 2D Euclidean pixel distance to the gaze point, creating a continuous gaze-based time series. In an evaluation of two AOIs in a real surgical procedure, incorporating the near-peripheral field of vision considerably increased the proportion of interpretable fixation data, from 23.8% to 78.3% for the AOI screw and from 4.5% to 67.2% for the AOI screwdriver. Additionally, the evaluation of a multi-OGD time-series representation has shown the potential to reveal novel gaze patterns, which may provide a more accurate depiction of human gaze behavior in multi-object environments.
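The distance computation at the core of the OGD measure can be sketched as follows. This is a minimal illustration, not the published implementation: the ML-based AOI detection is omitted (AOI pixels are assumed given), and the function names are our own.

```python
import math

def object_gaze_distance(gaze_xy, aoi_pixels):
    """Minimal 2D Euclidean pixel distance from a gaze point to an AOI.

    gaze_xy    -- (x, y) gaze position in pixel coordinates
    aoi_pixels -- iterable of (x, y) pixels belonging to the detected AOI
                  (produced by ML-based AOI detection in the published
                  method; here they are simply given)
    Returns 0.0 when the gaze lies on the AOI, None when the AOI is absent.
    """
    gx, gy = gaze_xy
    distances = [math.hypot(px - gx, py - gy) for px, py in aoi_pixels]
    return min(distances) if distances else None

def ogd_time_series(gaze_points, aoi_pixels_per_frame):
    """One OGD value per frame: a continuous, gaze-based time series."""
    return [object_gaze_distance(g, p)
            for g, p in zip(gaze_points, aoi_pixels_per_frame)]
```

Thresholding such a series (distance zero = foveal fixation on the AOI; small nonzero distances = near-peripheral gaze) is one plausible way to recover the fixation-classification behavior the abstract describes.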
4. Vaidyanathan P, Prud'hommeaux E, Alm CO, Pelz JB. Computational framework for fusing eye movements and spoken narratives for image annotation. J Vis 2020; 20:13. PMID: 32678878; PMCID: PMC7424957; DOI: 10.1167/jov.20.7.13.
Abstract
Despite many recent advances in the field of computer vision, there remains a disconnect between how computers process images and how humans understand them. To begin to bridge this gap, we propose a framework that integrates human-elicited gaze and spoken language to label perceptually important regions in an image. Our work relies on the notion that gaze and spoken narratives can jointly model how humans inspect and analyze images. Using an unsupervised bitext alignment algorithm originally developed for machine translation, we create meaningful mappings between participants' eye movements over an image and their spoken descriptions of that image. The resulting multimodal alignments are then used to annotate image regions with linguistic labels. The accuracy of these labels exceeds that of baseline alignments obtained using purely temporal correspondence between fixations and words. We also find differences in system performance when identifying image regions using clustering methods that rely on gaze information rather than image features. The alignments produced by our framework can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. The framework can potentially be applied to any multimodal data stream and to any visual domain. To this end, we provide the research community with access to the computational framework.
Affiliation(s)
- Cecilia O. Alm
- College of Liberal Arts, Rochester Institute of Technology, Rochester, NY, USA
- Jeff B. Pelz
- Chester F. Carlson Center for Imaging Science, Rochester Institute of Technology, Rochester, NY, USA
5. Koolen R. On Visually-Grounded Reference Production: Testing the Effects of Perceptual Grouping and 2D/3D Presentation Mode. Front Psychol 2019; 10:2247. PMID: 31632326; PMCID: PMC6781859; DOI: 10.3389/fpsyg.2019.02247.
Abstract
When referring to a target object in a visual scene, speakers are assumed to consider certain distractor objects to be more relevant than others. The current research predicts that the way in which speakers arrive at a set of relevant distractors depends on how they perceive the distance between the objects in the scene. It reports on the results of two language production experiments, in which participants referred to target objects in photo-realistic visual scenes. Experiment 1 manipulated three factors that were expected to affect perceived distractor distance: two manipulations of perceptual grouping (region of space and type similarity), and one of presentation mode (2D vs. 3D). In line with most previous research on visually-grounded reference production, an offline measure of visual attention was taken here: the occurrence of overspecification with color. The results showed effects of region of space and type similarity on overspecification, suggesting that distractors that are perceived as being in the same group as the target are more often considered relevant distractors than distractors in a different group. Experiment 2 verified this suggestion with a direct measure of visual attention, eye tracking, and added a third manipulation of grouping: color similarity. For region of space in particular, the eye movement data indeed showed patterns in the expected direction: distractors within the same region as the target were fixated more often, and longer, than distractors in a different region. Color similarity was found to affect overspecification with color, but not gaze duration or the number of distractor fixations. The expected effects of presentation mode (2D vs. 3D), however, were not convincingly borne out by the data. Taken together, these results provide direct evidence for the close link between scene perception and language production, and indicate that perceptual grouping principles can guide speakers in determining the distractor set during reference production.
Affiliation(s)
- Ruud Koolen
- Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, Netherlands
6. Nowakowska A, Clarke ADF, Sahraie A, Hunt AR. Practice-related changes in eye movement strategy in healthy adults with simulated hemianopia. Neuropsychologia 2018; 128:232-240. PMID: 29357279; DOI: 10.1016/j.neuropsychologia.2018.01.020.
Abstract
The impact of visual field deficits such as hemianopia can be mitigated by eye movements that position the visual image within the intact visual field. Effective eye movement strategies are not observed in all patients, however, and it is not known whether persistent deficits are due to injury or to pre-existing individual differences. Here we examined whether repeated exposure to a search task with rewards for good performance would lead to better eye movement strategies in healthy individuals. Participants were exposed to simulated hemianopia during a search task in five testing sessions over five consecutive days and received monetary payment for improvements in search times. With practice, most participants made saccades that went further into the blind field earlier in search, specifically under conditions where little information about the target location would be gained by inspecting the sighted field. These changes in search strategy were correlated with reduced search times. This strategy improvement also generalised to a novel task, with better performance in naming objects in a photograph under conditions of simulated hemianopia after practice with visual search compared to a control group. However, even after five days, eye movements in most participants remained far from optimal. The results demonstrate the benefits, and limitations, of practice and reward in the development of effective coping strategies for visual field deficits.
7. Helo A, Azaiez N, Rämä P. Word Processing in Scene Context: An Event-Related Potential Study in Young Children. Dev Neuropsychol 2017; 42:482-494. PMID: 29178812; DOI: 10.1080/87565641.2017.1396604.
Abstract
Semantic priming has been demonstrated in object and word contexts in toddlers. However, less is known about semantic priming in scene context. In this study, 24-month-olds with high and low vocabulary skills were presented with visual scenes (e.g., a kitchen) followed by semantically consistent (e.g., spoon) or inconsistent (e.g., bed) spoken words. Inconsistent scene-word pairs evoked a larger N400 component over the frontal areas. Low producers showed the larger N400 over the right frontal areas, while high producers showed it over the left. Our results suggest that contextual information facilitates word processing in young children. Additionally, children with different linguistic skills activate different neural structures.
Affiliation(s)
- A Helo
- Laboratoire Psychologie de la Perception, Université Paris Descartes, Paris, France; Departamento de Fonoaudiología, Universidad de Chile, Santiago, Chile
- N Azaiez
- Laboratoire Psychologie de la Perception, Université Paris Descartes, Paris, France; Department of Psychology, University of Jyväskylä, Jyväskylä, Finland
- P Rämä
- Laboratoire Psychologie de la Perception, Université Paris Descartes, Paris, France; CNRS (UMR 8242), Paris, France
8. Influence of semantic consistency and perceptual features on visual attention during scene viewing in toddlers. Infant Behav Dev 2017; 49:248-266. DOI: 10.1016/j.infbeh.2017.09.008.
9. Oliver WR. Effect of History and Context on Forensic Pathologist Interpretation of Photographs of Patterned Injury of the Skin. J Forensic Sci 2017; 62:1500-1505. DOI: 10.1111/1556-4029.13449.
10. Clarke ADF, Mahon A, Irvine A, Hunt AR. People are unable to recognize or report on their own eye movements. Q J Exp Psychol (Hove) 2016; 70:2251-2270. PMID: 27595318; DOI: 10.1080/17470218.2016.1231208.
Abstract
Eye movements bring new information into our visual system. The selection of each fixation is the result of a complex interplay of image features, task goals, and biases in motor control and perception. To what extent are we aware of the selection of saccades and their consequences? Here we use a converging methods approach to answer this question in three diverse experiments. In Experiment 1, participants were directed to find a target in a scene by a verbal description of it. We then presented the path the eyes took together with those of another participant. Participants could only identify their own path when the comparison scanpath was searching for a different target. In Experiment 2, participants viewed a scene for three seconds and then named objects from the scene. When asked whether they had looked directly at a given object, participants' responses were primarily determined by whether or not the object had been named, and not by whether it had been fixated. In Experiment 3, participants executed saccades towards single targets and then viewed a replay of either the eye movement they had just executed or that of someone else. Participants were at chance to identify their own saccade, even when it contained under- and overshoot corrections. The consistent inability to report on one's own eye movements across experiments suggests that awareness of eye movements is extremely impoverished or altogether absent. This is surprising given that information about prior eye movements is clearly used during visual search, motor error correction, and learning.
Affiliation(s)
- Aoife Mahon
- School of Psychology, University of Aberdeen, Aberdeen, UK
- Alex Irvine
- School of Psychology, University of Aberdeen, Aberdeen, UK; Oxford Centre for Human Brain Activity, Department of Psychiatry, University of Oxford, Oxford, UK
- Amelia R Hunt
- School of Psychology, University of Aberdeen, Aberdeen, UK
11. Baltaretu A, Krahmer EJ, van Wijk C, Maes A. Talking about Relations: Factors Influencing the Production of Relational Descriptions. Front Psychol 2016; 7:103. PMID: 26903911; PMCID: PMC4746286; DOI: 10.3389/fpsyg.2016.00103.
Abstract
In a production experiment (Experiment 1) and an acceptability rating experiment (Experiment 2), we assessed two factors, spatial position and salience, that may influence the production of relational descriptions (such as "the ball between the man and the drawer"). In Experiment 1, speakers were asked to refer unambiguously to a target object (a ball). In Experiment 1a, we addressed the role of spatial position, more specifically whether speakers mention the entity positioned leftmost in the scene as (first) relatum. The results showed a small preference to start with the left entity, which leaves room for other factors that could influence spatial reference. In the following studies, we therefore varied salience systematically: by making one of the relatum candidates animate (Experiment 1b), and by adding attention-capture cues, first subliminally by priming one relatum candidate with a flash (Experiment 1c), then explicitly by using salient colors for objects (Experiment 1d). Results indicate that spatial position played a dominant role. Entities on the left were mentioned more often as (first) relatum than those on the right (Experiments 1a-d). Animacy affected reference production in one out of three studies (Experiment 1d). When salience was manipulated by priming visual attention or by using salient colors, there were no significant effects (Experiments 1c, d). In the acceptability rating study (Experiment 2), participants expressed their preference for specific relata by ranking descriptions on the basis of how well they thought the descriptions fitted the scene. Results show that participants most preferred the description that had an animate entity as the first-mentioned relatum. The relevance of these results for models of reference production is discussed.
Affiliation(s)
- Adriana Baltaretu
- Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, Netherlands
- Emiel J Krahmer
- Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, Netherlands
- Carel van Wijk
- Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, Netherlands
- Alfons Maes
- Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, Netherlands
12. Koolen R, Krahmer E, Swerts M. How Distractor Objects Trigger Referential Overspecification: Testing the Effects of Visual Clutter and Distractor Distance. Cogn Sci 2015; 40:1617-1647. PMID: 26432277; DOI: 10.1111/cogs.12297.
Abstract
In two experiments, we investigate to what extent various visual saliency cues in realistic visual scenes cause speakers to overspecify their definite object descriptions with a redundant color attribute. The results of the first experiment demonstrate that speakers are more likely to redundantly mention color when visual clutter is present in a scene than when it is not. In the second experiment, we found that distractor type and distractor color affect redundant color use: speakers are most likely to overspecify if there is at least one distractor object present that has the same type, but a different color than the target referent. Reliable effects of distractor distance were not found. Taken together, our results suggest that certain visual saliency cues guide speakers in determining which objects in a visual scene are relevant distractors, and which are not. We argue that this is problematic for algorithms that aim to generate human-like descriptions of objects (such as the Incremental Algorithm), since these generally select properties that help to distinguish a target from all objects that are present in a scene.
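The Incremental Algorithm mentioned in this abstract is Dale and Reiter's classic content-determination procedure: walk a preference-ordered list of attributes and keep any attribute that rules out at least one remaining distractor, stopping once all distractors are excluded. A minimal sketch (attribute names, values, and the preference order are illustrative; the common convention of always including the type attribute is omitted):

```python
def incremental_algorithm(target, distractors, preference_order):
    """Simplified sketch of Dale & Reiter's Incremental Algorithm.

    target           -- attribute -> value dict, e.g. {"type": "ball", "color": "red"}
    distractors      -- list of such dicts for the other objects in the scene
    preference_order -- attributes tried in order, e.g. ["type", "color", "size"]
    Returns the selected attribute-value pairs as a dict.
    """
    description = {}
    remaining = list(distractors)
    for attr in preference_order:
        if attr not in target:
            continue
        # Distractors this attribute would rule out.
        ruled_out = [d for d in remaining if d.get(attr) != target[attr]]
        if ruled_out:  # the attribute discriminates, so include it
            description[attr] = target[attr]
            remaining = [d for d in remaining if d.get(attr) == target[attr]]
        if not remaining:  # description now uniquely identifies the target
            break
    return description
```

The paper's point maps directly onto the `distractors` argument: the algorithm treats every scene object as a relevant distractor, whereas the experimental results suggest humans first prune that set using visual saliency cues.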
Affiliation(s)
- Ruud Koolen
- Tilburg Center for Cognition and Communication (TiCC), School of Humanities, Tilburg University
- Emiel Krahmer
- Tilburg Center for Cognition and Communication (TiCC), School of Humanities, Tilburg University
- Marc Swerts
- Tilburg Center for Cognition and Communication (TiCC), School of Humanities, Tilburg University
13. Orquin JL, Ashby NJS, Clarke ADF. Areas of Interest as a Signal Detection Problem in Behavioral Eye-Tracking Research. J Behav Decis Mak 2015. DOI: 10.1002/bdm.1867.
Affiliation(s)
- Jacob L. Orquin
- Department of Business Administration/MAPP, Aarhus University, Aarhus, Denmark
- Nathaniel J. S. Ashby
- Department of Social and Decision Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
14. Deriving an appropriate baseline for describing fixation behaviour. Vision Res 2014; 102:41-51. PMID: 25080387; DOI: 10.1016/j.visres.2014.06.016.
Abstract
Humans display image-independent viewing biases when inspecting complex scenes. One of the strongest such biases is the central tendency in scene viewing: observers favour making fixations towards the centre of an image, irrespective of its content. Characterising these biases accurately is important for three reasons: (1) they provide a necessary baseline for quantifying the association between visual features in scenes and fixation selection; (2) they provide a benchmark for evaluating models of fixation behaviour when viewing scenes; and (3) they can be included as a component of generative models of eye guidance. In the present study we compare four commonly used approaches to describing image-independent biases and report their ability to describe observed data and correctly classify fixations across 10 eye movement datasets. We propose an anisotropic Gaussian function that can serve as an effective and appropriate baseline for describing image-independent biases, without the need to fit functions to individual datasets or subjects.
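An anisotropic Gaussian central-bias baseline of the kind this abstract proposes can be sketched as follows. This is a minimal illustration under stated assumptions: fixation coordinates are normalized to [0, 1], and the spread parameters are placeholders, not the values fitted in the paper.

```python
import math

def anisotropic_gaussian_bias(x, y, sigma_x=0.22, sigma_y=0.16):
    """Image-independent central fixation bias as an anisotropic 2D Gaussian.

    x, y    -- fixation coordinates normalized to [0, 1];
               (0.5, 0.5) is the image centre
    sigma_x -- horizontal spread, chosen larger than the vertical spread
               sigma_y (hence 'anisotropic'); both values here are
               illustrative placeholders, not fitted parameters
    Returns an unnormalized density, highest at the image centre.
    """
    return math.exp(-0.5 * (((x - 0.5) / sigma_x) ** 2
                            + ((y - 0.5) / sigma_y) ** 2))
```

Because `sigma_x > sigma_y`, the density falls off more slowly horizontally than vertically, matching the wide-screen shape of typical central-bias maps; such a function can serve as a null model when classifying fixations or as a prior weighting for saliency maps.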