1
Ding R, Ten Oever S, Martin AE. Delta-band Activity Underlies Referential Meaning Representation during Pronoun Resolution. J Cogn Neurosci 2024; 36:1472-1492. [PMID: 38652108] [DOI: 10.1162/jocn_a_02163]
Abstract
Human language offers a variety of ways to create meaning, one of which is referring to entities, objects, or events in the world. One such meaning maker is understanding to whom or to what a pronoun in a discourse refers. To understand a pronoun, the brain must access matching entities or concepts that have been encoded in memory from previous linguistic context. Models of language processing propose that internally stored linguistic concepts, accessed via exogenous cues such as the phonological input of a word, are represented as (a)synchronous activities across a population of neurons active at specific frequency bands. Converging evidence suggests that delta-band activity (1-3 Hz) is involved in temporal and representational integration during sentence processing. Moreover, recent advances in the neurobiology of memory suggest that recollection engages neural dynamics similar to those that occurred during memory encoding. Integrating these two lines of research, we tested the hypothesis that the neural dynamic patterns underlying referential meaning representation, especially in the delta frequency range, would be reinstated during pronoun resolution. By applying neural decoding techniques (i.e., representational similarity analysis) to a magnetoencephalography data set acquired during a naturalistic story-listening task, we provide evidence that delta-band activity underlies referential meaning representation. Our findings suggest that, during spoken language comprehension, endogenous linguistic representations such as referential concepts may be proactively retrieved and represented via activation of their underlying dynamic neural patterns.
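For readers unfamiliar with the decoding method named in this abstract, below is a minimal sketch of the representational similarity analysis (RSA) logic on synthetic data; the array sizes, noise level, and coreference coding are invented for illustration and do not reproduce the authors' MEG pipeline.

```python
# Minimal RSA sketch on synthetic "MEG" data: correlate a model dissimilarity
# matrix for referents with a neural dissimilarity matrix computed from
# sensor patterns, one pattern per pronoun token.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items, n_sensors = 12, 50             # pronoun tokens x MEG sensors
referent = rng.integers(0, 3, n_items)  # which entity each pronoun refers to

# Synthetic sensor patterns: tokens sharing a referent get correlated patterns
prototypes = rng.normal(size=(3, n_sensors))
neural = prototypes[referent] + 0.8 * rng.normal(size=(n_items, n_sensors))

# Model RDM: 0 if two pronouns corefer, 1 otherwise
model_rdm = pdist(referent[:, None], metric=lambda a, b: float(a[0] != b[0]))
neural_rdm = pdist(neural, metric="correlation")  # 1 - Pearson r

rho, p = spearmanr(model_rdm, neural_rdm)
print(f"model-neural RDM correlation: rho={rho:.2f}, p={p:.3f}")
```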
Affiliation(s)
- Rong Ding
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Sanne Ten Oever
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Radboud University Donders Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands
- Andrea E Martin
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Radboud University Donders Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
2
Crinnion AM, Luthra S, Gaston P, Magnuson JS. Resolving competing predictions in speech: How qualitatively different cues and cue reliability contribute to phoneme identification. Atten Percept Psychophys 2024; 86:942-961. [PMID: 38383914] [DOI: 10.3758/s13414-024-02849-y]
Abstract
Listeners have many sources of information available in interpreting speech. Numerous theoretical frameworks and paradigms have established that various constraints impact the processing of speech sounds, but it remains unclear how listeners might simultaneously consider multiple cues, especially those that differ qualitatively (i.e., with respect to timing and/or modality) or quantitatively (i.e., with respect to cue reliability). Here, we establish that cross-modal identity priming can influence the interpretation of ambiguous phonemes (Exp. 1, N = 40) and show that two qualitatively distinct cues - namely, cross-modal identity priming and auditory co-articulatory context - have additive effects on phoneme identification (Exp. 2, N = 40). However, we find no effect of quantitative variation in a cue - specifically, changes in the reliability of the priming cue did not influence phoneme identification (Exp. 3a, N = 40; Exp. 3b, N = 40). Overall, we find that qualitatively distinct cues can additively influence phoneme identification. While many existing theoretical frameworks address constraint integration to some degree, our results provide a step towards understanding how information that differs in both timing and modality is integrated in online speech perception.
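One standard way to picture the additive effect reported in Exp. 2 is a toy logistic model in which each cue contributes an independent log-odds shift toward one phoneme. The sketch below is illustrative only; the shift sizes are invented, not estimates from the study.

```python
# Toy additive cue combination: a cross-modal identity prime and an auditory
# co-articulatory context each contribute a log-odds shift toward /s/;
# under a logistic model the shifts simply add.
import math

def p_s_given_cues(base_logodds, prime_shift, coart_shift):
    """Probability of hearing /s/ given additive log-odds contributions."""
    return 1.0 / (1.0 + math.exp(-(base_logodds + prime_shift + coart_shift)))

ambiguous = 0.0          # acoustically ambiguous token: log-odds 0 (p = .5)
print(p_s_given_cues(ambiguous, 0.0, 0.0))   # no cues        -> 0.50
print(p_s_given_cues(ambiguous, 0.7, 0.0))   # prime only     -> ~0.67
print(p_s_given_cues(ambiguous, 0.0, 0.7))   # context only   -> ~0.67
print(p_s_given_cues(ambiguous, 0.7, 0.7))   # both, additive -> ~0.80
```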
Affiliation(s)
- James S Magnuson
- University of Connecticut, Storrs, CT, USA
- BCBL. Basque Center on Cognition, Brain and Language, Donostia-San Sebastián, Spain
- Ikerbasque. Basque Foundation for Science, Bilbao, Spain
3
Luthra S. Why are listeners hindered by talker variability? Psychon Bull Rev 2024; 31:104-121. [PMID: 37580454] [PMCID: PMC10864679] [DOI: 10.3758/s13423-023-02355-6]
Abstract
Though listeners readily recognize speech from a variety of talkers, accommodating talker variability comes at a cost: Myriad studies have shown that listeners are slower to recognize a spoken word when there is talker variability compared with when talker is held constant. This review focuses on two possible theoretical mechanisms for the emergence of these processing penalties. One view is that multitalker processing costs arise through a resource-demanding talker accommodation process, wherein listeners compare sensory representations against hypothesized perceptual candidates and error signals are used to adjust the acoustic-to-phonetic mapping (an active control process known as contextual tuning). An alternative proposal is that these processing costs arise because talker changes involve salient stimulus-level discontinuities that disrupt auditory attention. Some recent data suggest that multitalker processing costs may be driven by both mechanisms operating over different time scales. Fully evaluating this claim requires a foundational understanding of both talker accommodation and auditory streaming; this article provides a primer on each literature and also reviews several studies that have observed multitalker processing costs. The review closes by underscoring a need for comprehensive theories of speech perception that better integrate auditory attention and by highlighting important considerations for future research in this area.
Affiliation(s)
- Sahil Luthra
- Department of Psychology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
4
Tzeng CY, Russell ML, Nygaard LC. Attention modulates perceptual learning of non-native-accented speech. Atten Percept Psychophys 2024; 86:339-353. [PMID: 37872434] [DOI: 10.3758/s13414-023-02790-6]
Abstract
Listeners readily adapt to variation in non-native-accented speech, learning to disambiguate between talker-specific and accent-based variation. We asked (1) which linguistic and indexical features of the spoken utterance are relevant for this learning to occur and (2) whether task-driven attention to these features affects the extent to which learning generalizes to novel utterances and voices. In two experiments, listeners heard English sentences (Experiment 1) or words (Experiment 2) produced by Spanish-accented talkers during an exposure phase. Listeners' attention was directed to lexical content (transcription), indexical cues (talker identification), or both (transcription + talker identification). In Experiment 1, listeners' test transcription of novel English sentences spoken by Spanish-accented talkers showed generalized perceptual learning to previously unheard voices and utterances for all training conditions. In Experiment 2, generalized learning occurred only in the transcription + talker identification condition, suggesting that attention to both linguistic and indexical cues optimizes listeners' ability to distinguish between individual talker- and group-based variation, especially with the reduced availability of sentence-length prosodic information. Collectively, these findings highlight the role of attentional processes in the encoding of speech input and underscore the interdependency of indexical and lexical characteristics in spoken language processing.
Affiliation(s)
- Christina Y Tzeng
- Department of Psychology, San José State University, 1 Washington Sq, San José, CA 95192, USA
- Marissa L Russell
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, MA, USA
- Lynne C Nygaard
- Department of Psychology, Emory University, Atlanta, GA, USA
5
Aoki NB, Zellou G. Visual information affects adaptation to novel talkers: Ethnicity-specific and ethnicity-independent learning of L2-accented speech. J Acoust Soc Am 2023; 154:2290-2304. [PMID: 37843380] [DOI: 10.1121/10.0021289]
Abstract
Prior work demonstrates that exposure to speakers of the same accent facilitates comprehension of a novel talker with the same accent (accent-specific learning). Moreover, exposure to speakers of multiple different accents enhances understanding of a talker with a novel accent (accent-independent learning). Although bottom-up acoustic information about accent constrains adaptation to novel talkers, the effect of top-down social information remains unclear. The current study examined effects of apparent ethnicity on adaptation to novel L2-accented ("non-native") talkers while keeping bottom-up information constant. Native English listeners transcribed sentences in noise for three Mandarin-accented English speakers and then a fourth (novel) Mandarin-accented English speaker. Transcription accuracy for the novel talker improves when all speakers are presented with East Asian faces (ethnicity-specific learning), and when the exposure speakers are paired with different, non-East Asian ethnicities and the novel talker has an East Asian face (ethnicity-independent learning). However, accuracy does not improve when all speakers have White faces or when the exposure speakers have White faces and the test talker has an East Asian face. This study demonstrates that apparent ethnicity affects adaptation to novel L2-accented talkers, thus underscoring the importance of social expectations in perceptual learning and cross-talker generalization.
Affiliation(s)
- Nicholas B Aoki
- Department of Linguistics, University of California Davis, Davis, California 95616, USA
- Georgia Zellou
- Department of Linguistics, University of California Davis, Davis, California 95616, USA
6
Xie X, Jaeger TF, Kurumada C. What we do (not) know about the mechanisms underlying adaptive speech perception: A computational framework and review. Cortex 2023; 166:377-424. [PMID: 37506665] [DOI: 10.1016/j.cortex.2023.05.003]
Abstract
Speech from unfamiliar talkers can be difficult to comprehend initially. These difficulties tend to dissipate with exposure, sometimes within minutes or less. Adaptivity in response to unfamiliar input is now considered a fundamental property of speech perception, and research over the past two decades has made substantial progress in identifying its characteristics. The mechanisms underlying adaptive speech perception, however, remain unknown. Past work has attributed facilitatory effects of exposure to any one of three qualitatively different hypothesized mechanisms: (1) low-level, pre-linguistic signal normalization, (2) changes in/selection of linguistic representations, or (3) changes in post-perceptual decision-making. Direct comparisons of these hypotheses, or combinations thereof, have been lacking. We describe a general computational framework for adaptive speech perception (ASP) that, for the first time, implements all three mechanisms. We demonstrate how the framework can be used to derive predictions for experiments on perception from the acoustic properties of the stimuli. Using this approach, we find that, at the level of data analysis presently employed by most studies in the field, the signature results of influential experimental paradigms do not distinguish between the three mechanisms. This highlights the need for a change in research practices, so that future experiments provide more informative results. We recommend specific changes to experimental paradigms and data analysis. All data and code for this study are shared via OSF, including the R markdown document that this article is generated from, and an R library that implements the models we present.
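The authors' implementation is in R and shared via OSF. Purely as a schematic (not their code), the sketch below shows where the three hypothesized mechanisms would enter a simple cue-to-category model; all parameter values are invented.

```python
# Schematic of the three hypothesized loci of adaptation in a cue-to-category
# pipeline: (1) pre-linguistic normalization of the cue, (2) changes to the
# category (likelihood) representations, (3) a post-perceptual decision stage.
from scipy.stats import norm

def categorize(cue, shift=0.0, cat_means=(0.0, 2.0), cat_sd=1.0,
               prior_b=0.5, decision_bias=0.0):
    cue = cue - shift                             # (1) normalization
    like_b = norm.pdf(cue, cat_means[0], cat_sd)  # (2) category representations
    like_p = norm.pdf(cue, cat_means[1], cat_sd)
    post_b = like_b * prior_b / (like_b * prior_b + like_p * (1 - prior_b))
    # (3) decision stage, here a crude additive response bias (a real model
    # would keep the result inside [0, 1])
    return post_b + decision_bias

x = 1.0  # an ambiguous cue value (e.g., VOT, arbitrary units)
print(categorize(x))                          # baseline: p(/b/) = 0.5
print(categorize(x, shift=0.5))               # normalization shifts the cue
print(categorize(x, cat_means=(0.5, 2.0)))    # representation change
print(categorize(x, decision_bias=0.1))       # decision-stage bias
```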
Affiliation(s)
- Xin Xie
- Language Science, University of California, Irvine, USA
- T Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
- Computer Science, University of Rochester, Rochester, NY, USA
- Chigusa Kurumada
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
7
Persson A, Jaeger TF. Evaluating normalization accounts against the dense vowel space of Central Swedish. Front Psychol 2023; 14:1165742. [PMID: 37416548] [PMCID: PMC10322199] [DOI: 10.3389/fpsyg.2023.1165742]
Abstract
Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both accounts specific to vowel perception and general-purpose accounts that can be applied to any type of cue. We add to the cross-linguistic literature on this matter by comparing normalization accounts against a new phonetically annotated vowel database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We evaluate normalization accounts on how they differ in predicted consequences for perception. The results indicate that the best-performing accounts either center or standardize formants by talker. The study also suggests that general-purpose accounts perform as well as vowel-specific accounts, and that vowel normalization operates in both temporal and spectral domains.
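As an illustration of the best-performing class of accounts (centering or standardizing formants by talker), here is a minimal Lobanov-style z-scoring sketch. The F1 values are invented; this is not the paper's evaluation code.

```python
# By-talker formant standardization (Lobanov-style z-scoring): each talker's
# formants are centered on that talker's mean and scaled by their SD, which
# removes much of the between-talker shift in raw Hz.
import numpy as np

def lobanov(formants_hz):
    """z-score one talker's formant values (array of shape [n_tokens])."""
    f = np.asarray(formants_hz, dtype=float)
    return (f - f.mean()) / f.std(ddof=1)

# Invented F1 values (Hz) for the "same" vowels from two talkers whose vocal
# tracts differ: the raw values barely overlap, the normalized ones align.
talker_a_f1 = [300, 450, 700, 820]
talker_b_f1 = [390, 585, 910, 1066]   # a scaled version of talker A
print(np.round(lobanov(talker_a_f1), 2))
print(np.round(lobanov(talker_b_f1), 2))   # identical z-scores
```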
Affiliation(s)
- Anna Persson
- Department of Swedish Language and Multilingualism, Stockholm University, Stockholm, Sweden
- T. Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, United States
- Computer Science, University of Rochester, Rochester, NY, United States
8
Luthra S, Mechtenberg H, Giorio C, Theodore RM, Magnuson JS, Myers EB. Using TMS to evaluate a causal role for right posterior temporal cortex in talker-specific phonetic processing. Brain Lang 2023; 240:105264. [PMID: 37087863] [PMCID: PMC10286152] [DOI: 10.1016/j.bandl.2023.105264]
Abstract
Theories suggest that speech perception is informed by listeners' beliefs of what phonetic variation is typical of a talker. A previous fMRI study found right middle temporal gyrus (RMTG) sensitivity to whether a phonetic variant was typical of a talker, consistent with literature suggesting that the right hemisphere may play a key role in conditioning phonetic identity on talker information. The current work used transcranial magnetic stimulation (TMS) to test whether the RMTG plays a causal role in processing talker-specific phonetic variation. Listeners were exposed to talkers who differed in how they produced voiceless stop consonants while TMS was applied to RMTG, left MTG, or scalp vertex. Listeners subsequently showed near-ceiling performance in indicating which of two variants was typical of a trained talker, regardless of previous stimulation site. Thus, even though the RMTG is recruited for talker-specific phonetic processing, modulation of its function may have only modest consequences.
Affiliation(s)
- James S Magnuson
- University of Connecticut, United States
- BCBL. Basque Center on Cognition Brain and Language, Donostia-San Sebastián, Spain
- Ikerbasque, Basque Foundation for Science, Bilbao, Spain
9
Luthra S, Magnuson JS, Myers EB. Right Posterior Temporal Cortex Supports Integration of Phonetic and Talker Information. Neurobiol Lang (Camb) 2023; 4:145-177. [PMID: 37229142] [PMCID: PMC10205075] [DOI: 10.1162/nol_a_00091]
Abstract
Though the right hemisphere has been implicated in talker processing, it is thought to play a minimal role in phonetic processing, at least relative to the left hemisphere. Recent evidence suggests that the right posterior temporal cortex may support learning of phonetic variation associated with a specific talker. In the current study, listeners heard a male talker and a female talker, one of whom produced an ambiguous fricative in /s/-biased lexical contexts (e.g., epi?ode) and one who produced it in /∫/-biased contexts (e.g., friend?ip). Listeners in a behavioral experiment (Experiment 1) showed evidence of lexically guided perceptual learning, categorizing ambiguous fricatives in line with their previous experience. Listeners in an fMRI experiment (Experiment 2) showed differential phonetic categorization as a function of talker, allowing for an investigation of the neural basis of talker-specific phonetic processing, though they did not exhibit perceptual learning (likely due to characteristics of our in-scanner headphones). Searchlight analyses revealed that the patterns of activation in the right superior temporal sulcus (STS) contained information about who was talking and what phoneme they produced. We take this as evidence that talker information and phonetic information are integrated in the right STS. Functional connectivity analyses suggested that the process of conditioning phonetic identity on talker information depends on the coordinated activity of a left-lateralized phonetic processing system and a right-lateralized talker processing system. Overall, these results clarify the mechanisms through which the right hemisphere supports talker-specific phonetic processing.
Affiliation(s)
- Sahil Luthra
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- James S. Magnuson
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- Basque Center on Cognition Brain and Language (BCBL), Donostia-San Sebastián, Spain
- Ikerbasque, Basque Foundation for Science, Bilbao, Spain
- Emily B. Myers
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, CT, USA
10
Novotny M, Cmejla R, Tykalova T. Automated prediction of children's age from voice acoustics. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104490]
11
Hearing is believing: Lexically guided perceptual learning is graded to reflect the quantity of evidence in speech input. Cognition 2023; 235:105404. [PMID: 36812836] [DOI: 10.1016/j.cognition.2023.105404]
Abstract
There is wide variability in the acoustic patterns that are produced for a given linguistic message, including variability that is conditioned on who is speaking. Listeners solve this lack of invariance problem, at least in part, by dynamically modifying the mapping to speech sounds in response to structured variation in the input. Here we test a primary tenet of the ideal adapter framework of speech adaptation, which posits that perceptual learning reflects the incremental updating of cue-sound mappings to incorporate observed evidence with prior beliefs. Our investigation draws on the influential lexically guided perceptual learning paradigm. During an exposure phase, listeners heard a talker who produced fricative energy ambiguous between /ʃ/ and /s/. Lexical context differentially biased interpretation of the ambiguity as either /s/ or /ʃ/, and, across two behavioral experiments (n = 500), we manipulated the quantity of evidence and the consistency of evidence that was provided during exposure. Following exposure, listeners categorized tokens from an ashi-asi continuum to assess learning. The ideal adapter framework was formalized through computational simulations, which predicted that learning would be graded to reflect the quantity, but not the consistency, of the exposure input. These predictions were upheld in human listeners; the magnitude of the learning effect monotonically increased given exposure to four, 10, or 20 critical productions, and there was no evidence that learning differed given consistent versus inconsistent exposure. These results (1) provide support for a primary tenet of the ideal adapter framework, (2) establish quantity of evidence as a key determinant of adaptation in human listeners, and (3) provide critical evidence that lexically guided perceptual learning is not a binary outcome. In doing so, the current work provides foundational knowledge to support theoretical advances that consider perceptual learning as a graded outcome that is tightly linked to input statistics in the speech stream.
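The graded-by-quantity prediction can be illustrated with a conjugate normal belief update, a common formalization of the ideal adapter idea. The prior, noise variance, and token values below are invented, and this is a sketch rather than the authors' simulation code.

```python
# Ideal-adapter-style belief updating (sketch): the posterior over a talker's
# /s/ cue mean depends on how much evidence arrived, not on the order it
# arrived in, matching "quantity, not consistency".
import numpy as np

def posterior_mean(prior_mu, prior_var, obs, obs_var=1.0):
    """Conjugate normal update of a category-mean belief given observations."""
    obs = np.asarray(obs, dtype=float)
    n = len(obs)
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    return post_var * (prior_mu / prior_var + obs.sum() / obs_var)

prior_mu, prior_var = 0.0, 1.0        # prior belief about the /s/ cue mean
evidence = [2.0] * 10                 # 10 shifted (ambiguous) productions
print(posterior_mean(prior_mu, prior_var, evidence[:4]))   # less evidence
print(posterior_mean(prior_mu, prior_var, evidence))       # more evidence
# Reordering the same tokens cannot change the sufficient statistics (n, sum),
# so the posterior is identical regardless of exposure order.
print(posterior_mean(prior_mu, prior_var, list(reversed(evidence))))
```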
12
Kapadia AM, Tin JAA, Perrachione TK. Multiple sources of acoustic variation affect speech processing efficiency. J Acoust Soc Am 2023; 153:209. [PMID: 36732274] [PMCID: PMC9836727] [DOI: 10.1121/10.0016611]
Abstract
Phonetic variability across talkers imposes additional processing costs during speech perception, evident in performance decrements when listening to speech from multiple talkers. However, within-talker phonetic variation is a less well-understood source of variability in speech, and it is unknown how processing costs from within-talker variation compare to those from between-talker variation. Here, listeners performed a speeded word identification task in which three dimensions of variability were factorially manipulated: between-talker variability (single vs multiple talkers), within-talker variability (single vs multiple acoustically distinct recordings per word), and word-choice variability (two- vs six-word choices). All three sources of variability led to reduced speech processing efficiency. Between-talker variability affected both word-identification accuracy and response time, but within-talker variability affected only response time. Furthermore, between-talker variability, but not within-talker variability, had a greater impact when the target phonological contrasts were more similar. Together, these results suggest that natural between- and within-talker variability reflect two distinct magnitudes of common acoustic-phonetic variability: Both affect speech processing efficiency, but they appear to have qualitatively and quantitatively unique effects due to differences in their potential to obscure acoustic-phonemic correspondences across utterances.
Affiliation(s)
- Alexandra M Kapadia
- Department of Speech, Language, and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Jessica A A Tin
- Department of Speech, Language, and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Tyler K Perrachione
- Department of Speech, Language, and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
13
Perceptual learning of multiple talkers: Determinants, characteristics, and limitations. Atten Percept Psychophys 2022; 84:2335-2359. [PMID: 36076119] [DOI: 10.3758/s13414-022-02556-6]
Abstract
Research suggests that listeners simultaneously update talker-specific generative models to reflect structured phonetic variation. Because past investigations exposed listeners to talkers of different genders, it is unknown whether adaptation is talker specific or rather linked to a broader sociophonetic class. Here, we test determinants of listeners' ability to update and apply talker-specific models for speech perception. In six experiments (n = 480), listeners were first exposed to the speech of two talkers who produced ambiguous fricative energy. The talkers' speech was interleaved during exposure, and lexical context differentially biased interpretation of the ambiguity as either /s/ or /ʃ/ for each talker. At test, listeners categorized tokens from ashi-asi continua, one for each talker. Across conditions and experiments, we manipulated exposure quantity, talker gender, blocked versus interleaved talker structure at test, and the degree to which fricative acoustics differed between talkers. When test was blocked by talker, learning was observed for different but not same gender talkers. When talkers were interleaved at test, learning was observed for both different and same gender talkers, which was attenuated when fricative acoustics were constant across talkers. There was no strong evidence to suggest that adaptation to multiple talkers required increased quantity of exposure beyond that required to adapt to a single talker. These results suggest that perceptual learning for speech is achieved via a mechanism that represents a context-dependent, cumulative integration of experience with speech input and identify critical constraints on listeners' ability to dynamically apply multiple generative models in mixed talker listening environments.
14
Ip MHK, Cutler A. In Search of Salience: Focus Detection in the Speech of Different Talkers. Lang Speech 2022; 65:650-680. [PMID: 34841933] [DOI: 10.1177/00238309211046029]
Abstract
Many different prosodic cues can help listeners predict upcoming speech. However, no research to date has assessed listeners' processing of preceding prosody from different speakers. The present experiments examine (1) whether individual speakers (of the same language variety) are likely to vary in their production of preceding prosody; (2) to the extent that there is talker variability, whether listeners are flexible enough to use any prosodic cues signaled by the individual speaker; and (3) whether types of prosodic cues (e.g., F0 versus duration) vary in informativeness. Using a phoneme-detection task, we examined whether listeners can entrain to different combinations of preceding prosodic cues to predict where focus will fall in an utterance. We used unsynthesized sentences recorded by four female native speakers of Australian English who happened to have used different preceding cues to produce sentences with prosodic focus: a combination of pre-focus overall duration cues, F0 and intensity (mean, maximum, range), and longer pre-target interval before the focused word onset (Speaker 1), only mean F0 cues, mean and maximum intensity, and longer pre-target interval (Speaker 2), only pre-target interval duration (Speaker 3), and only pre-focus overall duration and maximum intensity (Speaker 4). Listeners could entrain to almost every speaker's cues (the exception being Speaker 4's use of only pre-focus overall duration and maximum intensity), and could use whatever cues were available even when one of the cue sources was rendered uninformative. Our findings demonstrate both speaker variability and listener flexibility in the processing of prosodic focus.
Affiliation(s)
- Martin Ho Kwan Ip
- The MARCS Institute, Western Sydney University, Australia
- ARC Centre of Excellence for the Dynamics of Language, Australia
- Anne Cutler
- The MARCS Institute, Western Sydney University, Australia
- ARC Centre of Excellence for the Dynamics of Language, Australia
15
Si C, Zhang C, Lau P, Yang Y, Li B. Modelling representations in speech normalization of prosodic cues. Sci Rep 2022; 12:14635. [PMID: 36030274] [PMCID: PMC9420126] [DOI: 10.1038/s41598-022-18838-w]
Abstract
The lack of invariance problem in speech perception refers to a fundamental problem of how listeners deal with differences in speech sounds produced by various speakers. The current study is the first to test the contributions of mentally stored distributional information in normalization of prosodic cues. This study starts out by modelling distributions of acoustic cues from a speech corpus. We proceeded to conduct three experiments using both naturally produced lexical tones with estimated distributions and manipulated lexical tones with f0 values generated from simulated distributions. State-of-the-art statistical techniques were used to examine the effects of distribution parameters in normalization and identification curves with respect to each parameter. Based on the significant effects of distribution parameters, we proposed a probabilistic parametric representation (PPR), integrating knowledge from previously established distributions of speakers with their indexical information. PPR is still accessed during speech perception even when contextual information is present. We also discuss the procedure of normalization of speech signals produced by an unfamiliar talker with and without contexts, and the access of long-term stored representations.
Affiliation(s)
- Chen Si
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Hong Kong Polytechnic University-Peking University Research Centre on Chinese Linguistics, Kowloon, Hong Kong SAR, China
- Research Centre for Language, Cognition, and Neuroscience, University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
- Caicai Zhang
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
- Hong Kong Polytechnic University-Peking University Research Centre on Chinese Linguistics, Kowloon, Hong Kong SAR, China
- Research Centre for Language, Cognition, and Neuroscience, University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
- Puiyin Lau
- Department of Statistics and Actuarial Science, University of Hong Kong, Pok Fu Lam, Hong Kong SAR, China
- Yike Yang
- Department of Chinese Language and Literature, Hong Kong Shue Yan University, North Point, Hong Kong SAR, China
- Bei Li
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China
16
Nenadić F, Tucker BV, Ten Bosch L. Computational Modeling of an Auditory Lexical Decision Experiment Using DIANA. Lang Speech 2022:238309221111752. [PMCID: PMC10394956] [DOI: 10.1177/00238309221111752]
Abstract
We present an implementation of DIANA, a computational model of spoken word recognition, to model responses collected in the Massive Auditory Lexical Decision (MALD) project. DIANA is an end-to-end model, including an activation and decision component that takes the acoustic signal as input, activates internal word representations, and outputs lexicality judgments and estimated response latencies. Simulation 1 presents the process of creating acoustic models required by DIANA to analyze novel speech input. Simulation 2 investigates DIANA's performance in determining whether the input signal is a word present in the lexicon or a pseudoword. In Simulation 3, we generate estimates of response latency and correlate them with general tendencies in participant responses in MALD data. We find that DIANA performs fairly well in free word recognition and lexical decision. However, the current approach for estimating response latency provides estimates opposite to those found in behavioral data. We discuss these findings and offer suggestions as to what a contemporary model of spoken word recognition should be able to do.
Affiliation(s)
- Filip Nenadić
- University of Alberta, Canada
- Singidunum University, Serbia
17
Murthy SK, Griffiths TL, Hawkins RD. Shades of confusion: Lexical uncertainty modulates ad hoc coordination in an interactive communication task. Cognition 2022; 225:105152. [DOI: 10.1016/j.cognition.2022.105152]
18
Social Priming in Speech Perception: Revisiting Kangaroo/Kiwi Priming in New Zealand English. Brain Sci 2022; 12:684. [PMID: 35741570] [PMCID: PMC9221372] [DOI: 10.3390/brainsci12060684]
Abstract
We investigate whether regionally-associated primes can affect speech perception in two lexical decision tasks in which New Zealand listeners were exposed to an Australian prime (a kangaroo), a New Zealand prime (a kiwi), and/or a control animal (a horse). The target stimuli involve ambiguous vowels, embedded in a frame that would result in a real word with a KIT or a DRESS vowel and a nonsense word with the alternative vowel; thus, lexical decision responses can reveal which vowel was heard. Our pre-registered design predicted that exposure to the kangaroo would elicit more KIT-consistent responses than exposure to the kiwi. Both experiments showed significant priming effects in which the kangaroo elicited more KIT-consistent responses than the kiwi. The particular locus and details of these effects differed across experiments and participants. Taken together, the experiments reinforce the finding that regionally-associated primes can affect speech perception, but also suggest that the effects are sensitive to experimental design, stimulus acoustics, and individuals’ production and past experience.
19
Heffner CC, Fuhrmeister P, Luthra S, Mechtenberg H, Saltzman D, Myers EB. Reliability and validity for perceptual flexibility in speech. Brain Lang 2022; 226:105070. [PMID: 35026449] [DOI: 10.1016/j.bandl.2021.105070]
Abstract
The study of perceptual flexibility in speech depends on a variety of tasks that feature a large degree of variability between participants. Of critical interest is whether measures are consistent within an individual or across stimulus contexts. This is particularly key for individual difference designs that are deployed to examine the neural basis or clinical consequences of perceptual flexibility. In the present set of experiments, we assess the split-half reliability and construct validity of five measures of perceptual flexibility: three of learning in a native language context (e.g., understanding someone with a foreign accent) and two of learning in a non-native context (e.g., learning to categorize non-native speech sounds). We find that most of these tasks show an appreciable level of split-half reliability, although construct validity was sometimes weak. This provides good evidence for reliability for these tasks, while highlighting possible upper limits on expected effect sizes involving each measure.
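For concreteness, split-half reliability is typically computed as in the sketch below: correlate scores from odd and even trials, then apply the Spearman-Brown correction for test length. The simulated data are illustrative only, not the study's measures.

```python
# Split-half reliability sketch: simulate per-subject trial data, correlate
# odd- and even-trial scores, and apply the Spearman-Brown correction.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_subj, n_trials = 60, 80
ability = rng.normal(size=n_subj)            # latent per-subject skill
trials = ability[:, None] + rng.normal(scale=2.0, size=(n_subj, n_trials))

odd_scores = trials[:, 0::2].mean(axis=1)
even_scores = trials[:, 1::2].mean(axis=1)
r, _ = pearsonr(odd_scores, even_scores)
spearman_brown = 2 * r / (1 + r)    # reliability of the full-length measure
print(f"split-half r = {r:.2f}, Spearman-Brown = {spearman_brown:.2f}")
```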
Affiliation(s)
- Christopher C Heffner
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, CT 06269, United States
- Institute for the Brain and Cognitive Sciences, University of Connecticut, Storrs, CT 06269, United States
- Department of Communicative Disorders and Sciences, University at Buffalo, Buffalo, NY 14214, United States
- Pamela Fuhrmeister
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, CT 06269, United States
- Department of Linguistics, University of Potsdam, 11476 Potsdam, Germany
- Sahil Luthra
- Institute for the Brain and Cognitive Sciences, University of Connecticut, Storrs, CT 06269, United States
- Department of Psychological Sciences, University of Connecticut, Storrs, CT 06269, United States
- Hannah Mechtenberg
- Department of Psychological Sciences, University of Connecticut, Storrs, CT 06269, United States
- David Saltzman
- Department of Psychological Sciences, University of Connecticut, Storrs, CT 06269, United States
- Emily B Myers
- Department of Speech, Language, and Hearing Sciences, University of Connecticut, Storrs, CT 06269, United States
- Institute for the Brain and Cognitive Sciences, University of Connecticut, Storrs, CT 06269, United States
- Department of Psychological Sciences, University of Connecticut, Storrs, CT 06269, United States
20
Woodard K, Plate RC, Pollak SD. Children track probabilistic distributions of facial cues across individuals. J Exp Psychol Gen 2022; 151:506-511. [PMID: 34570561] [PMCID: PMC8923917] [DOI: 10.1037/xge0001087]
Abstract
Children face a difficult task in learning how to reason about other people's emotions. How intensely facial configurations are displayed can vary not only according to what and how much emotion people are experiencing, but also across individuals based on differences in personality, gender, and culture. To navigate these sources of variability, children may use statistical information about others' facial cues to make interpretations about perceived emotions in others. We examined this possibility by testing children's ability to adjust to differences in the intensity of facial cues across different individuals. In the present study, children (6- to 10-year-olds) categorized the information communicated by facial configurations of emotion varying continuously from "calm" to "upset," with differences in the intensity of each actor's facial movements. We found that children's threshold for categorizing a facial configuration as "upset" shifted depending on the statistical information encountered about each of the different individuals. These results suggest that children are able to track individual differences in facial behavior and use these differences to flexibly update their interpretations of facial cues associated with emotion.
21
Zhang X, Cheng B, Zhang Y. The Role of Talker Variability in Nonnative Phonetic Learning: A Systematic Review and Meta-Analysis. J Speech Lang Hear Res 2021; 64:4802-4825. [DOI: 10.1044/2021_jslhr-21-00181]
Abstract
PURPOSE: High-variability phonetic training (HVPT) has been found to be effective for adult second language (L2) learning, but results are mixed with regard to the benefit of multiple talkers over a single talker. This study provides a systematic review with meta-analysis to investigate the talker variability effect in nonnative phonetic learning and the factors moderating the effect.
METHOD: We collected studies with keyword search in major academic databases, including EBSCO, ERIC, MEDLINE, ProQuest Dissertations & Theses, Elsevier, Scopus, Wiley Online Library, and Web of Science. We identified potential participant-, training-, and study-related moderators and conducted a random-effects model meta-analysis for each individual variable.
RESULTS: On the basis of 18 studies with a total of 549 participants, we obtained a small summary effect size (Hedges' g = 0.46, 95% confidence interval [CI: 0.08, 0.84]) for the immediate training outcomes, which was greatly reduced (g = -0.04, 95% CI [-0.46, 0.37]) after removal of outliers and correction for publication bias, whereas the effect size for immediate perceptual gains was nearly medium (g = 0.56, 95% CI [0.13, 1.00]) compared with the nonsignificant production gains. Critically, the summary effect sizes for generalization to new talkers (g = 0.72, 95% CI [0.15, 1.29]) and for long-term retention (g = 1.09, 95% CI [0.39, 1.78]) were large. Moreover, the training program length and the talker presentation format were found to potentially moderate the immediate perceptual gains and generalization outcomes.
CONCLUSIONS: Our study presents the first meta-analysis on the role of talker variability in nonnative phonetic training, which demonstrates the heterogeneity and limitations of research on this topic. The results highlight the need for further investigation of the influential factors and underlying mechanisms for the presence or absence of talker variability effects. Supplemental Material: https://doi.org/10.23641/asha.16959388
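Random-effects summary effects of the kind reported above are commonly computed with a DerSimonian-Laird estimator. The sketch below uses invented per-study effect sizes and variances, not the 18 studies analyzed in the article.

```python
# DerSimonian-Laird random-effects meta-analysis (sketch) from per-study
# Hedges' g values and their variances.
import numpy as np

g = np.array([0.9, 0.2, 0.6, -0.1, 0.5])      # per-study effect sizes
v = np.array([0.05, 0.08, 0.04, 0.10, 0.06])  # per-study variances

w = 1.0 / v                                   # fixed-effect weights
q = np.sum(w * (g - np.sum(w * g) / np.sum(w)) ** 2)  # heterogeneity Q
df = len(g) - 1
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - df) / c)                 # between-study variance

w_re = 1.0 / (v + tau2)                       # random-effects weights
g_hat = np.sum(w_re * g) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
print(f"g = {g_hat:.2f}, 95% CI [{g_hat - 1.96*se:.2f}, {g_hat + 1.96*se:.2f}]")
```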
Affiliation(s)
- Xiaojuan Zhang
- English Department & Language and Cognitive Neuroscience Lab, School of Foreign Studies, Xi'an Jiaotong University, China
- Bing Cheng
- English Department & Language and Cognitive Neuroscience Lab, School of Foreign Studies, Xi'an Jiaotong University, China
- Yang Zhang
- Department of Speech-Language-Hearing Sciences and Center for Neurobehavioral Development, University of Minnesota, Twin Cities, Minneapolis
22
Kurumada C, Roettger TB. Thinking probabilistically in the study of intonational speech prosody. Wiley Interdiscip Rev Cogn Sci 2021; 13:e1579. [PMID: 34599647] [DOI: 10.1002/wcs.1579]
Abstract
Speech prosody, the melodic and rhythmic properties of a language, plays a critical role in our everyday communication. Researchers have identified unique patterns of prosody that segment words and phrases, highlight focal elements in a sentence, and convey holistic meanings and speech acts that interact with the information shared in context. The mapping between the sound and meaning represented in prosody is suggested to be probabilistic: the same physical instance of sounds can support multiple meanings across talkers and contexts, while the same meaning can be encoded in physically distinct sound patterns (e.g., pitch movements). The current overview presents an analysis framework for probing the nature of this probabilistic relationship. Illustrated by examples from the literature and a dataset of German focus marking, we discuss the production variability within and across talkers and consider challenges that this variability imposes on the comprehension system. A better understanding of these challenges, we argue, will illuminate how the human perceptual, cognitive, and computational mechanisms may navigate the variability to arrive at a coherent understanding of speech prosody. The current paper is intended as an introduction for those who are interested in thinking probabilistically about the sound-meaning mapping in prosody. Open questions for future research are discussed with proposals for examining prosodic production and comprehension within a comprehensive, mathematically motivated framework of probabilistic inference under uncertainty. This article is categorized under: Linguistics > Language in Mind and Brain; Psychology > Language.
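The probabilistic sound-meaning mapping described here can be made concrete with a small Bayes-rule sketch; the cue distributions, prior, and semitone values below are invented for illustration and are not the article's German focus-marking data.

```python
# Probabilistic prosody sketch: infer the intended speech act from a pitch-rise
# cue via Bayes' rule, p(meaning | cue) ~ p(cue | meaning) * p(meaning).
from scipy.stats import norm

meanings = {"statement": (2.0, 1.5), "question": (6.0, 1.5)}  # rise mean, SD (st)
prior = {"statement": 0.7, "question": 0.3}

def posterior(cue_semitones):
    like = {m: norm.pdf(cue_semitones, mu, sd) * prior[m]
            for m, (mu, sd) in meanings.items()}
    z = sum(like.values())
    return {m: round(v / z, 2) for m, v in like.items()}

print(posterior(3.0))   # modest rise: the prior keeps "statement" likely
print(posterior(5.5))   # larger rise: the evidence overturns the prior
```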
Affiliation(s)
- Chigusa Kurumada
- Department of Brain and Cognitive Sciences, University of Rochester, Rochester, New York, USA
- Timo B Roettger
- Department of Linguistics & Scandinavian Studies, Universitetet i Oslo, Oslo, Norway
23
Luthra S, Mechtenberg H, Myers EB. Perceptual learning of multiple talkers requires additional exposure. Atten Percept Psychophys 2021; 83:2217-2228. [PMID: 33754298] [PMCID: PMC8217155] [DOI: 10.3758/s13414-021-02261-w]
Abstract
Because different talkers produce their speech sounds differently, listeners benefit from maintaining distinct generative models (sets of beliefs) about the correspondence between acoustic information and phonetic categories for different talkers. A robust literature on phonetic recalibration indicates that when listeners encounter a talker who produces their speech sounds idiosyncratically (e.g., a talker who produces their /s/ sound atypically), they can update their generative model for that talker. Such recalibration has been shown to occur in a relatively talker-specific way. Because listeners in ecological situations often meet several new talkers at once, the present study considered how the process of simultaneously updating two distinct generative models compares to updating one model at a time. Listeners were exposed to two talkers, one who produced /s/ atypically and one who produced /∫/ atypically. Critically, these talkers only produced these sounds in contexts where lexical information disambiguated the phoneme's identity (e.g., epi_ode, flouri_ing). When initial exposure to the two talkers was blocked by voice (Experiment 1), listeners recalibrated to these talkers after relatively little exposure to each talker (32 instances per talker, of which 16 contained ambiguous fricatives). However, when the talkers were intermixed during learning (Experiment 2), listeners required more exposure trials before they were able to adapt to the idiosyncratic productions of these talkers (64 instances per talker, of which 32 contained ambiguous fricatives). Results suggest that there is a perceptual cost to simultaneously updating multiple distinct generative models, potentially because listeners must first select which generative model to update.
Affiliation(s)
- Sahil Luthra
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- The Connecticut Institute for the Brain and Cognitive Sciences, Storrs, CT, USA
- Hannah Mechtenberg
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- Emily B Myers
- Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA
- The Connecticut Institute for the Brain and Cognitive Sciences, Storrs, CT, USA
- Department of Speech, Language and Hearing Sciences, University of Connecticut, Storrs, CT, USA
24
Encoding and decoding of meaning through structured variability in intonational speech prosody. Cognition 2021; 211:104619. [DOI: 10.1016/j.cognition.2021.104619]
25
Woodard K, Plate RC, Morningstar M, Wood A, Pollak SD. Categorization of Vocal Emotion Cues Depends on Distributions of Input. Affect Sci 2021; 2:301-310. [PMID: 33870212] [PMCID: PMC8035059] [DOI: 10.1007/s42761-021-00038-w]
Abstract
Learners use the distributional properties of stimuli to identify environmentally relevant categories in a range of perceptual domains, including words, shapes, faces, and colors. We examined whether similar processes may also operate on affective information conveyed through the voice. In Experiment 1, we tested how adults (18–22-year-olds) and children (8–10-year-olds) categorized affective states communicated by vocalizations varying continuously from “calm” to “upset.” We found that the threshold for categorizing both verbal (i.e., spoken word) and nonverbal (i.e., a yell) vocalizations as “upset” depended on the statistical distribution of the stimuli participants encountered. In Experiment 2, we replicated and extended these findings in adults using vocalizations that conveyed multiple negative affect states. These results suggest perceivers flexibly and rapidly update their interpretation of affective vocal cues based upon context.
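One simple way to picture distribution-sensitive categorization is a boundary that tracks the mean of the exemplars encountered so far. This toy rule is purely illustrative, with invented intensity values; it is not a model fit in the article.

```python
# Distribution-sensitive categorization (sketch): the "upset" boundary sits at
# the mean of the intensities encountered so far, so shifting the stimulus
# distribution shifts the category threshold for the same probe.
import numpy as np

def boundary_after_exposure(intensities):
    """Category threshold adapted to the encountered distribution."""
    return float(np.mean(intensities))

low_skew = [1, 2, 2, 3, 3, 4, 5, 6]     # mostly calm exemplars (1-10 scale)
high_skew = [5, 6, 7, 7, 8, 8, 9, 10]   # mostly upset exemplars
probe = 5.5
for name, dist in [("low-skew", low_skew), ("high-skew", high_skew)]:
    b = boundary_after_exposure(dist)
    label = "upset" if probe > b else "calm"
    print(f"{name}: boundary={b:.2f}, probe {probe} -> {label}")
```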
Affiliation(s)
- Kristina Woodard
- Department of Psychology, University of Wisconsin – Madison, 1500 Highland Avenue, Madison, WI 53705 USA
- Rista C. Plate
- Department of Psychology, University of Wisconsin – Madison, 1500 Highland Avenue, Madison, WI 53705 USA
- Department of Psychology, University of Pennsylvania, 3720 Walnut Street, Philadelphia, PA 19104 USA
- Adrienne Wood
- Department of Psychology, University of Virginia, 485 McCormick Rd, Charlottesville, VA 22904 USA
- Seth D. Pollak
- Department of Psychology, University of Wisconsin – Madison, 1500 Highland Avenue, Madison, WI 53705 USA
26
A graph-theoretic approach to identifying acoustic cues for speech sound categorization. Psychon Bull Rev 2021; 27:1104-1125. [PMID: 32671571] [DOI: 10.3758/s13423-020-01748-1]
Abstract
Human speech contains a wide variety of acoustic cues that listeners must map onto distinct phoneme categories. The large amount of information contained in these cues contributes to listeners' remarkable ability to accurately recognize speech across a variety of contexts. However, these cues vary across talkers, both in terms of how specific cue values map onto different phonemes and in terms of which cues individual talkers use most consistently to signal specific phonological contrasts. This creates a challenge for models that aim to characterize the information used to recognize speech. How do we balance the need to account for variability in speech sounds across a wide range of talkers with the need to avoid overspecifying which acoustic cues describe the mapping from speech sounds onto phonological distinctions? We present an approach using tools from graph theory that addresses this issue by creating networks describing connections between individual talkers and acoustic cues and by identifying subgraphs within these networks. This allows us to reduce the space of possible acoustic cues that signal a given phoneme to a subset that still accounts for variability across talkers, simplifying the model and providing insights into which cues are most relevant for specific phonemes. Classifiers trained on the subset of cue dimensions identified in the subgraphs provide fits to listeners' categorization that are similar to those obtained for classifiers trained on all cue dimensions, demonstrating that the subgraphs capture the cues necessary to categorize speech sounds.
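The talker-cue network idea can be sketched as a bipartite graph. In the sketch below, the cue names and reliability assignments are invented, and the selection criterion (cues linked to every talker) is a deliberate simplification of the article's subgraph analysis.

```python
# Talker-cue network sketch: a bipartite graph linking talkers to the acoustic
# cues they use reliably for a phoneme, then a search for a cue subset that is
# shared across all talkers.
import networkx as nx

G = nx.Graph()
reliable_cues = {                    # invented reliability judgments
    "talker1": ["VOT", "f0", "burst_amplitude"],
    "talker2": ["VOT", "f0"],
    "talker3": ["VOT", "vowel_duration", "f0"],
}
for talker, cues in reliable_cues.items():
    G.add_node(talker, bipartite=0)
    G.add_edges_from((talker, cue) for cue in cues)

cue_nodes = {n for n, d in G.nodes(data=True) if d.get("bipartite") != 0}
shared = [c for c in cue_nodes if G.degree(c) == len(reliable_cues)]
print("cues connected to every talker:", shared)   # VOT and f0
```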
27
Luthra S, Magnuson JS, Myers EB. Boosting lexical support does not enhance lexically guided perceptual learning. J Exp Psychol Learn Mem Cogn 2021; 47:685-704. [PMID: 33983786] [PMCID: PMC8287971] [DOI: 10.1037/xlm0000945]
Abstract
A challenge for listeners is to learn the appropriate mapping between acoustics and phonetic categories for an individual talker. Lexically guided perceptual learning (LGPL) studies have shown that listeners can leverage lexical knowledge to guide this process. For instance, listeners learn to interpret ambiguous /s/-/∫/ blends as /s/ if they have previously encountered them in /s/-biased contexts like epi?ode. Here, we examined whether the degree of preceding lexical support might modulate the extent of perceptual learning. In Experiment 1, we first demonstrated that perceptual learning could be obtained in a modified LGPL paradigm where listeners were first biased to interpret ambiguous tokens as one phoneme (e.g., /s/) and then later as another (e.g., /∫/). In subsequent experiments, we tested whether the extent of learning differed depending on whether targets encountered predictive contexts or neutral contexts prior to the auditory target (e.g., epi?ode). Experiment 2 used auditory sentence contexts (e.g., "I love The Walking Dead and eagerly await every new . . ."), whereas Experiment 3 used written sentence contexts. In Experiment 4, participants did not receive sentence contexts but rather saw the written form of the target word (episode) or filler text (########) prior to hearing the critical auditory token. While we consistently observed effects of context on in-the-moment processing of critical words, the size of the learning effect was not modulated by the type of context. We hypothesize that boosting lexical support through preceding context may not strongly influence perceptual learning when ambiguous speech sounds can be identified solely from lexical information.
Collapse
|
28
|
Luthra S. The Role of the Right Hemisphere in Processing Phonetic Variability Between Talkers. NEUROBIOLOGY OF LANGUAGE (CAMBRIDGE, MASS.) 2021; 2:138-151. [PMID: 37213418 PMCID: PMC10174361 DOI: 10.1162/nol_a_00028] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/17/2020] [Accepted: 11/13/2020] [Indexed: 05/23/2023]
Abstract
Neurobiological models of speech perception posit that both left and right posterior temporal brain regions are involved in the early auditory analysis of speech sounds. However, frank deficits in speech perception are not readily observed in individuals with right hemisphere damage. Instead, damage to the right hemisphere is often associated with impairments in vocal identity processing. Herein lies an apparent paradox: The mapping between acoustics and speech sound categories can vary substantially across talkers, so why might right hemisphere damage selectively impair vocal identity processing without obvious effects on speech perception? In this review, I attempt to clarify the role of the right hemisphere in speech perception through a careful consideration of its role in processing vocal identity. I review evidence showing that right posterior superior temporal, right anterior superior temporal, and right inferior / middle frontal regions all play distinct roles in vocal identity processing. In considering the implications of these findings for neurobiological accounts of speech perception, I argue that the recruitment of right posterior superior temporal cortex during speech perception may specifically reflect the process of conditioning phonetic identity on talker information. I suggest that the relative lack of involvement of other right hemisphere regions in speech perception may be because speech perception does not necessarily place a high burden on talker processing systems, and I argue that the extant literature hints at potential subclinical impairments in the speech perception abilities of individuals with right hemisphere damage.
Collapse
|
29
|
Luthra S, Correia JM, Kleinschmidt DF, Mesite L, Myers EB. Lexical Information Guides Retuning of Neural Patterns in Perceptual Learning for Speech. J Cogn Neurosci 2020; 32:2001-2012. [PMID: 32662731 PMCID: PMC8048099 DOI: 10.1162/jocn_a_01612] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
A listener's interpretation of a given speech sound can vary probabilistically from moment to moment. Previous experience (i.e., the contexts in which one has encountered an ambiguous sound) can further influence the interpretation of speech, a phenomenon known as perceptual learning for speech. This study used multivoxel pattern analysis to query how neural patterns reflect perceptual learning, leveraging archival fMRI data from a lexically guided perceptual learning study conducted by Myers and Mesite [Myers, E. B., & Mesite, L. M. Neural systems underlying perceptual adjustment to non-standard speech tokens. Journal of Memory and Language, 76, 80-93, 2014]. In that study, participants first heard ambiguous /s/-/ʃ/ blends in either /s/-biased lexical contexts (epi_ode) or /ʃ/-biased contexts (refre_ing); subsequently, they performed a phonetic categorization task on tokens from an /asi/-/aʃi/ continuum. In the current work, a classifier was trained to distinguish between phonetic categorization trials in which participants heard unambiguous productions of /s/ and those in which they heard unambiguous productions of /ʃ/. The classifier was able to generalize this training to ambiguous tokens from the middle of the continuum on the basis of individual participants' trial-by-trial perception. We take these findings as evidence that perceptual learning for speech involves neural recalibration, such that the pattern of activation approximates the perceived category. Exploratory analyses showed that left parietal regions (supramarginal and angular gyri) and right temporal regions (superior, middle, and transverse temporal gyri) were most informative for categorization. Overall, our results inform an understanding of how moment-to-moment variability in speech perception is encoded in the brain.
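The train-on-unambiguous, test-on-ambiguous logic described above can be sketched with scikit-learn; the simulated voxel patterns, trial counts, and injected signal below are hypothetical placeholders for the archival fMRI data, not the study's actual pipeline.

```python
# Illustrative MVPA sketch: train on clear tokens, generalize to ambiguous ones.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_voxels = 200

# Simulated voxel patterns: 40 unambiguous /s/ trials, 40 unambiguous /sh/ trials
X_clear = rng.normal(size=(80, n_voxels))
y_clear = np.repeat(["s", "sh"], 40)
X_clear[y_clear == "s", :10] += 0.5      # inject a weak category signal

clf = SVC(kernel="linear").fit(X_clear, y_clear)

# Generalize to ambiguous mid-continuum trials: the classifier's guesses can
# then be compared against each participant's trial-by-trial responses.
X_ambiguous = rng.normal(size=(20, n_voxels))
predicted_percepts = clf.predict(X_ambiguous)
print(predicted_percepts[:5])
```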
Collapse
Affiliation(s)
| | - João M Correia
- University of Algarve
- Basque Center on Cognition, Brain and Language
| | | | - Laura Mesite
- MGH Institute of Health Professions
- Harvard Graduate School of Education
| | | |
Collapse
|
30
|
Tanner J, Sonderegger M, Stuart-Smith J. Structured speaker variability in Japanese stops: Relationships within versus across cues to stop voicing. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 148:793. [PMID: 32872992 DOI: 10.1121/10.0001734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Accepted: 07/23/2020] [Indexed: 06/11/2023]
Abstract
A number of recent studies have observed that phonetic variability is constrained across speakers, where speakers exhibit limited variation in the signalling of phonological contrasts in spite of overall differences between speakers. This previous work focused predominantly on controlled laboratory speech and on contrasts in English and German, leaving unclear how such speaker variability is structured in spontaneous speech and in phonological contrasts that make substantial use of more than one acoustic cue. This study attempts to both address these empirical gaps and expand the empirical scope of research investigating structured variability by examining how speakers vary in the use of positive voice onset time and voicing during closure in marking the stop voicing contrast in Japanese spontaneous speech. Strong covarying relationships within each cue across speakers are observed, while between-cue relationships across speakers are much weaker, suggesting that structured variability is constrained by the language-specific phonetic implementation of linguistic contrasts.
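The within- versus between-cue comparison can be illustrated with a short simulation: speaker means covary strongly within a cue (VOT for voiced vs. voiceless stops) but only weakly across cues (VOT vs. closure voicing). All values and column names below are invented for illustration, not the study's data.

```python
# Toy demonstration of structured variability: within-cue vs. between-cue
# correlations across speakers.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_speakers = 50
base = rng.normal(0, 1, n_speakers)      # shared speaker-level VOT tendency

speakers = pd.DataFrame({
    "vot_voiceless":   30 + 8 * base + rng.normal(0, 2, n_speakers),
    "vot_voiced":     -20 + 8 * base + rng.normal(0, 2, n_speakers),
    "closure_voicing": rng.normal(0.6, 0.1, n_speakers),   # independent cue
})

within_cue = speakers["vot_voiceless"].corr(speakers["vot_voiced"])
between_cue = speakers["vot_voiceless"].corr(speakers["closure_voicing"])
print(f"within-cue r = {within_cue:.2f}, between-cue r = {between_cue:.2f}")
```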
Collapse
Affiliation(s)
- James Tanner
- Department of Linguistics, McGill University, Montreal, Quebec, Canada
| | | | - Jane Stuart-Smith
- Department of English Language and Linguistics, University of Glasgow, Glasgow, United Kingdom
| |
Collapse
|
31
|
Selecting among competing models of talker adaptation: Attention, cognition, and memory in speech processing efficiency. Cognition 2020; 204:104393. [PMID: 32688132 DOI: 10.1016/j.cognition.2020.104393] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2020] [Revised: 06/14/2020] [Accepted: 06/29/2020] [Indexed: 11/24/2022]
Abstract
Phonetic variability across talkers imposes additional processing costs during speech perception, often measured by performance decrements between single- and mixed-talker conditions. However, models differ in their predictions about whether accommodating greater phonetic variability (i.e., more talkers) imposes greater processing costs. We measured speech processing efficiency in a speeded word identification task, in which we manipulated the number of talkers (1, 2, 4, 8, or 16) listeners heard. Word identification was less efficient in every mixed-talker condition compared to the single-talker condition, but the magnitude of this performance decrement was not affected by the number of talkers. Furthermore, in a condition with uniform transition probabilities between two talkers, word identification was more efficient when the talker was the same as the prior trial compared to trials when the talker switched. These results support an auditory streaming model of talker adaptation, where processing costs associated with changing talkers result from attentional reorientation.
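As a rough sketch of the two comparisons reported above, the snippet below computes a mixed- versus single-talker cost and a talker-switch cost from a toy trial table. Analyses of processing efficiency in this literature use richer measures than mean reaction time, so treat this as a simplified stand-in with invented column names and values.

```python
# Simplified sketch: talker-variability cost and talker-switch cost.
import pandas as pd

trials = pd.DataFrame({
    "condition": ["single", "single", "mixed", "mixed", "mixed", "mixed"],
    "talker":    ["T1", "T1", "T1", "T2", "T2", "T1"],
    "rt_ms":     [520, 515, 560, 575, 550, 570],
})

# Cost of talker variability: mixed- vs. single-talker mean RT
print(trials.groupby("condition")["rt_ms"].mean())

# Switch cost within mixed-talker blocks: same- vs. different-talker trials
# (the first mixed trial counts as a switch in this toy version)
mixed = trials[trials["condition"] == "mixed"].copy()
mixed["switch"] = mixed["talker"] != mixed["talker"].shift()
print(mixed.groupby("switch")["rt_ms"].mean())
```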
Collapse
|
32
|
Hawkins RD, Frank MC, Goodman ND. Characterizing the Dynamics of Learning in Repeated Reference Games. Cogn Sci 2020; 44:e12845. [PMID: 32496603 DOI: 10.1111/cogs.12845] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2019] [Revised: 03/04/2020] [Accepted: 04/06/2020] [Indexed: 11/30/2022]
Abstract
The language we use over the course of conversation changes as we establish common ground and learn what our partner finds meaningful. Here we draw upon recent advances in natural language processing to provide a finer-grained characterization of the dynamics of this learning process. We release an open corpus (>15,000 utterances) of extended dyadic interactions in a classic repeated reference game task where pairs of participants had to coordinate on how to refer to initially difficult-to-describe tangram stimuli. We find that different pairs discover a wide variety of idiosyncratic but efficient and stable solutions to the problem of reference. Furthermore, these conventions are shaped by the communicative context: words that are more discriminative in the initial context (i.e., that are used for one target more than others) are more likely to persist through the final repetition. Finally, we find systematic structure in how a speaker's referring expressions become more efficient over time: Syntactic units drop out in clusters following positive feedback from the listener, eventually leaving short labels containing open-class parts of speech. These findings provide a higher resolution look at the quantitative dynamics of ad hoc convention formation and support further development of computational models of learning in communication.
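One way to picture the "discriminative words persist" finding is to score each word by how concentrated its early use is on a single target and then check whether it survives to the final repetition. The toy counts and names below are hypothetical, not the format of the released corpus.

```python
# Hypothetical sketch: discriminativeness of early referring expressions
# vs. persistence into the final repetition.
from collections import Counter

early_use = {            # word -> counts of use per tangram target, round 1
    "monk":  Counter({"tangram_A": 9, "tangram_B": 1}),
    "shape": Counter({"tangram_A": 5, "tangram_B": 5}),
}
final_round_words = {"monk"}

for word, counts in early_use.items():
    total = sum(counts.values())
    discriminativeness = max(counts.values()) / total   # 1.0 = one target only
    persisted = word in final_round_words
    print(f"{word}: discriminativeness={discriminativeness:.2f}, persisted={persisted}")
```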
Collapse
Affiliation(s)
| | | | - Noah D Goodman
- Department of Psychology, Stanford University; Department of Computer Science, Stanford University
| |
Collapse
|
33
|
Tanner J, Sonderegger M, Stuart-Smith J, Fruehwald J. Toward "English" Phonetics: Variability in the Pre-consonantal Voicing Effect Across English Dialects and Speakers. Front Artif Intell 2020; 3:38. [PMID: 33733155 PMCID: PMC7861323 DOI: 10.3389/frai.2020.00038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2019] [Accepted: 05/01/2020] [Indexed: 11/16/2022] Open
Abstract
Recent advances in access to spoken-language corpora and the development of speech processing tools have made "large-scale" phonetic and sociolinguistic research possible. This study illustrates the usefulness of such a large-scale approach, using data from multiple corpora across a range of English dialects, collected and analyzed within the SPADE project, to examine how the pre-consonantal Voicing Effect (longer vowels before voiced than voiceless obstruents, e.g., in bead vs. beat) is realized in spontaneous speech, and varies across dialects and individual speakers. Compared with previous reports of controlled laboratory speech, the Voicing Effect was found to be substantially smaller in spontaneous speech, but still influenced by the expected range of phonetic factors. Dialects of English differed substantially from each other in the size of the Voicing Effect, whilst individual speakers varied little relative to their particular dialect. This study demonstrates the value of large-scale phonetic research as a means of developing our understanding of the structure of speech variability, and illustrates how large-scale studies, such as those carried out within SPADE, can be applied to other questions in phonetic and sociolinguistic research.
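A study of this kind would typically quantify the Voicing Effect with a mixed-effects model; the sketch below, using statsmodels, shows one plausible specification with a random voicing slope by dialect and a speaker variance component. The file and column names are placeholders, not SPADE's actual schema.

```python
# Sketch of a mixed-effects analysis of the Voicing Effect (one plausible
# specification, not the paper's exact model).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("vowel_durations.csv")   # hypothetical file
# assumed columns: duration_ms, voicing ('voiced'/'voiceless'), dialect, speaker

model = smf.mixedlm(
    "duration_ms ~ C(voicing)",                 # fixed effect: following-consonant voicing
    data=df,
    groups=df["dialect"],                       # dialects vary in effect size
    re_formula="~C(voicing)",                   # random voicing slope by dialect
    vc_formula={"speaker": "0 + C(speaker)"},   # speaker component within dialect
)
print(model.fit().summary())
```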
Collapse
Affiliation(s)
- James Tanner
- Department of Linguistics, McGill University, Montreal, QC, Canada
| | | | - Jane Stuart-Smith
- Glasgow University Laboratory of Phonetics, University of Glasgow, Glasgow, United Kingdom
| | - Josef Fruehwald
- Department of Linguistics, University of Kentucky, Lexington, KY, United States
| |
Collapse
|
34
|
Jacewicz E, Fox RA. Perception of local and non-local vowels by adults and children in the South. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 147:627. [PMID: 32006983 PMCID: PMC7043861 DOI: 10.1121/10.0000542] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/23/2019] [Revised: 07/23/2019] [Accepted: 09/05/2019] [Indexed: 06/10/2023]
Abstract
This study assessed the ability of Southern listeners to accommodate extensive talker variability in identifying vowels in their local Appalachian community in the context of sound change. Building on prior work, the current experiment targeted a subset of spectrally overlapping vowels in local and two non-local varieties to establish whether adult and child listeners will demonstrate the local dialect advantage. Listeners responded to isolated target words, which minimized the interaction of multiple linguistic and dialect-specific features. For most vowel categories, the local dialect advantage was not demonstrated. However, adult listeners showed sensitivity to generational changes, indicating their familiarity with the local norms. A differential response pattern in children suggests that children perceived the vowels through the lens of their own experience with vowel production, representing a sound change in the community. Compared with the adults, children also relied more on stress cues, with increased confusions when the vowels were unstressed. The study provides evidence that identification accuracy is dependent upon the robustness of cues in individual vowel categories, whether local or non-local, and suggests that the bottom-up processes underlying phonetic vowel categorization in isolated monosyllables can interact with the top-down processing of dialect- and talker-specific information.
Collapse
Affiliation(s)
- Ewa Jacewicz
- Department of Speech and Hearing Science, The Ohio State University, 1070 Carmack Road, Columbus, Ohio 43210, USA
| | - Robert Allen Fox
- Department of Speech and Hearing Science, The Ohio State University, 1070 Carmack Road, Columbus, Ohio 43210, USA
| |
Collapse
|
35
|
Choi JY, Perrachione TK. Time and information in perceptual adaptation to speech. Cognition 2019; 192:103982. [PMID: 31229740 PMCID: PMC6732236 DOI: 10.1016/j.cognition.2019.05.019] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2018] [Revised: 05/11/2019] [Accepted: 05/25/2019] [Indexed: 11/18/2022]
Abstract
Perceptual adaptation to a talker enables listeners to efficiently resolve the many-to-many mapping between variable speech acoustics and abstract linguistic representations. However, models of speech perception have not delved into the variety or the quantity of information necessary for successful adaptation, nor how adaptation unfolds over time. In three experiments using speeded classification of spoken words, we explored how the quantity (duration), quality (phonetic detail), and temporal continuity of talker-specific context contribute to facilitating perceptual adaptation to speech. In single- and mixed-talker conditions, listeners identified phonetically-confusable target words in isolation or preceded by carrier phrases of varying lengths and phonetic content, spoken by the same talker as the target word. Word identification was always slower in mixed-talker conditions than single-talker ones. However, interference from talker variability decreased as the duration of preceding speech increased but was not affected by the amount of preceding talker-specific phonetic information. Furthermore, efficiency gains from adaptation depended on temporal continuity between preceding speech and the target word. These results suggest that perceptual adaptation to speech may be understood via models of auditory streaming, where perceptual continuity of an auditory object (e.g., a talker) facilitates allocation of attentional resources, resulting in more efficient perceptual processing.
Collapse
Affiliation(s)
- Ja Young Choi
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, MA, United States; Program in Speech and Hearing Bioscience and Technology, Harvard University, Cambridge, MA, United States
| | - Tyler K Perrachione
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, MA, United States.
| |
Collapse
|
36
|
Distributional learning for speech reflects cumulative exposure to a talker's phonetic distributions. Psychon Bull Rev 2019; 26:985-992. [PMID: 30604404 DOI: 10.3758/s13423-018-1551-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Efficient speech perception requires listeners to maintain an exquisite tension between stability of the language architecture and flexibility to accommodate variation in the input, such as that associated with individual talker differences in speech production. Achieving this tension can be guided by top-down learning mechanisms, wherein lexical information constrains interpretation of speech input, and by bottom-up learning mechanisms, in which distributional information in the speech signal is used to optimize the mapping to speech sound categories. An open question for theories of perceptual learning concerns the nature of the representations that are built for individual talkers: do these representations reflect long-term, global exposure to a talker or rather only short-term, local exposure? Recent research suggests that when lexical knowledge is used to resolve a talker's ambiguous productions, listeners disregard previous experience with a talker and instead rely on only recent experience, a finding that is contrary to predictions of Bayesian belief-updating accounts of perceptual adaptation. Here we use a distributional learning paradigm in which lexical information is not explicitly required to resolve ambiguous input to provide an additional test of global versus local exposure accounts. Listeners completed two blocks of phonetic categorization for stimuli that differed in voice-onset-time, a probabilistic cue to the voicing contrast in English stop consonants. In each block, two distributions were presented, one specifying /g/ and one specifying /k/. Across the two blocks, variance of the distributions was manipulated to be either narrow or wide. The critical manipulation was order of the two blocks; half of the listeners were first exposed to the narrow distributions followed by the wide distributions, with the order reversed for the other half of the listeners. The results showed that for earlier trials, the identification slope was steeper for the narrow-wide group compared to the wide-narrow group, but this difference was attenuated for later trials. The between-group convergence was driven by an asymmetry in learning between the two orders such that only those in the narrow-wide group showed slope movement during exposure, a pattern that was mirrored by computational simulations in which the distributional statistics of the present talker were integrated with prior experience with English. This pattern of results suggests that listeners did not disregard all prior experience with the talker, and instead used cumulative exposure to guide phonetic decisions, which raises the possibility that accommodating a talker's phonetic signature entails maintaining representations that reflect global experience.
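The global-versus-local contrast can be made concrete with a toy update rule: a cumulative listener integrates each block's VOT tokens with all prior exposure, while a local listener restarts from the generic prior. The sketch below updates only the category mean for brevity (the identification slope in the study depends on category variance, which would be updated analogously); all numbers are invented.

```python
# Toy cumulative vs. local belief updating for a /k/ category's mean VOT.
import numpy as np

def update_mean(prior_mean, prior_n, tokens):
    """Precision-weighted update of a category mean from new tokens."""
    n = len(tokens)
    post_mean = (prior_n * prior_mean + n * np.mean(tokens)) / (prior_n + n)
    return post_mean, prior_n + n

rng = np.random.default_rng(2)
narrow_block = rng.normal(60, 5, 100)    # /k/ tokens, narrow variance
wide_block = rng.normal(60, 15, 100)     # /k/ tokens, wide variance

# Cumulative (global) listener: block 2 is integrated with block 1
mean_, n_ = update_mean(prior_mean=65.0, prior_n=50, tokens=narrow_block)
mean_, n_ = update_mean(mean_, n_, wide_block)
print(f"cumulative estimate after both blocks: {mean_:.1f} ms (n={n_})")

# Local listener: block 2 starts from the generic prior again
mean_local, _ = update_mean(prior_mean=65.0, prior_n=50, tokens=wide_block)
print(f"local estimate after block 2 only: {mean_local:.1f} ms")
```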
Collapse
|