1. Lee J, Oxenham AJ. Testing the role of temporal coherence on speech intelligibility with noise and single-talker maskers. J Acoust Soc Am 2024;156:3285-3297. PMID: 39545746; PMCID: PMC11575144; DOI: 10.1121/10.0034420.
Abstract
Temporal coherence, where sounds with aligned timing patterns are perceived as a single source, is considered an essential cue in auditory scene analysis. However, its effects have been studied primarily with simple repeating tones, rather than speech. This study investigated the role of temporal coherence in speech by introducing across-frequency asynchronies. The effect of asynchrony on the intelligibility of target sentences was tested in the presence of background speech-shaped noise or a single-talker interferer. Our hypothesis was that disrupting temporal coherence should not only reduce intelligibility but also impair listeners' ability to segregate the target speech from an interfering talker, leading to greater degradation for speech-in-speech than speech-in-noise tasks. Stimuli were filtered into eight frequency bands, which were then desynchronized with delays of 0-120 ms. As expected, intelligibility declined as asynchrony increased. However, the decline was similar for both noise and single-talker maskers. Primarily target, rather than masker, asynchrony affected performance for both natural (forward) and reversed-speech maskers, and for target sentences with low and high semantic context. The results suggest that temporal coherence may not be as critical a cue for speech segregation as it is for the non-speech stimuli traditionally used in studies of auditory scene analysis.
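As an illustration of the stimulus manipulation described above, the sketch below splits a signal into log-spaced bands and delays each band independently. All parameter choices (band edges, filter order, random delay assignment) are assumptions for illustration, not the study's actual design.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def desynchronize(speech, fs, n_bands=8, max_delay_ms=120, seed=0):
    """Split a signal into frequency bands and delay each band independently.

    Illustrative only: the band edges (80-8000 Hz, log-spaced), filter
    order, and random delay assignment are assumptions, not the exact
    stimulus design used in the study.
    """
    rng = np.random.default_rng(seed)
    edges = np.geomspace(80, 8000, n_bands + 1)
    delays = rng.integers(0, int(max_delay_ms * fs / 1000) + 1, size=n_bands)
    out = np.zeros(len(speech) + int(delays.max()))
    for lo, hi, d in zip(edges[:-1], edges[1:], delays):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)      # zero-phase band-pass filtering
        out[d:d + len(band)] += band         # shift this band by its delay
    return out
```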
Affiliation(s)
- Jaeeun Lee
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Andrew J Oxenham
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
2. Chen F, Guo Q, Deng Y, Zhu J, Zhang H. Development of Mandarin Lexical Tone Identification in Noise and Its Relation With Working Memory. J Speech Lang Hear Res 2023;66:4100-4116. PMID: 37678219; DOI: 10.1044/2023_jslhr-22-00457.
Abstract
PURPOSE This study aimed to examine the developmental trajectory of Mandarin tone identification in quiet and two noisy conditions: speech-shaped noise (SSN) and multitalker babble noise. In addition, we evaluated the relationship between tonal identification development and working memory capacity. METHOD Ninety-three typically developing children aged 5-8 years and 23 young adults completed categorical identification of two tonal continua (Tone 1-4 and Tone 2-3) in quiet, SSN, and babble noise. Their working memory was additionally measured using auditory digit span tests. Correlation analyses between digit span scores and boundary widths were performed. RESULTS Six-year-old children had achieved adultlike categorical identification of the Tone 1-4 continuum under both types of noise. Moreover, 6-year-old children could identify the Tone 2-3 continuum as well as adults in SSN. Nonetheless, the child participants, even 8-year-olds, performed worse when tokens from the Tone 2-3 continuum were masked by babble noise. Greater working memory capacity was associated with better tone identification in noise for preschoolers aged 5-6 years; for school-age children aged 7-8 years, however, such a correlation existed only for the Tone 2-3 continuum in SSN. CONCLUSIONS Lexical tone perception might take longer to reach adultlike competence in babble noise than in SSN. Moreover, a significant interaction between masking type and stimulus difficulty was found, with the Tone 2-3 continuum more susceptible to interference from babble noise than the Tone 1-4 continuum. Furthermore, correlations between working memory capacity and tone perception in noise varied with developmental stage, stimulus difficulty, and masking type.
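The boundary-width measure correlated with digit span here is commonly obtained by fitting a logistic function to a continuum's identification curve and taking the 25%-75% span. A minimal sketch under that assumption (the study's exact definition may differ), using invented response proportions:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, x0, k):
    """Two-parameter logistic identification function."""
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

# Hypothetical proportions of "Tone 1" responses along a 7-step continuum.
steps = np.arange(1, 8)
p_tone1 = np.array([0.98, 0.95, 0.85, 0.55, 0.20, 0.07, 0.03])

(x0, k), _ = curve_fit(logistic, steps, p_tone1, p0=[4.0, -1.0])
# Boundary width taken here as the 25%-75% span of the fitted function:
# for a logistic, x(75%) - x(25%) = 2*ln(3)/|k|.
width = abs(2 * np.log(3) / k)
print(f"boundary location = {x0:.2f}, width = {width:.2f} steps")
```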
Affiliation(s)
- Fei Chen
- School of Foreign Languages, Hunan University, Changsha, China
- Qingqing Guo
- School of Foreign Languages, Hunan University, Changsha, China
- Yunhua Deng
- Foreign Studies College, Hunan Normal University, Changsha, China
- Jiaqiang Zhu
- Research Centre for Language, Cognition, and Neuroscience, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China
- Hao Zhang
- Center for Clinical Neurolinguistics, School of Foreign Languages and Literature, Shandong University, Jinan, China
3. Villard S, Perrachione TK, Lim SJ, Alam A, Kidd G. Energetic and informational masking place dissociable demands on listening effort: Evidence from simultaneous electroencephalography and pupillometry. J Acoust Soc Am 2023;154:1152-1167. PMID: 37610284; PMCID: PMC10449482; DOI: 10.1121/10.0020539.
Abstract
The task of processing speech masked by concurrent speech/noise can pose a substantial challenge to listeners. However, performance on such tasks may not directly reflect the amount of listening effort they elicit. Changes in pupil size and neural oscillatory power in the alpha range (8-12 Hz) are prominent neurophysiological signals known to reflect listening effort; however, measurements obtained through these two approaches are rarely correlated, suggesting that they may respond differently depending on the specific cognitive demands (and, by extension, the specific type of effort) elicited by specific tasks. This study aimed to compare changes in pupil size and alpha power elicited by different types of auditory maskers (highly confusable intelligible speech maskers, speech-envelope-modulated speech-shaped noise, and unmodulated speech-shaped noise maskers) in young, normal-hearing listeners. Within each condition, the target-to-masker ratio was set at the participant's individually estimated 75% correct point on the psychometric function. The speech masking condition elicited a significantly greater increase in pupil size than either of the noise masking conditions, whereas the unmodulated noise masking condition elicited a significantly greater increase in alpha oscillatory power than the speech masking condition, suggesting that the effort needed to solve these respective tasks may have different neural origins.
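As a rough illustration of the EEG measure, alpha-band (8-12 Hz) power can be estimated for a single channel with Welch's method. This generic sketch omits the study's actual pipeline (epoching, baselining, channel selection):

```python
import numpy as np
from scipy.signal import welch

def alpha_power(eeg, fs):
    """Mean PSD in the 8-12 Hz alpha band for one EEG channel.

    A generic estimate; the study's actual analysis pipeline is not
    reproduced here.
    """
    f, psd = welch(eeg, fs=fs, nperseg=2 * fs)   # 2-s windows -> 0.5 Hz bins
    band = (f >= 8) & (f <= 12)
    return psd[band].mean()

# Usage with simulated data: 60 s of noise plus a 10 Hz oscillation.
fs = 256
t = np.arange(60 * fs) / fs
eeg = np.random.randn(t.size) + 0.5 * np.sin(2 * np.pi * 10 * t)
print(f"alpha power ~ {alpha_power(eeg, fs):.3f}")
```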
Affiliation(s)
- Sarah Villard
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Tyler K Perrachione
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Sung-Joo Lim
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Ayesha Alam
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Gerald Kidd
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
4. Roverud E, Villard S, Kidd G. Strength of target source segregation cues affects the outcome of speech-on-speech masking experiments. J Acoust Soc Am 2023;153:2780. PMID: 37140176; PMCID: PMC10319449; DOI: 10.1121/10.0019307.
Abstract
In speech-on-speech listening experiments, some means of designating which talker is the "target" must be provided for the listener to perform better than chance. However, the relative strength of the segregation variables designating the target could affect the results of the experiment. Here, we examine the interaction of two source segregation variables, spatial separation and talker gender differences, and demonstrate that the relative strengths of these cues may affect the interpretation of the results. Participants listened to sentence pairs spoken by different-gender target and masker talkers, presented naturally or vocoded (degrading gender cues), either colocated or spatially separated. Target and masker words were temporally interleaved, in either an every-other-word or a randomized order of presentation, to eliminate energetic masking. Results showed that the order of interleaving had no effect on recall performance. For natural speech with strong talker gender cues, spatial separation of sources yielded no improvement in performance. For vocoded speech with degraded talker gender cues, performance improved significantly with spatial separation of sources. These findings reveal that listeners may shift among target source segregation cues contingent on cue viability. Finally, performance was poor when the target was designated after stimulus presentation, indicating strong reliance on the cues.
Affiliation(s)
- Elin Roverud
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Sarah Villard
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Gerald Kidd
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
5. Krueger M, Schulte M, Brand T. Assessing and Modeling Spatial Release From Listening Effort in Listeners With Normal Hearing: Reference Ranges and Effects of Noise Direction and Age. Trends Hear 2022;26:23312165221129407. PMID: 36285532; PMCID: PMC9618758; DOI: 10.1177/23312165221129407.
Abstract
Listening to speech in noisy environments is challenging and effortful. Factors such as the signal-to-noise ratio (SNR), the spatial separation between target speech and noise interferer(s), and possibly the listener's age might influence perceived listening effort (LE). This study measured and modeled the effect of the spatial separation of target speech and interfering stationary speech-shaped noise on perceived LE and its relation to the age of the listeners. Reference ranges for the relationship between subjectively perceived LE and SNR were established for different noise azimuths. For this purpose, 70 listeners with normal hearing, drawn from three age groups, rated perceived LE using the Adaptive Categorical Listening Effort Scaling method (ACALES; Krueger et al., 2017a) with speech from the front and noise from 0°, 90°, 135°, or 180° azimuth. Based on these data, the spatial release from listening effort (SRLE) was calculated. The noise azimuth had a strong effect on SRLE, with the highest release at 135°. The binaural speech intelligibility model (BSIM2020; Hauth et al., 2020) predicted SRLE very well at negative SNRs but overestimated it at positive SNRs. No significant effect of age was found on the subjective ratings, so the reference ranges were determined independently of age. These reference ranges can be used for the classification of LE measurements. However, when the increase of perceived LE with SNR was analyzed, a significant difference was found between the youngest and oldest groups of listeners in the upper range of the LE function.
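Conceptually, SRLE is obtained by reading off the SNR needed to reach a fixed effort rating in the colocated and separated conditions and taking the difference. A sketch with invented rating values (the scale values and SNRs below are illustrative, not ACALES data):

```python
import numpy as np

def snr_at_effort(snr, effort, target_effort):
    """SNR at which the rated effort equals `target_effort`.

    np.interp needs an increasing x-axis, so the (decreasing)
    effort-vs-SNR function is flipped before interpolating.
    """
    return np.interp(target_effort, effort[::-1], snr[::-1])

def srle(snr, effort_colocated, effort_separated, target_effort=7.0):
    """Spatial release from listening effort at one fixed effort rating."""
    return (snr_at_effort(snr, effort_colocated, target_effort)
            - snr_at_effort(snr, effort_separated, target_effort))

# Hypothetical ratings on a categorical effort scale (higher = more effort).
snr = np.array([-12.0, -8.0, -4.0, 0.0, 4.0, 8.0])    # dB SNR
effort_colocated = np.array([13, 11, 9, 7, 5, 3])      # noise at 0 deg
effort_separated = np.array([11, 9, 7, 5, 3, 2])       # noise at 135 deg
print(f"SRLE = {srle(snr, effort_colocated, effort_separated):.1f} dB")
```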
Affiliation(s)
- Melanie Krueger
- Hörzentrum Oldenburg gGmbH, Oldenburg, Germany
- Thomas Brand
- Medizinische Physik, Department für Medizinische Physik und Akustik, Fakultät VI, Carl-von-Ossietzky Universität Oldenburg, Oldenburg, Germany
6. Cho AY, Kidd G. Auditory motion as a cue for source segregation and selection in a "cocktail party" listening environment. J Acoust Soc Am 2022;152:1684. PMID: 36182296; PMCID: PMC9489258; DOI: 10.1121/10.0013990.
Abstract
Source motion was examined as a cue for segregating concurrent speech or noise sources. In two different headphone-based tasks, motion detection (MD) and speech-on-speech masking (SI), one source among three was designated as the target only by imposing sinusoidal variation in azimuth during the stimulus presentation. For MD, the listener was asked which of the three concurrent sources was in motion during the trial. For SI, the listener was asked to report the words spoken by the moving speech source. MD performance improved as the amplitude of the sinusoidal motion (i.e., displacement in azimuth) increased over the range of values tested (±5° to ±30°) for both modulated noise and speech targets, with better performance found for speech. SI performance also improved as the amplitude of target motion increased. Furthermore, SI performance improved as word position progressed throughout the sentence. Performance on the MD task was correlated with performance on the SI task across individual subjects. For the SI conditions tested here, these findings are consistent with the proposition that listeners first detect the moving target source, then focus attention on the target location as the target sentence unfolds.
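The motion cue is a sinusoidal azimuth trajectory applied during the stimulus. The sketch below generates such a trajectory and substitutes a crude constant-power pan for the head-related transfer functions actually used; the motion rate and panning law are assumptions:

```python
import numpy as np

def moving_azimuth(n_samples, fs, amplitude_deg=30.0, rate_hz=1.0):
    """Sinusoidal azimuth trajectory for a 'moving' source.

    Only the trajectory is sketched; the motion rate is assumed.
    """
    t = np.arange(n_samples) / fs
    return amplitude_deg * np.sin(2 * np.pi * rate_hz * t)

def pan_binaural(x, azimuth_deg):
    """Crude constant-power panning standing in for HRTF filtering."""
    theta = np.radians(azimuth_deg)
    left = x * np.sqrt((1 - np.sin(theta)) / 2)
    right = x * np.sqrt((1 + np.sin(theta)) / 2)
    return np.stack([left, right], axis=0)   # (2, n_samples): L and R

fs = 44100
x = np.random.randn(fs)                      # 1 s of noise as a stand-in
az = moving_azimuth(x.size, fs, amplitude_deg=30.0)
stereo = pan_binaural(x, az)
```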
Affiliation(s)
- Adrian Y Cho
- Speech and Hearing Bioscience and Technology Program, Harvard University, Cambridge, Massachusetts 02138, USA
- Gerald Kidd
- Department of Speech, Language, and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
7. Jagadeesh AB, Uppunda AK. Effect of Age on Informational Masking: Differential Effects of Phonetic and Semantic Information in the Masker. Am J Audiol 2022;31:707-718. PMID: 35926084; DOI: 10.1044/2022_aja-22-00029.
Abstract
PURPOSE Speech recognition in noise is a ubiquitous problem for older listeners. Speech, the masker most commonly encountered in the real world, causes greater masking than noise maskers do, a phenomenon called informational masking (IM). This is due to the lexical-semantic and/or acoustic-phonetic information present in speech maskers. In this study, we aimed to examine age-related differences in speech recognition and in the magnitude of IM when the maskers varied in the type of linguistic information they carried. METHOD In 30 young and 30 older individuals, we measured the signal-to-noise ratio required for 50% correct identification under four-talker babble (lexical-semantic and acoustic-phonetic information), four-talker reversed babble (predominantly acoustic-phonetic information), and speech-shaped noise (SSN; energetic). RESULTS In both groups, the four-talker babble caused the greatest masking effect (worst performance), whereas the SSN resulted in the least masking effect (best performance). The effectiveness of IM due to lexical-semantic information was comparable between the two groups. However, the effectiveness of IM due to acoustic-phonetic information was significantly higher in the older listeners, causing worse performance. CONCLUSIONS The greater effectiveness of IM due to acoustic-phonetic information (worse performance) could reflect the minimal-to-mild high-frequency hearing loss and the consequent temporal processing deficits observed in the older listeners. However, older listeners may be able to employ compensatory mechanisms (such as life experience, contextual cues, or increased listening effort, among others) to overcome some of these deficits. SUPPLEMENTAL MATERIAL https://doi.org/10.23641/asha.20405730.
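The 50%-correct SNR is typically estimated with an adaptive track, and a 1-down/1-up staircase converges on that point. The sketch below runs such a track against a simulated listener; the step size, start level, and stopping rule are illustrative choices, not necessarily the study's procedure:

```python
import random

def staircase_srt(simulate_trial, start_snr=0.0, step=2.0, n_reversals=8):
    """1-down/1-up adaptive track converging on ~50% correct.

    `simulate_trial(snr) -> bool` stands in for a listener's response.
    Step size, start level, and stopping rule are illustrative choices.
    """
    snr, last_correct, reversals = start_snr, None, []
    while len(reversals) < n_reversals:
        correct = simulate_trial(snr)
        if last_correct is not None and correct != last_correct:
            reversals.append(snr)            # direction change = reversal
        snr += -step if correct else step    # down after correct, up after error
        last_correct = correct
    return sum(reversals[2:]) / len(reversals[2:])  # mean of later reversals

# A hypothetical listener whose true 50% point lies at -6 dB SNR.
listener = lambda snr: random.random() < 1 / (1 + 10 ** (-(snr + 6) / 4))
print(f"estimated SRT ~ {staircase_srt(listener):.1f} dB SNR")
```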
Affiliation(s)
- Anoop Basavanahalli Jagadeesh
- Auditory Neuroscience Laboratory, Roxelyn and Richard Pepper Department of Communication Sciences and Disorders, Northwestern University, Evanston, IL
8. Huet MP, Micheyl C, Gaudrain E, Parizet E. Vocal and semantic cues for the segregation of long concurrent speech stimuli in diotic and dichotic listening: The Long-SWoRD test. J Acoust Soc Am 2022;151:1557. PMID: 35364949; DOI: 10.1121/10.0007225.
Abstract
It is not always easy to follow a conversation in a noisy environment. To distinguish between two speakers, a listener must mobilize many perceptual and cognitive processes to maintain attention on a target voice and avoid shifting attention to the background noise. The development of an intelligibility task with long stimuli, the Long-SWoRD test, is introduced. This protocol allows participants to fully benefit from cognitive resources, such as semantic knowledge, to separate two talkers in a realistic listening environment. Moreover, this task also provides the experimenters with a means to infer fluctuations in auditory selective attention. Two experiments document the performance of normal-hearing listeners in situations where the perceptual separability of the competing voices ranges from easy to hard, using a combination of voice and binaural cues. The results show a strong effect of voice differences when the voices are presented diotically. In addition, analyzing the influence of the semantic context on the pattern of responses indicates that the semantic information induces a response bias both in situations where the competing voices are distinguishable and in situations where they are indistinguishable from one another.
Affiliation(s)
- Moïra-Phoebé Huet
- Laboratory of Vibration and Acoustics, National Institute of Applied Sciences, University of Lyon, 20 Avenue Albert Einstein, Villeurbanne, 69100, France
- Etienne Gaudrain
- Lyon Neuroscience Research Center, Auditory Cognition and Psychoacoustics, Centre National de la Recherche Scientifique UMR5292, Institut National de la Santé et de la Recherche Médicale U1028, Université Claude Bernard Lyon 1, Université de Lyon, Centre Hospitalier Le Vinatier, Neurocampus, 95 boulevard Pinel, Bron Cedex, 69675, France
- Etienne Parizet
- Laboratory of Vibration and Acoustics, National Institute of Applied Sciences, University of Lyon, 20 Avenue Albert Einstein, Villeurbanne, 69100, France
9. Byrne AJ, Conroy C, Kidd G.
Abstract
Identification of speech from a "target" talker was measured in a speech-on-speech masking task with two simultaneous "masker" talkers. The overall level of each talker was either fixed or randomized throughout each stimulus presentation to investigate the effectiveness of level as a cue for segregating competing talkers and attending to the target. Experimental manipulations included varying the level difference between talkers and imposing three types of target level uncertainty: (1) fixed target level across trials, (2) random target level across trials, or (3) random target levels on a word-by-word basis within a trial. When the target level was predictable, performance was better than in corresponding conditions in which the target level was uncertain. Masker confusions were consistent with a high degree of informational masking (IM). Furthermore, evidence was found for "tuning" in level and a level "release" from IM. These findings suggest that conforming to listener expectations about relative level, in addition to cues signaling talker identity, facilitates segregating, and maintaining focus of attention on, a specific talker in multiple-talker communication situations.
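The three uncertainty conditions amount to three schemes for assigning a presentation level to each target word. A minimal sketch, with the nominal level and rove range as placeholder values:

```python
import random

def word_levels(n_words, condition, nominal_db=60.0, rove_db=10.0):
    """Assign a presentation level to each target word in a trial.

    Mirrors the three uncertainty schemes described in the abstract;
    the nominal level and rove range are illustrative values.
    """
    if condition == "fixed":                 # same level every trial
        return [nominal_db] * n_words
    if condition == "random_trial":          # one random level per trial
        level = nominal_db + random.uniform(-rove_db, rove_db)
        return [level] * n_words
    if condition == "random_word":           # new random level each word
        return [nominal_db + random.uniform(-rove_db, rove_db)
                for _ in range(n_words)]
    raise ValueError(condition)

print(word_levels(5, "random_word"))
```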
Affiliation(s)
- Andrew J Byrne
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA
- Christopher Conroy
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA
- Gerald Kidd
- Department of Speech, Language, & Hearing Sciences, Boston University, MA, USA; Department of Otolaryngology, Head-Neck Surgery, Medical University of South Carolina, Charleston, SC, USA
10. DeRoy Milvae K, Kuchinsky SE, Stakhovskaya OA, Goupell MJ. Dichotic listening performance and effort as a function of spectral resolution and interaural symmetry. J Acoust Soc Am 2021;150:920. PMID: 34470337; PMCID: PMC8346288; DOI: 10.1121/10.0005653.
Abstract
One potential benefit of bilateral cochlear implants is reduced listening effort in speech-on-speech masking situations. However, the symmetry of the input across ears, possibly related to spectral resolution, could impact binaural benefits. Fifteen young adults with normal hearing performed digit recall with target and interfering digits presented to separate ears and attention directed to the target ear. Recall accuracy and pupil size over time (used as an index of listening effort) were measured for unprocessed, 16-channel vocoded, and 4-channel vocoded digits. Recall accuracy was significantly lower for dichotic (with interfering digits) than for monotic listening. Dichotic recall accuracy was highest when the target was less degraded and the interferer was more degraded. With matched target and interferer spectral resolution, pupil dilation was lower with more degradation. Pupil dilation grew more shallowly over time when the interferer had more degradation. Overall, interferer spectral resolution more strongly affected listening effort than target spectral resolution. These results suggest that interfering speech both lowers performance and increases listening effort, and that the relative spectral resolution of target and interferer affect the listening experience. Ignoring a clearer interferer is more effortful.
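Noise vocoding of the kind used here (16 vs. 4 channels) follows a standard recipe: band-split the signal, extract each band's envelope, and use the envelopes to modulate band-limited noise carriers. A minimal sketch, with band spacing, filter order, and envelope cutoff chosen for illustration rather than taken from the study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=4):
    """Noise vocoder: per-band envelopes modulate band-limited noise.

    Band spacing, filter order, and envelope smoothing are assumptions
    standing in for the study's actual vocoder parameters.
    """
    edges = np.geomspace(100, 8000, n_channels + 1)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                 # Hilbert envelope
        sos_lp = butter(2, 50, btype="low", fs=fs, output="sos")
        env = sosfiltfilt(sos_lp, env)              # smooth to <50 Hz
        carrier = sosfiltfilt(sos, np.random.randn(x.size))
        out += env * carrier                        # re-impose the envelope
    return out
```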
Affiliation(s)
- Kristina DeRoy Milvae
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
- Stefanie E Kuchinsky
- Audiology and Speech Pathology Center, Walter Reed National Military Medical Center, Bethesda, Maryland 20889, USA
- Olga A Stakhovskaya
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
- Matthew J Goupell
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
11. Lavandier M, Mason CR, Baltzell LS, Best V. Individual differences in speech intelligibility at a cocktail party: A modeling perspective. J Acoust Soc Am 2021;150:1076. PMID: 34470293; PMCID: PMC8561716; DOI: 10.1121/10.0005851.
Abstract
This study aimed at predicting individual differences in speech reception thresholds (SRTs) in the presence of symmetrically placed competing talkers for young listeners with sensorineural hearing loss. An existing binaural model incorporating the individual audiogram was revised to handle severe hearing losses by (a) taking as input the target speech level at SRT in a given condition and (b) introducing a floor in the model to limit extreme negative better-ear signal-to-noise ratios. The floor value was first set using SRTs measured with stationary and modulated noises. The model was then used to account for individual variations in SRTs found in two previously published data sets that used speech maskers. The model accounted well for the variation in SRTs across listeners with hearing loss, based solely on differences in audibility. When considering listeners with normal hearing, the model could predict the best SRTs, but not the poorer SRTs, suggesting that other factors limit performance when audibility (as measured with the audiogram) is not compromised.
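The key model revision, a floor limiting extreme negative better-ear SNRs, reduces to a pair of elementwise maxima. A sketch with an arbitrary floor value (the value actually fitted in the study is not reproduced here):

```python
import numpy as np

def better_ear_snr(snr_left_db, snr_right_db, floor_db=-20.0):
    """Per-band better-ear SNR with a floor on extreme negative values.

    The floor mirrors the revision described in the abstract; its value
    here (-20 dB) is an arbitrary stand-in, not the fitted one.
    """
    best = np.maximum(snr_left_db, snr_right_db)  # better ear in each band
    return np.maximum(best, floor_db)             # limit extreme negatives

# Hypothetical per-band SNRs (dB) at the two ears.
left = np.array([-35.0, -12.0, 3.0])
right = np.array([-28.0, -15.0, -1.0])
print(better_ear_snr(left, right))                # -> [-20. -12.   3.]
```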
Affiliation(s)
- Mathieu Lavandier
- Univ. Lyon, ENTPE, Laboratoire de Tribologie et Dynamique des Systèmes UMR 5513, Rue Maurice Audin, F-69518 Vaulx-en-Velin Cedex, France
- Christine R Mason
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Lucas S Baltzell
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
- Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, Boston, Massachusetts 02215, USA
12. Liu JS, Liu YW, Yu YF, Galvin JJ, Fu QJ, Tao DD. Segregation of competing speech in adults and children with normal hearing and in children with cochlear implants. J Acoust Soc Am 2021;150:339. PMID: 34340485; DOI: 10.1121/10.0005597.
Abstract
Children with normal hearing (CNH) have greater difficulty segregating competing speech than do adults with normal hearing (ANH). Children with cochlear implants (CCI) have greater difficulty segregating competing speech than do CNH. In the present study, speech reception thresholds (SRTs) in competing speech were measured in Mandarin-speaking ANH, CNH, and CCI groups. Target sentences were produced by a male Mandarin-speaking talker. Maskers were time-forward or -reversed sentences produced by a native Mandarin-speaking male (different from the target), a native Mandarin-speaking female, or a non-native English-speaking male. The SRTs were lowest (best) for the ANH group, followed by the CNH and CCI groups. The masking release (MR) was comparable between the ANH and CNH groups but much poorer in the CCI group. The temporal properties differed between the native and non-native maskers and between forward and reversed speech. The temporal properties of the maskers were significantly associated with the SRTs for the CCI and CNH groups but not for the ANH group. Whereas the temporal properties of the maskers were significantly associated with the MR for all three groups, the association was stronger for the CCI and CNH groups than for the ANH group.
Affiliation(s)
- Ji-Sheng Liu
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
- Yang-Wenyi Liu
- Department of Otology and Skull Base Surgery, Eye Ear Nose and Throat Hospital, Fudan University, Shanghai 200031, China
- Ya-Feng Yu
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
- John J Galvin
- House Ear Institute, Los Angeles, California 90057, USA
- Qian-Jie Fu
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California Los Angeles (UCLA), Los Angeles, California 90095, USA
- Duo-Duo Tao
- Department of Ear, Nose, and Throat, The First Affiliated Hospital of Soochow University, Suzhou 215006, China
13. Wang X, Xu L. Speech perception in noise: Masking and unmasking. J Otol 2021;16:109-119. PMID: 33777124; PMCID: PMC7985001; DOI: 10.1016/j.joto.2020.12.001.
Abstract
Speech perception is essential for daily communication. Background noise or concurrent talkers, however, can make it challenging for listeners to track the target speech (i.e., the cocktail party problem). The present study reviews and compares existing findings on speech perception and unmasking in cocktail party listening environments in English and Mandarin Chinese. The review starts with an introduction, followed by related concepts of auditory masking. The next two sections review factors that release speech perception from masking in English and Mandarin Chinese, respectively. The last section presents an overall summary of the findings with comparisons between the two languages. Future research directions are also discussed with respect to differences between the two languages in the literature on the reviewed topic.
Affiliation(s)
- Xianhui Wang
- Communication Sciences and Disorders, Ohio University, Athens, OH, 45701, USA
- Li Xu
- Communication Sciences and Disorders, Ohio University, Athens, OH, 45701, USA
14. Thomas M, Galvin JJ, Fu QJ. Interactions among talker sex, masker number, and masker intelligibility in speech-on-speech recognition. JASA Express Lett 2021;1:015203. PMID: 33589889; PMCID: PMC7850016; DOI: 10.1121/10.0003051.
Abstract
In competing speech, recognition of target speech may be limited by the number and characteristics of maskers, which produce energetic, envelope, and/or informational masking. In this study, speech recognition thresholds (SRTs) were measured with one, two, or four maskers. The target and masker sex was the same or different, and SRTs were measured with time-forward or time-reversed maskers. SRTs were significantly affected by target-masker sex differences with time-forward maskers, but not with time-reversed maskers. The multi-masker penalty was much greater with time-reversed maskers than with time-forward maskers when there were more than two talkers.
Affiliation(s)
- Mathew Thomas
- Department of Head and Neck Surgery, David Geffen School of Medicine, 10833 Le Conte Avenue, University of California, Los Angeles, Los Angeles, California 90095, USA
- John J Galvin
- House Ear Institute, 2100 West 3rd Street, Suite 111, Los Angeles, California 90057, USA
- Qian-Jie Fu
- Department of Head and Neck Surgery, David Geffen School of Medicine, 10833 Le Conte Avenue, University of California, Los Angeles, Los Angeles, California 90095, USA
15. Rodriguez B, Lee J, Lutfi R. Additivity of segregation cues in simulated cocktail-party listening. J Acoust Soc Am 2021;149:82. PMID: 33514184; PMCID: PMC7787694; DOI: 10.1121/10.0002991.
Abstract
An approach is borrowed from Measurement Theory [Krantz et al. (1971). Foundations of Measurement (Academic, New York), Vol. 1] to evaluate the interaction of voice fundamental frequency and spatial cues in the segregation of talkers in simulated cocktail-party listening. The goal is to find a mathematical expression whereby the combined effect of cues can be simply related to their individual effects. On each trial, the listener judged whether an interleaved sequence of four vowel triplets (heard over headphones) was spoken by the same (MMM) or different (FMF) talkers. The talkers had nominally different fundamental frequencies and spoke from nominally different locations (simulated using head-related transfer functions). Natural variation in these cues was simulated by adding a small, random perturbation to the nominal values independently for each vowel on each trial. Psychometric functions (PFs) relating d' performance to the difference in nominal values were obtained for the cues presented individually and in combination. The results revealed a synergistic interaction of cues wherein the PFs for cues presented in combination exceeded the simple vector sum of the PFs for the cues presented individually. The results are discussed in terms of their implications for possible emergent properties of cues affecting performance in simulated cocktail-party listening.
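The "simple vector sum" benchmark comes from signal detection theory: for independent cues, optimal combination predicts d'_comb = sqrt(d'_f0^2 + d'_spatial^2), and observed performance above that line is the synergy reported. A sketch with invented d' values:

```python
import numpy as np

def predicted_combined_dprime(d_f0, d_spatial):
    """Optimal-combination prediction for independent cues.

    Under standard signal detection assumptions the d' values add
    vectorially; observed values above this prediction indicate the
    synergy described in the abstract.
    """
    return np.hypot(d_f0, d_spatial)

# Hypothetical d' values at matched cue-difference steps.
d_f0 = np.array([0.5, 1.0, 1.5])
d_spatial = np.array([0.4, 0.9, 1.3])
print(predicted_combined_dprime(d_f0, d_spatial))  # vector-sum prediction
```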
Affiliation(s)
- Briana Rodriguez
- Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
- Jungmee Lee
- Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
- Robert Lutfi
- Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
16. Mesik J, Wojtczak M. Effects of noise precursors on the detection of amplitude and frequency modulation for tones in noise. J Acoust Soc Am 2020;148:3581. PMID: 33379905; PMCID: PMC8097715; DOI: 10.1121/10.0002879.
Abstract
Recent studies on amplitude modulation (AM) detection for tones in noise reported that AM-detection thresholds improve when the AM stimulus is preceded by a noise precursor. The physiological mechanisms underlying this AM unmasking are unknown. One possibility is that adaptation to the level of the noise precursor facilitates AM encoding by causing a shift in neural rate-level functions to optimize level encoding around the precursor level. The aims of this study were to investigate whether such a dynamic-range adaptation is a plausible mechanism for the AM unmasking and whether frequency modulation (FM), thought to be encoded via AM, also exhibits the unmasking effect. Detection thresholds for AM and FM of tones in noise were measured with and without a fixed-level precursor. Listeners showing the unmasking effect were then tested with the precursor level roved over a wide range to modulate the effect of adaptation to the precursor level on the detection of the subsequent AM. It was found that FM detection benefits from a precursor and the magnitude of FM unmasking correlates with that of AM unmasking. Moreover, consistent with dynamic-range adaptation, the unmasking magnitude weakens as the level difference between the precursor and simultaneous masker of the tone increases.
Affiliation(s)
- Juraj Mesik
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Magdalena Wojtczak
- Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455, USA
17. Kidd G, Jennings TR, Byrne AJ. Enhancing the perceptual segregation and localization of sound sources with a triple beamformer. J Acoust Soc Am 2020;148:3598. PMID: 33379918; PMCID: PMC8097713; DOI: 10.1121/10.0002779.
Abstract
A triple beamformer was developed to exploit the capabilities of the binaural auditory system. The goal was to enhance the perceptual segregation of spatially separated sound sources while preserving source localization. The triple beamformer comprised a variant of a standard single-channel beamformer that routes the primary beam output focused on the target source location to both ears. The triple beam algorithm adds two supplementary beams with the left-focused beam routed only to the left ear and the right-focused beam routed only to the right ear. The rationale for the approach is that the triple beam processing exploits sound source segregation in high informational masking (IM) conditions. Furthermore, the exaggerated interaural level differences produced by the triple beam are well-suited for categories of listeners (e.g., bilateral cochlear implant users) who receive limited benefit from interaural time differences. The performance with the triple beamformer was compared to normal binaural hearing (simulated using a Knowles Electronic Manikin for Auditory Research, G.R.A.S. Sound and Vibration, Holte, DK) and to that obtained from a single-channel beamformer. Source localization in azimuth and masked speech identification for multiple masker locations were measured for all three algorithms. Taking both localization and speech intelligibility into account, the triple beam algorithm was considered to be advantageous under high IM listening conditions.
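The routing idea can be sketched with a plain delay-and-sum beamformer: a target-steered beam feeds both ears, while left- and right-steered beams feed only their own ears, exaggerating interaural level differences. Everything below (array geometry, steering angles, integer-sample delays) is an illustrative stand-in for the actual algorithm:

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def delay_and_sum(mics, mic_x, fs, azimuth_deg):
    """Steer a linear array to one azimuth by delaying and summing.

    `mics` is (n_mics, n_samples); `mic_x` are mic positions in meters.
    Fractional delays are rounded to whole samples for simplicity.
    """
    delays = mic_x * np.sin(np.radians(azimuth_deg)) / C
    shifts = np.round(delays * fs).astype(int)
    return sum(np.roll(ch, -s) for ch, s in zip(mics, shifts)) / len(mics)

def triple_beam(mics, mic_x, fs, target_az=0.0, side_az=60.0):
    """Triple-beam routing sketch; the steering angles are assumptions.

    Target beam feeds both ears; left/right beams feed only their own
    ear.
    """
    target = delay_and_sum(mics, mic_x, fs, target_az)
    left = delay_and_sum(mics, mic_x, fs, -side_az)
    right = delay_and_sum(mics, mic_x, fs, +side_az)
    return np.stack([target + left, target + right])  # L, R ear signals

fs = 16000
mic_x = np.array([-0.05, 0.0, 0.05])   # 3-mic linear array (meters)
mics = np.random.randn(3, fs)          # 1 s of noise per mic, as a stand-in
ears = triple_beam(mics, mic_x, fs)    # (2, n_samples): L and R
```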
Affiliation(s)
- Gerald Kidd
- Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Todd R Jennings
- Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Andrew J Byrne
- Department of Speech, Language and Hearing Sciences and Hearing Research Center, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
18. Villard S, Kidd G. Assessing the benefit of acoustic beamforming for listeners with aphasia using modified psychoacoustic methods. J Acoust Soc Am 2020;148:2894. PMID: 33261373; PMCID: PMC8097716; DOI: 10.1121/10.0002454.
Abstract
Acoustic beamforming has been shown to improve identification of target speech in noisy listening environments for individuals with sensorineural hearing loss. This study examined whether beamforming would provide a similar benefit for individuals with aphasia (acquired neurological language impairment). The benefit of beamforming was examined for persons with aphasia (PWA) and age- and hearing-matched controls in both a speech masking condition and a speech-shaped, speech-modulated noise masking condition. Performance was measured when natural spatial cues were provided, as well as when the target speech level was enhanced via a single-channel beamformer. Because typical psychoacoustic methods may present substantial experimental confounds for PWA, clinically guided modifications of experimental procedures were determined individually for each PWA participant. Results indicated that the beamformer provided a significant overall benefit to listeners. On an individual level, both PWA and controls who exhibited poorer performance on the speech masking condition with spatial cues benefited from the beamformer, while those who achieved better performance with spatial cues did not. All participants benefited from the beamformer in the noise masking condition. The findings suggest that a spatially tuned hearing aid may be beneficial for older listeners with relatively mild hearing loss who have difficulty taking advantage of spatial cues.
Affiliation(s)
- Sarah Villard
- Department of Speech, Language, and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Gerald Kidd
- Department of Speech, Language, and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
19.
Abstract
Being able to pick out particular sounds, such as speech, against a background of other sounds represents one of the key tasks performed by the auditory system. Understanding how this happens is important because speech recognition in noise is particularly challenging for older listeners and for people with hearing impairments. Central to this ability is the capacity of neurons to adapt to the statistics of sounds reaching the ears, which helps to generate noise-tolerant representations of sounds in the brain. In more complex auditory scenes, such as a cocktail party, where the background noise comprises other voices, sound features associated with each source have to be grouped together and segregated from those belonging to other sources. This depends on precise temporal coding and modulation of cortical response properties when attending to a particular speaker in a multi-talker environment. Furthermore, the neural processing underlying auditory scene analysis is shaped by experience over multiple timescales.
20. Zhang J, Wang X, Wang NY, Fu X, Gan T, Galvin JJ, Willis S, Xu K, Thomas M, Fu QJ. Tonal Language Speakers Are Better Able to Segregate Competing Speech According to Talker Sex Differences. J Speech Lang Hear Res 2020;63:2801-2810. PMID: 32692939; PMCID: PMC7872724; DOI: 10.1044/2020_jslhr-19-00421.
Abstract
Purpose The aim of this study was to compare release from masking (RM) between Mandarin-speaking and English-speaking listeners with normal hearing for competing speech when target-masker sex cues, spatial cues, or both were available. Method Speech recognition thresholds (SRTs) for competing speech were measured in 21 Mandarin-speaking and 15 English-speaking adults with normal hearing using a modified coordinate response measure task. SRTs were measured for target sentences produced by a male talker in the presence of two masker talkers (different male talkers or female talkers). The target sentence was always presented directly in front of the listener, and the maskers were either colocated with the target or were spatially separated from the target (+90°, -90°). Stimuli were presented via headphones and were virtually spatialized using head-related transfer functions. Three masker conditions were used to measure RM relative to the baseline condition: (a) talker sex cues, (b) spatial cues, or (c) combined talker sex and spatial cues. Results The results showed large amounts of RM according to talker sex and/or spatial cues. There was no significant difference in SRTs between Chinese and English listeners for the baseline condition, where no talker sex or spatial cues were available. Furthermore, there was no significant difference in RM between Chinese and English listeners when spatial cues were available. However, RM was significantly larger for Chinese listeners when talker sex cues or combined talker sex and spatial cues were available. Conclusion Listeners who speak a tonal language such as Mandarin Chinese may be able to take greater advantage of talker sex cues than listeners who do not speak a tonal language.
Affiliation(s)
- Juan Zhang
- Department of Otolaryngology, Head and Neck Surgery, Beijing Chaoyang Hospital, Capital Medical University, China
- Xing Wang
- Department of Otolaryngology, Head and Neck Surgery, Beijing Chaoyang Hospital, Capital Medical University, China
- Ning-yu Wang
- Department of Otolaryngology, Head and Neck Surgery, Beijing Chaoyang Hospital, Capital Medical University, China
- Xin Fu
- Department of Otolaryngology, Head and Neck Surgery, Beijing Chaoyang Hospital, Capital Medical University, China
- Tian Gan
- Department of Otolaryngology, Head and Neck Surgery, Beijing Chaoyang Hospital, Capital Medical University, China
- Shelby Willis
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California, Los Angeles
- Kevin Xu
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California, Los Angeles
- Mathew Thomas
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California, Los Angeles
- Qian-Jie Fu
- Department of Head and Neck Surgery, David Geffen School of Medicine, University of California, Los Angeles
21. Paulus M, Hazan V, Adank P. The relationship between talker acoustics, intelligibility, and effort in degraded listening conditions. J Acoust Soc Am 2020;147:3348. PMID: 32486777; DOI: 10.1121/10.0001212.
Abstract
Listening to degraded speech is associated with decreased intelligibility and increased effort. However, listeners are generally able to adapt to certain types of degradations. While intelligibility of degraded speech is modulated by talker acoustics, it is unclear whether talker acoustics also affect effort and adaptation. Moreover, it has been demonstrated that talker differences are preserved across spectral degradations, but it is not known whether this effect extends to temporal degradations and which acoustic-phonetic characteristics are responsible. In a listening experiment combined with pupillometry, participants were presented with speech in quiet as well as in masking noise, time-compressed, and noise-vocoded speech by 16 Southern British English speakers. Results showed that intelligibility, but not adaptation, was modulated by talker acoustics. Talkers who were more intelligible under noise-vocoding were also more intelligible under masking and time-compression. This effect was linked to acoustic-phonetic profiles with greater vowel space dispersion (VSD) and energy in mid-range frequencies, as well as slower speaking rate. While pupil dilation indicated increasing effort with decreasing intelligibility, this study also linked reduced effort in quiet to talkers with greater VSD. The results emphasize the relevance of talker acoustics for intelligibility and effort in degraded listening conditions.
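Vowel space dispersion (VSD) is often operationalized as the mean Euclidean distance of a talker's F1/F2 vowel means from their centroid. The sketch below uses that convention with invented formant values; the paper's exact measure (e.g., scaling) may differ:

```python
import numpy as np

def vowel_space_dispersion(formants):
    """Mean Euclidean distance of F1/F2 points from their centroid.

    One common operationalization of VSD; the paper's exact measure
    (e.g., Hz vs. Bark scaling) may differ.
    """
    formants = np.asarray(formants, dtype=float)  # shape (n_vowels, 2)
    centroid = formants.mean(axis=0)
    return np.linalg.norm(formants - centroid, axis=1).mean()

# Hypothetical F1/F2 means (Hz) for /i/, /a/, /u/ from one talker.
talker = [(280, 2250), (710, 1100), (310, 870)]
print(f"VSD = {vowel_space_dispersion(talker):.0f} Hz")
```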
Affiliation(s)
- Maximillian Paulus
- Speech, Hearing and Phonetic Sciences, University College London, London, United Kingdom
- Valerie Hazan
- Speech, Hearing and Phonetic Sciences, University College London, London, United Kingdom
- Patti Adank
- Speech, Hearing and Phonetic Sciences, University College London, London, United Kingdom
22. Conroy C, Best V, Jennings TR, Kidd G. The importance of processing resolution in "ideal time-frequency segregation" of masked speech and the implications for predicting speech intelligibility. J Acoust Soc Am 2020;147:1648. PMID: 32237827; PMCID: PMC7075715; DOI: 10.1121/10.0000893.
Abstract
Ideal time-frequency segregation (ITFS) is a signal processing technique that may be used to estimate the energetic and informational components of speech-on-speech masking. A core assumption of ITFS is that it roughly emulates the effects of energetic masking (EM) in a speech mixture. Thus, when speech identification thresholds are measured for ITFS-processed stimuli and compared to thresholds for unprocessed stimuli, the difference can be attributed to informational masking (IM). Interpreting this difference as a direct metric of IM, however, is complicated by the fine time-frequency (T-F) resolution typically used during ITFS, which may yield target "glimpses" that are too narrow/brief to be resolved by the ear in the mixture. Estimates of IM, therefore, may be inflated because the full effects of EM are not accounted for. Here, T-F resolution was varied during ITFS to determine if/how estimates of IM depend on processing resolution. Speech identification thresholds were measured for speech and noise maskers after ITFS. Reduced frequency resolution yielded poorer thresholds for both masker types. Reduced temporal resolution did so for noise maskers only. Results suggest that processing resolution strongly influences estimates of IM and implies that current approaches to predicting masked speech intelligibility should be modified to account for IM.
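ITFS itself is an ideal-binary-mask operation: keep the mixture's time-frequency units where the local target-to-masker ratio exceeds a criterion and discard the rest, with the STFT window setting the T-F resolution at issue here. A sketch with assumed parameter values:

```python
import numpy as np
from scipy.signal import stft, istft

def itfs(target, masker, fs, lc_db=0.0, nperseg=512):
    """Ideal time-frequency segregation via an ideal binary mask.

    Retains mixture T-F units whose local target-to-masker ratio
    exceeds the criterion `lc_db`; `nperseg` sets the frequency/time
    resolution the abstract shows is critical. Parameter values are
    assumptions, not the study's settings.
    """
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, M = stft(masker, fs=fs, nperseg=nperseg)
    mask = (np.abs(T) ** 2) > (np.abs(M) ** 2) * 10 ** (lc_db / 10)
    _, _, X = stft(target + masker, fs=fs, nperseg=nperseg)
    _, y = istft(X * mask, fs=fs, nperseg=nperseg)
    return y
```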
Affiliation(s)
- Christopher Conroy
- Department of Speech, Language and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Virginia Best
- Department of Speech, Language and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Todd R Jennings
- Department of Speech, Language and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
- Gerald Kidd
- Department of Speech, Language and Hearing Sciences, Boston University, 635 Commonwealth Avenue, Boston, Massachusetts 02215, USA
23. Villard S, Kidd G. Effects of Acquired Aphasia on the Recognition of Speech Under Energetic and Informational Masking Conditions. Trends Hear 2019;23:2331216519884480. PMID: 31694486; PMCID: PMC7000861; DOI: 10.1177/2331216519884480.
Abstract
Persons with aphasia (PWA) often report difficulty understanding spoken language in noisy environments that require listeners to identify and selectively attend to target speech while ignoring competing background sounds or “maskers.” This study compared the performance of PWA and age-matched healthy controls (HC) on a masked speech identification task and examined the consequences of different types of masking on performance. Twelve PWA and 12 age-matched HC completed a speech identification task comprising three conditions designed to differentiate between the effects of energetic and informational masking on receptive speech processing. The target and masker speech materials were taken from a closed-set matrix-style corpus, and a forced-choice word identification task was used. Target and maskers were spatially separated from one another in order to simulate real-world listening environments and allow listeners to make use of binaural cues for source segregation. Individualized frequency-specific gain was applied to compensate for the effects of hearing loss. Although both groups showed similar susceptibility to the effects of energetic masking, PWA were more susceptible than age-matched HC to the effects of informational masking. Results indicate that this increased susceptibility cannot be attributed to age, hearing loss, or comprehension deficits and is therefore a consequence of acquired cognitive-linguistic impairments associated with aphasia. This finding suggests that aphasia may result in increased difficulty segregating target speech from masker speech, which in turn may have implications for the ability of PWA to comprehend target speech in multitalker environments, such as restaurants, family gatherings, and other everyday situations.
Affiliation(s)
- Sarah Villard
- Department of Speech, Language & Hearing Sciences, Boston University, MA, USA
- Gerald Kidd
- Department of Speech, Language & Hearing Sciences, Boston University, MA, USA