1
van Schoonhoven J, Rhebergen KS, Dreschler WA. A context-based model to predict the intelligibility of sentences in non-stationary noises. The Journal of the Acoustical Society of America 2024; 155:2849-2859. [PMID: 38682914 DOI: 10.1121/10.0025772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Accepted: 04/03/2024] [Indexed: 05/01/2024]
Abstract
The context-based Extended Speech Transmission Index (cESTI) (van Schoonhoven et al., 2022, J. Acoust. Soc. Am. 151, 1404-1415) was successfully applied to predict the intelligibility of monosyllabic words with different degrees of context in interrupted noise. The current study aimed to use the same model for the prediction of sentence intelligibility in different types of non-stationary noise. The necessary context factors and transfer functions were based on values found in existing literature. The cESTI performed similarly to or better than the original ESTI when noise had speech-like characteristics. We hypothesize that the remaining inaccuracies in model predictions can be attributed to the limits of the modelling approach with regard to mechanisms such as modulation masking and informational masking.
Affiliation(s)
- Jelmer van Schoonhoven
- Department of Clinical and Experimental Audiology, Amsterdam University Medical Center, 1105 AZ Amsterdam, The Netherlands
- Koenraad S Rhebergen
- Department of Otorhinolaryngology and Head & Neck Surgery, Rudolf Magnus Institute of Neuroscience, University Medical Center Utrecht, Postbus 85500, 3508 GA Utrecht, The Netherlands
- Wouter A Dreschler
- Department of Clinical and Experimental Audiology, Amsterdam University Medical Center, 1105 AZ Amsterdam, The Netherlands
2
Yang Y, Zeng FG. Syllable-rate-adjusted-modulation (SRAM) predicts clear and conversational speech intelligibility. Front Hum Neurosci 2024; 18:1324027. [PMID: 38410256 PMCID: PMC10895021 DOI: 10.3389/fnhum.2024.1324027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 01/17/2024] [Indexed: 02/28/2024] Open
Abstract
Introduction: Objectively predicting speech intelligibility is important in both telecommunication and human-machine interaction systems. The classic method relies on signal-to-noise ratios (SNR) to successfully predict speech intelligibility. One exception is clear speech, in which a talker intentionally articulates as if speaking to someone who has hearing loss or is from a different language background. As a result, at the same SNR, clear speech produces higher intelligibility than conversational speech. Despite numerous efforts, no objective metric can successfully predict the clear speech benefit at the sentence level. Methods: We proposed a Syllable-Rate-Adjusted-Modulation (SRAM) index to predict the intelligibility of clear and conversational speech. SRAM used as little as 1 s of speech and estimated its modulation power above the syllable rate. We compared SRAM with three reference metrics: envelope-regression-based speech transmission index (ER-STI), hearing-aid speech perception index version 2 (HASPI-v2) and short-time objective intelligibility (STOI), and five automatic speech recognition systems: Amazon Transcribe, Microsoft Azure Speech-To-Text, Google Speech-To-Text, wav2vec2 and Whisper. Results: SRAM outperformed the three reference metrics (ER-STI, HASPI-v2 and STOI) and the five automatic speech recognition systems. Additionally, we demonstrated the important role of syllable rate in predicting speech intelligibility by comparing SRAM with the total modulation power (TMP) that was not adjusted by the syllable rate. Discussion: SRAM can potentially help understand the characteristics of clear speech, screen speech materials with high intelligibility, and convert conversational speech into clear speech.
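The idea of measuring modulation power above a given rate can be illustrated with a minimal sketch. This is not the published SRAM computation: the function name `modulation_power_fraction_above`, the plain DFT of a precomputed envelope, the normalization by total modulation power, and the 5 Hz cutoff standing in for a syllable rate are all assumptions made for illustration.

```python
import cmath
import math


def modulation_power_fraction_above(envelope, sample_rate_hz, cutoff_hz):
    """Fraction of envelope modulation power at rates above cutoff_hz.

    Hypothetical helper: takes an already extracted temporal envelope,
    removes its mean, and compares DFT power above the cutoff with the
    total power over all non-zero modulation frequencies.
    """
    n = len(envelope)
    if n < 2:
        raise ValueError("envelope needs at least two samples")
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]

    total = 0.0
    above = 0.0
    # Naive DFT over positive frequencies (fine for short envelopes).
    for k in range(1, n // 2 + 1):
        freq_hz = k * sample_rate_hz / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        total += power
        if freq_hz > cutoff_hz:
            above += power
    return above / total if total > 0 else 0.0


# Synthetic 1 s envelope sampled at 100 Hz (assumed values): a 4 Hz
# component with amplitude 1.0 and a 12 Hz component with amplitude 0.5.
fs = 100
env = [
    1.0
    + math.sin(2 * math.pi * 4 * i / fs)
    + 0.5 * math.sin(2 * math.pi * 12 * i / fs)
    for i in range(fs)
]
fraction = modulation_power_fraction_above(env, fs, cutoff_hz=5.0)
# Power scales with amplitude squared, so the fraction is
# approximately 0.25 / (1.0 + 0.25) = 0.2.
```

With the cutoff set to a talker's syllable rate, a quantity of this kind would change as the talker speeds up or slows down, which is the adjustment that the unadjusted total modulation power lacks.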
Affiliation(s)
- Ye Yang
- Department of Biomedical Engineering, University of California, Irvine, Irvine, CA, United States
- Fan-Gang Zeng
- Department of Biomedical Engineering, University of California, Irvine, Irvine, CA, United States
- Department of Otolaryngology-Head and Neck Surgery, University of California, Irvine, Irvine, CA, United States
3
Kates JM. Extending the Hearing-Aid Speech Perception Index (HASPI): Keywords, sentences, and context. The Journal of the Acoustical Society of America 2023; 153:1662. [PMID: 37002064 PMCID: PMC10257526 DOI: 10.1121/10.0017546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Revised: 02/16/2023] [Accepted: 02/22/2023] [Indexed: 05/18/2023]
Abstract
The Hearing-Aid Speech Perception Index version 2 (HASPI v2) is a speech intelligibility metric derived by fitting subject responses scored as the proportion of complete sentences correct. This paper presents an extension of HASPI v2, denoted by HASPI w2, which predicts proportion keywords correct for the same datasets used to derive HASPI v2. The results show that the accuracy of HASPI w2 is nearly identical to that of HASPI v2. The values produced by HASPI w2 and HASPI v2 also allow the comparison of proportion words correct and sentences correct for the same stimuli. Using simulation values for speech in additive noise, a model of context effects for words combined into sentences is developed and accounts for the loss of intelligibility inherent in the impaired auditory periphery. In addition, HASPI w2 and HASPI v2 have a small bias term at poor signal-to-noise ratios; the model for context effects shows that the residual bias is reduced in converting from proportion keywords to sentences correct but is greatly magnified when considering the reverse transformation.
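Why a small bias shrinks in one direction and grows in the other can be seen with a simple power-law context model. The sketch below uses the j-factor relation (p_sentence = p_word ** j) as a stand-in; it is not necessarily the context model developed in the paper, and the function names and the value j = 4 are assumptions chosen only for illustration.

```python
def sentence_from_word(p_word: float, j: float) -> float:
    """Proportion of sentences correct from proportion of keywords correct.

    Assumed j-factor model: a sentence is correct when all of its
    j effectively independent parts are correct.
    """
    if not 0.0 <= p_word <= 1.0 or j <= 0:
        raise ValueError("need 0 <= p_word <= 1 and j > 0")
    return p_word ** j


def word_from_sentence(p_sentence: float, j: float) -> float:
    """Inverse transformation: keywords correct from sentences correct."""
    if not 0.0 <= p_sentence <= 1.0 or j <= 0:
        raise ValueError("need 0 <= p_sentence <= 1 and j > 0")
    return p_sentence ** (1.0 / j)


J = 4.0  # hypothetical number of independent parts per sentence

# A residual bias of 0.05 in keyword scores almost vanishes for sentences...
sentence_bias = sentence_from_word(0.05, J)   # 0.05 ** 4, about 6e-6

# ...while a residual bias of 0.01 in sentence scores becomes large for keywords.
word_bias = word_from_sentence(0.01, J)       # 0.01 ** 0.25, about 0.32
```

Under this assumed model, raising a small number to a power greater than one compresses it toward zero, while taking the corresponding root expands it, which matches the asymmetry described in the abstract.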
Affiliation(s)
- James M Kates
- Department of Speech, Language, and Hearing Sciences, University of Colorado, Boulder, Colorado 80309, USA
4
Kurowski A, Kotus J, Odya P, Kostek B. A Novel Method for Intelligibility Assessment of Nonlinearly Processed Speech in Spaces Characterized by Long Reverberation Times. Sensors 2022; 22:s22041641. [PMID: 35214543 PMCID: PMC8880044 DOI: 10.3390/s22041641] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/31/2021] [Revised: 02/11/2022] [Accepted: 02/17/2022] [Indexed: 01/27/2023]
Abstract
Objective assessment of speech intelligibility is a complex task that requires taking into account a number of factors, such as the different perception of each speech sub-band by the human auditory system or the different physical properties of each frequency band of a speech signal. Currently, the state-of-the-art method for assessing the quality of speech transmission is the speech transmission index (STI). It is a standardized way of objectively measuring the quality of, e.g., the acoustical adaptation of conference rooms or public address systems. The wide use of this measure and its implementation on numerous measurement devices make STI a popular choice when the speech-related quality of rooms has to be estimated. However, the STI measure has a significant drawback that excludes it from some use cases. For instance, if one would like to enhance speech intelligibility by employing a nonlinear digital processing algorithm, the STI method is not suitable for measuring the impact of such an algorithm, as it requires that the measurement signal not be altered in a nonlinear way. Consequently, if a nonlinear speech-enhancing algorithm has to be tested, the STI, a standard way of estimating speech transmission, cannot be used. In this work, we propose a method based on the STI but modified so that it can be used to estimate the performance of a nonlinear speech intelligibility enhancement method. The proposed approach is based on a broadband comparison of the cumulated energy of the transmitted envelope modulation and the received modulation, so we called it broadband STI (bSTI). Its credibility with regard to signals altered by the environment or speech changed nonlinearly by a DSP algorithm is checked by performing a comparative analysis of ten selected impulse responses for which a baseline value of STI was known.
Affiliation(s)
- Adam Kurowski
- Department of Multimedia Systems, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 11/12 Narutowicza Street, 80-233 Gdansk, Poland
- Jozef Kotus
- Department of Multimedia Systems, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 11/12 Narutowicza Street, 80-233 Gdansk, Poland
- Piotr Odya
- Department of Multimedia Systems, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 11/12 Narutowicza Street, 80-233 Gdansk, Poland
- Bozena Kostek
- Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, 11/12 Narutowicza Street, 80-233 Gdansk, Poland
5
Graetzer S, Hopkins C. Intelligibility prediction for speech mixed with white Gaussian noise at low signal-to-noise ratios. The Journal of the Acoustical Society of America 2021; 149:1346. [PMID: 33639794 DOI: 10.1121/10.0003557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2020] [Accepted: 01/28/2021] [Indexed: 06/12/2023]
Abstract
The effect of additive white Gaussian noise and high-pass filtering on speech intelligibility at signal-to-noise ratios (SNRs) from -26 to 0 dB was evaluated using British English talkers and normal hearing listeners. SNRs below -10 dB were considered as they are relevant to speech security applications. Eight objective metrics were assessed: short-time objective intelligibility (STOI), a proposed variant termed STOI+, extended short-time objective intelligibility (ESTOI), normalised covariance metric (NCM), normalised subband envelope correlation metric (NSEC), two metrics derived from the coherence speech intelligibility index (CSII), and an envelope-based regression method speech transmission index (STI). For speech and noise mixtures associated with intelligibility scores ranging from 0% to 98%, STOI+ performed at least as well as other metrics and, under some conditions, better than STOI, ESTOI, STI, NSEC, CSIIMid, and CSIIHigh. Both STOI+ and NCM were associated with relatively low prediction error and bias for intelligibility prediction at SNRs from -26 to 0 dB. STI performed least well in terms of correlation with intelligibility scores, prediction error, bias, and reliability. Logistic regression modeling demonstrated that high-pass filtering, which increases the proportion of high to low frequency energy, was detrimental to intelligibility for SNRs between -5 and -17 dB inclusive.
Affiliation(s)
- Simone Graetzer
- Acoustics Research Unit, School of Architecture, University of Liverpool, Liverpool L69 7ZN, United Kingdom
- Carl Hopkins
- Acoustics Research Unit, School of Architecture, University of Liverpool, Liverpool L69 7ZN, United Kingdom
6
Muralimanohar RK, Kates JM, Arehart KH. Using envelope modulation to explain speech intelligibility in the presence of a single reflection. The Journal of the Acoustical Society of America 2017; 141:EL482. [PMID: 28599537 DOI: 10.1121/1.4983630] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/07/2023]
Abstract
A single reflection is the simplest simulation of reverberation and provides insights into more complex scenarios of listening in rooms. This paper presents an analysis of the effects of a single reflection as its delay and intensity are systematically varied. The changes to the envelope modulations are analyzed using not only the traditional within-auditory-band analysis approach but also an across-band spectro-temporal analysis using cepstral correlation coefficients. The use of an auditory model allowed the extension of the simulations to include sensorineural hearing loss. Short delays did not interfere with the envelope modulations at low modulation rates (<16 Hz) or impact predicted intelligibility, while longer delays caused substantial distortion at these rates. The patterns of envelope modulation distortions caused by a single reflection were shown to be similar in models of normal hearing and hearing impairment.
Affiliation(s)
- Ramesh Kumar Muralimanohar
- Department of Speech, Language and Hearing Sciences, 2501 Kittredge Loop Road 409 UCB, University of Colorado, Boulder, Colorado 80309, USA
- James M Kates
- Department of Speech, Language and Hearing Sciences, 2501 Kittredge Loop Road 409 UCB, University of Colorado, Boulder, Colorado 80309, USA
- Kathryn H Arehart
- Department of Speech, Language and Hearing Sciences, 2501 Kittredge Loop Road 409 UCB, University of Colorado, Boulder, Colorado 80309, USA
7
Mechergui N, Djaziri-Larbi S, Jaïdane M. Speech based transmission index for all: An intelligibility metric for variable hearing ability. The Journal of the Acoustical Society of America 2017; 141:1470. [PMID: 28372108 DOI: 10.1121/1.4976628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
A method to measure speech intelligibility in public address systems for normal-hearing and hearing-impaired persons is presented. The proposed metric is an extension of the speech-based Speech Transmission Index to account for accurate perceptual masking and variable hearing ability: the sound excitation pattern generated at the ear is accurately computed using an auditory filter model, and its shape depends on frequency, sound level, and hearing impairment. This extension yields a better prediction of the intensity of auditory masking, which is used to rectify the modulation transfer function and thus to objectively assess the speech intelligibility experienced by hearing-impaired as well as normal-hearing persons in public spaces. The proposed metric was developed within the framework of the European Active and Assisted Living research program, and was labeled "SB-STI for All." Extensive subjective in-lab and in vivo tests have been conducted, and the proposed metric proved to have a good correlation with subjective intelligibility scores.
Affiliation(s)
- Nader Mechergui
- Université Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, Signals and Systems Lab, Tunis, Tunisia
- Sonia Djaziri-Larbi
- Université Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, Signals and Systems Lab, Tunis, Tunisia
- Mériem Jaïdane
- Université Tunis El Manar, Ecole Nationale d'Ingénieurs de Tunis, Signals and Systems Lab, Tunis, Tunisia
8
Kates JM, Arehart KH. Comparing the information conveyed by envelope modulation for speech intelligibility, speech quality, and music quality. The Journal of the Acoustical Society of America 2015; 138:2470-2482. [PMID: 26520329 PMCID: PMC4627935 DOI: 10.1121/1.4931899] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/16/2023]
Abstract
This paper uses mutual information to quantify the relationship between envelope modulation fidelity and perceptual responses. Data from several previous experiments that measured speech intelligibility, speech quality, and music quality are evaluated for normal-hearing and hearing-impaired listeners. A model of the auditory periphery is used to generate envelope signals, and envelope modulation fidelity is calculated using the normalized cross-covariance of the degraded signal envelope with that of a reference signal. Two procedures are used to describe the envelope modulation: (1) modulation within each auditory frequency band and (2) spectro-temporal processing that analyzes the modulation of spectral ripple components fit to successive short-time spectra. The results indicate that low modulation rates provide the highest information for intelligibility, while high modulation rates provide the highest information for speech and music quality. The low-to-mid auditory frequencies are most important for intelligibility, while mid frequencies are most important for speech quality and high frequencies are most important for music quality. Differences between the spectral ripple components used for the spectro-temporal analysis were not significant in five of the six experimental conditions evaluated. The results indicate that different modulation-rate and auditory-frequency weights may be appropriate for indices designed to predict different types of perceptual relationships.
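The normalized cross-covariance used as the fidelity measure can be sketched in a few lines. This is a generic zero-lag version applied to two equal-length envelope sequences; the auditory-periphery model, band splitting, and modulation filtering described in the abstract are not reproduced, and the function name and sample values are assumptions for illustration.

```python
import math


def normalized_cross_covariance(reference, degraded):
    """Zero-lag normalized cross-covariance of two envelope sequences.

    Returns a value in [-1, 1]: 1 when the degraded envelope is a
    positively scaled and offset copy of the reference, 0 when either
    envelope is constant or the two are uncorrelated.
    """
    if len(reference) != len(degraded) or len(reference) == 0:
        raise ValueError("envelopes must be non-empty and equal in length")
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_deg = sum(degraded) / n
    cov = sum((r - mean_ref) * (d - mean_deg) for r, d in zip(reference, degraded))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_deg = sum((d - mean_deg) ** 2 for d in degraded)
    denom = math.sqrt(var_ref * var_deg)
    return cov / denom if denom > 0 else 0.0


# Hypothetical envelope samples for one auditory band.
clean = [0.1, 0.5, 0.9, 0.4, 0.2, 0.7]
amplified = [2.0 * x + 1.0 for x in clean]   # gain and offset only
flattened = [0.5] * len(clean)               # modulation fully removed

fidelity_amplified = normalized_cross_covariance(clean, amplified)
fidelity_flattened = normalized_cross_covariance(clean, flattened)
```

Because the means are removed and the result is normalized by both variances, the measure ignores overall level changes and responds only to how well the modulation pattern is preserved.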
Affiliation(s)
- James M Kates
- Department of Speech Language and Hearing Sciences, University of Colorado, Boulder, Colorado 80309, USA
- Kathryn H Arehart
- Department of Speech Language and Hearing Sciences, University of Colorado, Boulder, Colorado 80309, USA
9
Keus van de Poll M, Carlsson J, Marsh JE, Ljung R, Odelius J, Schlittmeier SJ, Sundin G, Sörqvist P. Unmasking the effects of masking on performance: The potential of multiple-voice masking in the office environment. The Journal of the Acoustical Society of America 2015; 138:807-816. [PMID: 26328697 DOI: 10.1121/1.4926904] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
Broadband noise is often used as a masking sound to combat the negative consequences of background speech on performance in open-plan offices. As office workers generally dislike broadband noise, it is important to find alternatives that are more appreciated while being at least as effective. The purpose of experiment 1 was to compare broadband noise with two alternatives (multiple voices and water waves) in the context of a serial short-term memory task. A single voice impaired memory in comparison with silence, but when the single voice was masked with multiple voices, performance was on a par with silence. Experiment 2 explored the benefits of multiple-voice masking in more detail (by comparing one voice, three voices, five voices, and seven voices) in the context of word-processed writing (arguably a more office-relevant task). Performance (i.e., writing fluency) increased linearly from worst performance in the one-voice condition to best performance in the seven-voice condition. Psychological mechanisms underpinning these effects are discussed.
Affiliation(s)
- Marijke Keus van de Poll
- Department of Building, Energy and Environmental Engineering, University of Gävle, Kungsbäcksvägen 47, SE-801 76 Gävle, Sweden
- Johannes Carlsson
- Division of Applied Acoustics, Department of Civil and Environmental Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
- John E Marsh
- School of Psychology, University of Central Lancashire (UCLan), DB 115, Darwin Building, Preston, PR1 2HE, United Kingdom
- Robert Ljung
- Department of Building, Energy and Environmental Engineering, University of Gävle, Kungsbäcksvägen 47, SE-801 76 Gävle, Sweden
- Johan Odelius
- Department of Civil, Environmental and Natural Resources Engineering, Luleå University of Technology, Laboratorievägen 14, SE-971 87 Luleå, Sweden
- Sabine J Schlittmeier
- Work, Environmental and Health Psychology, Catholic University of Eichstätt-Ingolstadt, Ostenstraße 25, 85072 Eichstätt, Germany
- Gunilla Sundin
- Akustikon Team in Norconsult AB, Hantverkargatan 5, SE-112 21 Stockholm, Sweden
- Patrik Sörqvist
- Department of Building, Energy and Environmental Engineering, University of Gävle, Kungsbäcksvägen 47, SE-801 76 Gävle, Sweden