1. Shahidi LK, Collins LM, Mainsah BO. Objective intelligibility measurement of reverberant vocoded speech for normal-hearing listeners: Towards facilitating the development of speech enhancement algorithms for cochlear implants. The Journal of the Acoustical Society of America 2024; 155:2151-2168. [PMID: 38501923 PMCID: PMC10959555 DOI: 10.1121/10.0025285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 12/29/2023] [Accepted: 02/24/2024] [Indexed: 03/20/2024]
Abstract
Cochlear implant (CI) recipients often struggle to understand speech in reverberant environments. Speech enhancement algorithms could restore speech perception for CI listeners by removing reverberant artifacts from the CI stimulation pattern. Listening studies, either with cochlear-implant recipients or normal-hearing (NH) listeners using a CI acoustic model, provide a benchmark for speech intelligibility improvements conferred by the enhancement algorithm but are costly and time consuming. To reduce the associated costs during algorithm development, speech intelligibility could be estimated offline using objective intelligibility measures. Previous evaluations of objective measures that considered CIs primarily assessed the combined impact of noise and reverberation and employed highly accurate enhancement algorithms. To facilitate the development of enhancement algorithms, we evaluate twelve objective measures in reverberant-only conditions characterized by a gradual reduction of reverberant artifacts, simulating the performance of an enhancement algorithm during development. Measures are validated against the performance of NH listeners using a CI acoustic model. To enhance compatibility with reverberant CI-processed signals, measure performance was assessed after modifying the reference signal and spectral filterbank. Measures leveraging the speech-to-reverberant ratio, cepstral distance and, after modifying the reference or filterbank, envelope correlation are strong predictors of intelligibility for reverberant CI-processed speech.
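To make the idea of an envelope-correlation measure concrete, here is a minimal sketch in plain Python. It is not the implementation of any of the twelve measures evaluated in the paper: the function names, the per-band list layout, and the plain averaging across bands are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_envelope_correlation(ref_bands, test_bands):
    """Average the per-band correlation between reference and degraded envelopes.

    ref_bands / test_bands: one list of envelope samples per frequency band
    (hypothetical layout; a real measure would first apply a filterbank and
    envelope extraction to the reference and reverberant signals).
    """
    scores = [pearson(ref, test) for ref, test in zip(ref_bands, test_bands)]
    return sum(scores) / len(scores)
```

A score near 1 means the degraded envelopes track the reference closely. Which signal serves as the reference and which filterbank defines the bands are exactly the choices the abstract reports modifying.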
Affiliation(s)
- Lidea K Shahidi
  - Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27701, USA
- Leslie M Collins
  - Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27701, USA
- Boyla O Mainsah
  - Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina 27701, USA
2. Shahidi LK, Collins LM, Mainsah BO. Application of a Graphical Model to Investigate the Utility of Cross-channel Information for Mitigating Reverberation in Cochlear Implants. Proceedings of the ... International Conference on Machine Learning and Applications. International Conference on Machine Learning and Applications 2018; 2018:847-852. [PMID: 32016173 DOI: 10.1109/icmla.2018.00136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Individuals with cochlear implants (CIs) experience more difficulty understanding speech in reverberant environments than normal hearing listeners. As a result, recent research has targeted mitigating the effects of late reverberant signal reflections in CIs by using a machine learning approach to detect and delete affected segments in the CI stimulus pattern. Previous work has trained electrode-specific classification models to mitigate late reverberant signal reflections based on features extracted from only the acoustic activity within the electrode of interest. Since adjacent CI electrodes tend to be activated concurrently during speech, we hypothesized that incorporating additional information from the other electrode channels, termed cross-channel information, as features could improve classification performance. Cross-channel information extracted in real-world conditions will likely contain errors that will impact classification performance. To simulate extracting cross-channel information in realistic conditions, we developed a graphical model based on the Ising model to systematically introduce errors to specific types of cross-channel information. The Ising-like model allows us to add errors while maintaining the important geometric information contained in cross-channel information, which is due to the spectro-temporal structure of speech. Results suggest the potential utility of leveraging cross-channel information to improve the performance of the reverberation mitigation algorithm from the baseline channel-based features, even when the cross-channel information contains errors.
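The abstract does not give the form of the Ising-like model, so the following is only a rough sketch of how an Ising-style sampler can add spatially clustered errors to a binary channel-by-time map. The energy function, the Metropolis update, and every parameter name (`coupling`, `field`, `temperature`, `sweeps`) are assumptions, not the authors' formulation.

```python
import math
import random


def ising_corrupt(mask, coupling=1.0, field=1.0, temperature=1.0, sweeps=5, seed=0):
    """Return a noisy copy of a binary map (rows = channels, columns = frames).

    Assumed energy: E = -coupling * sum_neighbours(s_i * s_j)
                        - field * sum(s_i * truth_i),
    with spins in {-1, +1}. The coupling term makes flipped units cluster;
    the field term pulls the sample back towards the true map.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(mask), len(mask[0])
    truth = [[1 if unit else -1 for unit in row] for row in mask]
    spins = [row[:] for row in truth]  # start from the error-free map

    for _ in range(sweeps * n_rows * n_cols):
        r = rng.randrange(n_rows)
        c = rng.randrange(n_cols)
        # Sum of the 4-connected neighbours that lie inside the grid.
        neighbour_sum = 0
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n_rows and 0 <= cc < n_cols:
                neighbour_sum += spins[rr][cc]
        # Energy change if the spin at (r, c) were flipped.
        delta_e = 2.0 * spins[r][c] * (coupling * neighbour_sum + field * truth[r][c])
        # Metropolis rule: always accept downhill moves, sometimes uphill ones.
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
            spins[r][c] = -spins[r][c]

    return [[1 if s > 0 else 0 for s in row] for row in spins]
```

In this sketch, lowering `field` or raising `temperature` produces more errors, while a larger `coupling` makes them clump into contiguous regions rather than scatter independently.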
Affiliation(s)
- Lidea K Shahidi
  - Electrical and Computer Engineering, Duke University, Durham, USA
- Leslie M Collins
  - Electrical and Computer Engineering, Duke University, Durham, USA
- Boyla O Mainsah
  - Electrical and Computer Engineering, Duke University, Durham, USA
3. Bentsen T, Kressner AA, Dau T, May T. The impact of exploiting spectro-temporal context in computational speech segregation. The Journal of the Acoustical Society of America 2018; 143:248. [PMID: 29390791 DOI: 10.1121/1.5020273] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/07/2023]
Abstract
Computational speech segregation aims to automatically segregate speech from interfering noise, often by employing ideal binary mask estimation. Several studies have tried to exploit contextual information in speech to improve mask estimation accuracy by using two frequently-used strategies that (1) incorporate delta features and (2) employ support vector machine (SVM) based integration. In this study, two experiments were conducted. In Experiment I, the impact of exploiting spectro-temporal context using these strategies was investigated in stationary and six-talker noise. In Experiment II, the delta features were explored in detail and tested in a setup that considered novel noise segments of the six-talker noise. Computing delta features led to higher intelligibility than employing SVM based integration and intelligibility increased with the amount of spectral information exploited via the delta features. The system did not, however, generalize well to novel segments of this noise type. Measured intelligibility was subsequently compared to extended short-term objective intelligibility, hit-false alarm rate, and the amount of mask clustering. None of these objective measures alone could account for measured intelligibility. The findings may have implications for the design of speech segregation systems, and for the selection of a cost function that correlates with intelligibility.
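As a rough illustration of what delta features capture, the sketch below computes first-order differences of a time-frequency feature map along both the time and frequency axes. The study's actual delta definition (for example, its regression window) is not stated in the abstract, so the simple backward difference and the `features[frame][channel]` layout are assumptions.

```python
def delta_features(features):
    """Return (delta_time, delta_freq) for a features[frame][channel] map.

    Each delta is a backward difference; the first frame (for time) and the
    first channel (for frequency) have no predecessor and are set to 0.0.
    """
    n_frames = len(features)
    n_channels = len(features[0]) if features else 0
    delta_time = [[0.0] * n_channels for _ in range(n_frames)]
    delta_freq = [[0.0] * n_channels for _ in range(n_frames)]
    for t in range(n_frames):
        for f in range(n_channels):
            if t > 0:
                # Change relative to the previous frame in the same channel.
                delta_time[t][f] = features[t][f] - features[t - 1][f]
            if f > 0:
                # Change relative to the adjacent lower channel in the same frame.
                delta_freq[t][f] = features[t][f] - features[t][f - 1]
    return delta_time, delta_freq
```

Appending these deltas to the per-unit features gives a classifier access to the local spectro-temporal context, which is the kind of information the abstract reports as increasing intelligibility.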
Affiliation(s)
- Thomas Bentsen
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Abigail A Kressner
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Torsten Dau
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
- Tobias May
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
4. Chen F. Representing the intelligibility advantage of ideal binary masking with the most energetic channels. The Journal of the Acoustical Society of America 2016; 140:4161. [PMID: 28040025 DOI: 10.1121/1.4971206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/28/2023]
Abstract
This study investigates how the intelligibility advantage of ideal binary mask (IBM) processing in synthesizing speech is affected by the use of a small number of the most energetic channels. In experiment 1, IBM-processed Mandarin speech that had been corrupted by speech spectrum-shaped noise or two-talker babble was synthesized by using as few as four of the most energetic target-dominated channels at each frame. This approach provided intelligibility comparable to that of speech synthesized with all of the target-dominated channels. Experiments 2, 3, and 4 examined how the intelligibility advantage of IBM processing from experiment 1 was affected by the local SNR threshold, low-frequency region (LFR) cut-off frequency, and vowel-based segmentation, respectively. Experiments 2 and 3 showed that a threshold of 0 dB for local SNR and a cutoff of 3000 Hz for LFR were optimal choices for improving the intelligibility of IBM processing based on the most energetic channels. Experiment 4 found that the intelligibility advantage of IBM processing with the most energetic channels was preserved at the segmental level of vowel-only IBM-processed speech. Taken together, the results suggest that compared to IBM-processed speech synthesized with all of the target-dominated channels, Mandarin speech synthesized by selecting a small number of the most energetic target-dominated channels can achieve similar levels of intelligibility.
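The channel-selection idea can be sketched as follows: mark a channel as target-dominated when its local SNR exceeds a threshold, then keep only the few most energetic of those channels in each frame. This is an illustrative reading of the abstract, not the study's code. Ranking by target energy, the `energy[frame][channel]` layout, and the names `n_keep` and `lc_db` are assumptions.

```python
import math


def energetic_ibm(target_energy, masker_energy, n_keep=4, lc_db=0.0):
    """Binary mask keeping at most n_keep target-dominated channels per frame.

    target_energy / masker_energy: energy[frame][channel] of the premixed
    target and masker (known a priori, as in ideal binary masking).
    """
    mask = []
    for t_frame, m_frame in zip(target_energy, masker_energy):
        # Step 1: channels whose local SNR exceeds the threshold (in dB).
        dominated = []
        for ch, (t, m) in enumerate(zip(t_frame, m_frame)):
            if t <= 0:
                continue  # no target energy, cannot be target-dominated
            snr_db = math.inf if m <= 0 else 10.0 * math.log10(t / m)
            if snr_db > lc_db:
                dominated.append(ch)
        # Step 2: keep the n_keep most energetic of those channels.
        keep = sorted(dominated, key=lambda ch: t_frame[ch], reverse=True)[:n_keep]
        frame_mask = [0] * len(t_frame)
        for ch in keep:
            frame_mask[ch] = 1
        mask.append(frame_mask)
    return mask
```

With `n_keep` at least as large as the number of channels, this reduces to an ordinary ideal binary mask. The defaults mirror the values the abstract reports (four channels, 0 dB local SNR threshold).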
Affiliation(s)
- Fei Chen
  - Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
5. Mi J, Colburn HS. A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments. Trends Hear 2016; 20:2331216516669919. [PMID: 27698261 PMCID: PMC5051670 DOI: 10.1177/2331216516669919] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]
Abstract
Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC) processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model.
Affiliation(s)
- Jing Mi
  - Boston University, Boston, MA, USA
6. Kressner AA, May T, Rozell CJ. Outcome measures based on classification performance fail to predict the intelligibility of binary-masked speech. The Journal of the Acoustical Society of America 2016; 139:3033. [PMID: 27369123 DOI: 10.1121/1.4952439] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 06/06/2023]
Abstract
To date, the most commonly used outcome measure for assessing ideal binary mask estimation algorithms is based on the difference between the hit rate and the false alarm rate (H-FA). Recently, the error distribution has been shown to substantially affect intelligibility. However, H-FA treats each mask unit independently and does not take into account how errors are distributed. Alternatively, algorithms can be evaluated with the short-time objective intelligibility (STOI) metric using the reconstructed speech. This study investigates the ability of H-FA and STOI to predict intelligibility for binary-masked speech using masks with different error distributions. The results demonstrate the inability of H-FA to predict the behavioral intelligibility and also illustrate the limitations of STOI. Since every estimation algorithm will make errors that are distributed in different ways, performance evaluations should not be made solely on the basis of these metrics.
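The H-FA score described above can be computed in a few lines, which also makes its blind spot visible. The sketch below is illustrative only; the mask layout (a list of frames, each a list of 0/1 units with 1 meaning target-dominated) and the function name are assumptions.

```python
def hit_minus_false_alarm(ideal_mask, estimated_mask):
    """Return (hit_rate, false_alarm_rate, h_fa) for two binary masks.

    Hit rate: fraction of target-dominated units (ideal == 1) kept by the estimate.
    False-alarm rate: fraction of interferer-dominated units (ideal == 0)
    wrongly kept by the estimate.
    """
    hits = false_alarms = n_target = n_interferer = 0
    for ideal_frame, est_frame in zip(ideal_mask, estimated_mask):
        for ideal, est in zip(ideal_frame, est_frame):
            if ideal == 1:
                n_target += 1
                if est == 1:
                    hits += 1
            else:
                n_interferer += 1
                if est == 1:
                    false_alarms += 1
    hit_rate = hits / n_target if n_target else 0.0
    fa_rate = false_alarms / n_interferer if n_interferer else 0.0
    return hit_rate, fa_rate, hit_rate - fa_rate


ideal = [[1, 1, 0, 0], [1, 0, 0, 1]]
estimate = [[1, 0, 0, 1], [1, 0, 1, 1]]
print(hit_minus_false_alarm(ideal, estimate))  # (0.75, 0.5, 0.25)
```

Because every unit is counted independently, two estimated masks with the same number of misses and false alarms receive the same score regardless of whether those errors are scattered or clustered, which is the limitation the abstract points out.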
Affiliation(s)
- Abigail Anne Kressner
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Tobias May
  - Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Christopher J Rozell
  - School of Electrical and Computer Engineering, 777 Atlantic Drive NW, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
7. Kressner AA, Westermann A, Buchholz JM, Rozell CJ. Cochlear implant speech intelligibility outcomes with structured and unstructured binary mask errors. The Journal of the Acoustical Society of America 2016; 139:800-810. [PMID: 26936562 DOI: 10.1121/1.4941567] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
It has been shown that intelligibility can be improved for cochlear implant (CI) recipients with the ideal binary mask (IBM). In realistic scenarios where prior information is unavailable, however, the IBM must be estimated, and these estimations will inevitably contain errors. Although the effects of both unstructured and structured binary mask errors have been investigated with normal-hearing (NH) listeners, they have not been investigated with CI recipients. This study assesses these effects with CI recipients using masks that have been generated systematically with a statistical model. The results demonstrate that clustering of mask errors substantially decreases the tolerance of errors, that incorrectly removing target-dominated regions can be as detrimental to intelligibility as incorrectly adding interferer-dominated regions, and that the individual tolerances of the different types of errors can change when both are present. These trends follow those of NH listeners. However, analysis with a mixed effects model suggests that CI recipients tend to be less tolerant than NH listeners to mask errors in most conditions, at least with respect to the testing methods in each of the studies. This study clearly demonstrates that structure influences the tolerance of errors and therefore should be considered when analyzing binary-masking algorithms.
Affiliation(s)
- Abigail A Kressner
  - National Acoustic Laboratories, Australian Hearing, 16 University Avenue, Macquarie University, New South Wales 2109, Australia
- Adam Westermann
  - National Acoustic Laboratories, Australian Hearing, 16 University Avenue, Macquarie University, New South Wales 2109, Australia
- Jörg M Buchholz
  - National Acoustic Laboratories, Australian Hearing, 16 University Avenue, Macquarie University, New South Wales 2109, Australia
- Christopher J Rozell
  - School of Electrical and Computer Engineering, 777 Atlantic Drive Northwest, Georgia Institute of Technology, Atlanta, Georgia 30332, USA