1. Lee GW, Kim HK. Cluster-Based Pairwise Contrastive Loss for Noise-Robust Speech Recognition. Sensors (Basel) 2024; 24:2573. PMID: 38676191; PMCID: PMC11054889; DOI: 10.3390/s24082573.
Abstract
This paper addresses a joint training approach applied to a pipeline comprising speech enhancement (SE) and automatic speech recognition (ASR) models, where an acoustic tokenizer is included in the pipeline to convey linguistic information from the ASR model to the SE model. The acoustic tokenizer takes the outputs of the ASR encoder and provides pseudo-labels through K-means clustering. To transfer the linguistic information, represented by these pseudo-labels, from the acoustic tokenizer to the SE model, a cluster-based pairwise contrastive (CBPC) loss function is proposed; it is a self-supervised contrastive loss and is combined with an information noise contrastive estimation (infoNCE) loss function. This combined loss function prevents the SE model from overfitting to outlier samples and captures the pronunciation variability among samples with the same pseudo-label. The effectiveness of the proposed CBPC loss function is evaluated on a noisy LibriSpeech dataset by measuring both speech quality scores and the word error rate (WER). The experimental results reveal that the proposed joint training approach using the CBPC loss function achieves a lower WER than conventional joint training approaches. In addition, the speech quality scores of the SE model trained using the proposed approach are shown to be higher than those of the standalone SE model and of SE models trained using conventional joint training approaches. An ablation study is also conducted to investigate the effects of different combinations of loss functions on the speech quality scores and WER; it reveals that the proposed CBPC loss function combined with infoNCE contributes to a reduced WER and an increase in most of the speech quality scores.
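To make the combined objective concrete, the following is a minimal PyTorch-style sketch of an infoNCE term plus a cluster-based pairwise term driven by K-means pseudo-labels. It is an illustration under stated assumptions (L2-normalized encoder embeddings, one pseudo-label per sample), not the authors' implementation; all names are invented.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """Standard infoNCE: pull each anchor toward its positive, away from negatives."""
    pos = F.cosine_similarity(anchor, positive, dim=-1) / tau                 # (B,)
    neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1) / tau   # (B, N)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)                        # (B, 1+N)
    target = torch.zeros(anchor.size(0), dtype=torch.long)                    # positive is index 0
    return F.cross_entropy(logits, target)

def cluster_pairwise_contrastive(emb, pseudo_labels, tau=0.07):
    """Cluster-based pairwise term: samples sharing a K-means pseudo-label are
    mutual positives; every other sample in the batch acts as a negative."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t() / tau                                  # (B, B) similarities
    eye = torch.eye(len(emb), dtype=torch.bool)
    pos_mask = (pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    n_pos = pos_mask.sum(1).clamp(min=1)                       # avoid divide-by-zero rows
    return -(log_prob * pos_mask.float()).sum(1).div(n_pos).mean()

emb = torch.randn(8, 128)               # toy SE-encoder embeddings
labels = torch.randint(0, 3, (8,))      # toy K-means pseudo-labels
loss = cluster_pairwise_contrastive(emb, labels)
```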
Affiliation(s)
- Geon Woo Lee
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
- Hong Kook Kim
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Republic of Korea
- AunionAI Co., Ltd., Gwangju 61005, Republic of Korea
2. Zhang Z, Tian Y, Zhou T, Zhao Y, Zhang J, Li J. Towards an Environmentally Robust Speech Assistant System for Emergency Medical Services. Stud Health Technol Inform 2024; 310:1071-1075. PMID: 38269979; DOI: 10.3233/shti231129.
Abstract
Automated speech recognition technology with robust performance in various environments is highly needed by emergency clinicians, but there are few successful cases. One main challenge is the wide variety of environmental interference involved in a typical prehospital care emergency service, such as background noise and overlapping speech. To solve this problem, we try to establish an environmentally robust speech assistant system with the help of the proposed personalized speech enhancement (PSE) method, which utilizes the target physician's voiceprint feature to suppress non-target signal components. We demonstrate its potential value on both a general public test set and our real EMS test set by evaluating objective speech quality metrics, DNSMOS, and recognition accuracy. We hope the proposed method will raise EMS efficiency and security against non-target speech.
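The voiceprint-conditioned idea can be sketched as follows: a speaker embedding of the enrolled physician is broadcast over time and concatenated with the mixture's spectral frames, so the predicted mask favors the target voice. This is a generic PSE skeleton under assumed shapes and layer sizes, not the authors' system.

```python
import torch
import torch.nn as nn

class PersonalizedEnhancer(nn.Module):
    """Toy PSE sketch: a d-vector-style speaker embedding of the target talker
    is concatenated with each spectral frame, steering the mask toward the
    enrolled voice."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag, spk_emb):                # (B, T, F), (B, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_mag, cond], dim=-1))
        return mix_mag * torch.sigmoid(self.mask(h))    # masked magnitudes

net = PersonalizedEnhancer()
out = net(torch.rand(2, 100, 257), torch.randn(2, 128))   # (2, 100, 257)
```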
Affiliation(s)
- Zhenchuan Zhang
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Yu Tian
- Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Tianshu Zhou
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Yinghao Zhao
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Jungen Zhang
- Hangzhou Emergency Medical Center of Zhejiang Province, China
- Jingsong Li
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China
- Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
3. Koh HI, Na S, Kim MN. Speech Perception Improvement Algorithm Based on a Dual-Path Long Short-Term Memory Network. Bioengineering (Basel) 2023; 10:1325. PMID: 38002449; PMCID: PMC10669314; DOI: 10.3390/bioengineering10111325.
Abstract
Current deep learning-based speech enhancement methods focus on enhancing the time-frequency representation of the signal. However, conventional methods can lead to speech damage due to resolution-mismatch problems that emphasize only specific information in the time or frequency domain. To address these challenges, this paper introduces a speech enhancement model designed with a dual-path structure that identifies key speech characteristics in both the time and time-frequency domains. Specifically, the time path aims to model semantic features hidden in the waveform, while the time-frequency path attempts to compensate for the spectral details via a spectral extension block. The two paths enhance temporal and spectral features, respectively, via LSTM-based mask functions, offering a comprehensive approach to speech enhancement. Experimental results show that the proposed dual-path LSTM network consistently outperforms conventional single-domain speech enhancement methods in terms of speech quality and intelligibility.
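A toy skeleton of such a dual-path design is sketched below: one LSTM predicts a mask over framed waveform samples (time path) and another predicts a mask over STFT magnitudes (time-frequency path). The fusion of the two outputs, the spectral extension block, and all sizes are omitted or assumed; this is not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualPathLSTM(nn.Module):
    """Toy dual-path sketch: an LSTM-derived mask over framed waveform samples
    (time path) and another over STFT magnitudes (time-frequency path)."""
    def __init__(self, frame=512, hop=256, hidden=256):
        super().__init__()
        self.frame, self.hop = frame, hop
        self.time_lstm = nn.LSTM(frame, hidden, batch_first=True)
        self.time_mask = nn.Linear(hidden, frame)
        n_freq = frame // 2 + 1
        self.tf_lstm = nn.LSTM(n_freq, hidden, batch_first=True)
        self.tf_mask = nn.Linear(hidden, n_freq)

    def forward(self, wav):                                  # wav: (B, T)
        frames = wav.unfold(1, self.frame, self.hop)         # (B, L, frame)
        h, _ = self.time_lstm(frames)
        t_out = frames * torch.sigmoid(self.time_mask(h))   # masked time path
        spec = torch.stft(wav, self.frame, self.hop,
                          window=torch.hann_window(self.frame),
                          return_complex=True)
        mag = spec.abs().transpose(1, 2)                     # (B, L', n_freq)
        h, _ = self.tf_lstm(mag)
        tf_out = mag * torch.sigmoid(self.tf_mask(h))        # masked T-F path
        return t_out, tf_out                                 # fused downstream

out_t, out_tf = DualPathLSTM()(torch.randn(2, 16000))
```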
Affiliation(s)
- Hyeong Il Koh
- Department of Medical & Biological Engineering, Graduate School, Kyungpook National University, Daegu 41944, Republic of Korea
- Sungdae Na
- Department of Biomedical Engineering, Kyungpook National University Hospital, Daegu 41944, Republic of Korea
- Myoung Nam Kim
- Department of Biomedical Engineering, School of Medicine, Kyungpook National University, Daegu 41944, Republic of Korea
4. Song Y, Madhu N. Investigations on the Optimal Estimation of Speech Envelopes for the Two-Stage Speech Enhancement. Sensors (Basel) 2023; 23:6438. PMID: 37514732; PMCID: PMC10384514; DOI: 10.3390/s23146438.
Abstract
Using the source-filter model of speech production, clean speech signals can be decomposed into an excitation component and an envelope component that is related to the phoneme being uttered. Therefore, restoring the envelope of degraded speech during speech enhancement can improve the intelligibility and quality of the output. As the number of phonemes in spoken speech is limited, they can be adequately represented by a correspondingly limited number of envelopes. This can be exploited to improve the estimation of speech envelopes from a degraded signal in a data-driven manner. The improved envelopes are then used in a second stage to refine the final speech estimate. Envelopes are typically derived from the linear prediction coefficients (LPCs) or from the cepstral coefficients (CCs). The improved envelope is obtained either by mapping the degraded envelope onto pre-trained codebooks (classification approach) or by directly estimating it from the degraded envelope (regression approach). In this work, we first investigate the optimal features for envelope representation and codebook generation by a series of oracle tests. We demonstrate that CCs provide better envelope representation than LPCs. Further, we demonstrate that a unified speech codebook is advantageous compared to the typical codebook that manually splits speech and silence into separate entries. Next, we investigate low-complexity neural network architectures to map degraded envelopes to the optimal codebook entry in practical systems. We confirm that simple recurrent neural networks yield good performance with low complexity and a small number of parameters. We also demonstrate that, with a careful choice of feature and architecture, a regression approach can further improve performance at a lower computational cost. However, as also seen from the oracle tests, the benefit of the two-stage framework is now chiefly limited by the statistical noise floor estimate, leading to only a limited improvement in extremely adverse conditions. This highlights the need for further research on the joint estimation of speech and noise for optimal enhancement.
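As an illustration of the envelope representation discussed here, the sketch below derives a smooth spectral envelope by low-quefrency liftering of the real cepstrum, i.e., keeping only the first few cepstral coefficients; the cutoff and FFT size are arbitrary choices, not the paper's settings.

```python
import numpy as np

def cepstral_envelope(frame, n_ceps=20, n_fft=512):
    """Smooth spectral envelope via low-quefrency liftering of the real cepstrum.
    Keeping only the first n_ceps coefficients discards excitation detail and
    retains the phoneme-related envelope."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) + 1e-12
    ceps = np.fft.irfft(np.log(spec), n_fft)       # real cepstrum
    lifter = np.zeros(n_fft)
    lifter[:n_ceps] = 1.0
    lifter[-(n_ceps - 1):] = 1.0                   # keep symmetric low quefrencies
    env = np.exp(np.fft.rfft(ceps * lifter, n_fft).real)
    return env                                     # (n_fft//2 + 1,) envelope

env = cepstral_envelope(np.random.randn(512))      # envelope of a toy frame
```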
Affiliation(s)
- Yanjue Song
- IDLab, Ghent University-imec, 9000 Gent, Belgium
- Nilesh Madhu
- IDLab, Ghent University-imec, 9000 Gent, Belgium
5. Rascon C. Characterization of Deep Learning-Based Speech-Enhancement Techniques in Online Audio Processing Applications. Sensors (Basel) 2023; 23:4394. PMID: 37177598; PMCID: PMC10181690; DOI: 10.3390/s23094394.
Abstract
Deep learning-based speech-enhancement techniques have recently been an area of growing interest, since their impressive performance can potentially benefit a wide variety of digital voice communication systems. However, such performance has been evaluated mostly in offline audio-processing scenarios (i.e., feeding the model, in one go, a complete audio recording, which may extend several seconds). It is of significant interest to evaluate and characterize the current state of the art in applications that process audio online (i.e., feeding the model a sequence of segments of audio data and concatenating the results at the output end). Although evaluations and comparisons between speech-enhancement techniques have been carried out before, as far as the author knows, the work presented here is the first to evaluate the performance of such techniques in relation to their online applicability. Specifically, this work measures how the output signal-to-interference ratio (as a separation metric), the response time, and the memory usage (as online metrics) are impacted by the input length (the size of the audio segments), in addition to the amount of noise, the amount and number of interferences, and the amount of reverberation. Three popular models, chosen for their availability in public repositories and their online viability, were evaluated: MetricGAN+, Spectral Feature Mapping with Mimic Loss, and Demucs-Denoiser. The characterization was carried out using a systematic evaluation protocol based on the SpeechBrain framework. Several intuitions are presented and discussed, and some recommendations for future work are proposed.
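The online evaluation protocol described above can be approximated with a loop of the following shape, which feeds fixed-size segments to a model, concatenates the outputs, and reports a real-time factor; the identity "model" is a stand-in for an actual enhancer.

```python
import time
import numpy as np

def run_online(model, audio, seg_len, sr=16000):
    """Feed audio to `model` in consecutive segments and time each call;
    real-time operation requires processing time < seg_len / sr."""
    out, times = [], []
    for start in range(0, len(audio) - seg_len + 1, seg_len):
        seg = audio[start:start + seg_len]
        t0 = time.perf_counter()
        out.append(model(seg))                 # enhance one segment
        times.append(time.perf_counter() - t0)
    rtf = np.mean(times) / (seg_len / sr)      # real-time factor (<1 is viable)
    return np.concatenate(out), rtf

enhanced, rtf = run_online(lambda x: x, np.random.randn(16000 * 4), seg_len=4096)
print(f"real-time factor: {rtf:.3f}")
```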
Affiliation(s)
- Caleb Rascon
- Computer Science Department, Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas, Universidad Nacional Autonoma de Mexico, Mexico City 3000, Mexico
6. Chen H, Zhang X. CGA-MGAN: Metric GAN Based on Convolution-Augmented Gated Attention for Speech Enhancement. Entropy (Basel) 2023; 25:628. PMID: 37190416; PMCID: PMC10137386; DOI: 10.3390/e25040628.
Abstract
In recent years, neural networks based on attention mechanisms have seen increasing use in speech recognition, separation, and enhancement, as well as other fields. In particular, the convolution-augmented transformer has performed well, as it can combine the advantages of convolution and self-attention. Recently, the gated attention unit (GAU) was proposed; compared with traditional multi-head self-attention, approaches with GAU are effective and computationally efficient. In this paper, we propose a network for speech enhancement called CGA-MGAN, a kind of MetricGAN based on convolution-augmented gated attention. CGA-MGAN captures local and global correlations in speech signals at the same time by fusing convolution and gated attention units. Experiments on Voice Bank + DEMAND show that our proposed CGA-MGAN model achieves excellent performance (3.47 PESQ, 0.96 STOI, and 11.09 dB SSNR) with a relatively small model size (1.14 M).
Affiliation(s)
- Haozhe Chen
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
- Key Laboratory of Electromagnetic Radiation and Sensing Technology, Chinese Academy of Sciences, Beijing 100190, China
- School of Electronic, Electrical, and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
- Xiaojuan Zhang
- Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
- Key Laboratory of Electromagnetic Radiation and Sensing Technology, Chinese Academy of Sciences, Beijing 100190, China
7. Chai S, Guo C, Guan C, Fang L. Deep Learning-Based Speech Enhancement of an Extrinsic Fabry-Perot Interferometric Fiber Acoustic Sensor System. Sensors (Basel) 2023; 23:3574. PMID: 37050634; PMCID: PMC10098526; DOI: 10.3390/s23073574.
Abstract
To achieve high-quality voice communication without noise interference in flammable, explosive, and strong electromagnetic environments, the speech enhancement technology of a fiber-optic extrinsic Fabry-Perot interferometric (EFPI) acoustic sensor based on deep learning is studied in this paper. The combination of a complex-valued convolutional neural network and a long short-term memory (CV-CNN-LSTM) model is proposed for speech enhancement in the EFPI acoustic sensing system. Moreover, the 3 × 3 coupler algorithm is used to demodulate voice signals. Then, the short-time Fourier transform (STFT) spectrogram features of the voice signals are divided into a training set and a test set. The training set is input into the established CV-CNN-LSTM model for model training, and the test set is input into the trained model for testing. The experimental findings reveal that the proposed CV-CNN-LSTM model demonstrates exceptional speech enhancement performance, boasting an average Perceptual Evaluation of Speech Quality (PESQ) score of 3.148. In comparison to the CV-CNN and CV-LSTM models, this innovative model achieves a remarkable PESQ score improvement of 9.7% and 11.4%, respectively. Furthermore, the average Short-Time Objective Intelligibility (STOI) score witnesses significant enhancements of 4.04 and 2.83 when contrasted with the CV-CNN and CV-LSTM models, respectively.
Affiliation(s)
- Shiyi Chai
- School of Science, Hubei University of Technology, Wuhan 430068, China
- Hubei Engineering Technology Research Center of Energy Photoelectric Device and System, Hubei University of Technology, Wuhan 430068, China
- Can Guo
- School of Science, Hubei University of Technology, Wuhan 430068, China
- Hubei Engineering Technology Research Center of Energy Photoelectric Device and System, Hubei University of Technology, Wuhan 430068, China
- Chenggang Guan
- Hubei Engineering Technology Research Center of Energy Photoelectric Device and System, Hubei University of Technology, Wuhan 430068, China
- Li Fang
- School of Science, Hubei University of Technology, Wuhan 430068, China
8. Pandey A, Wang D. Attentive Training: A New Training Framework for Speech Enhancement. IEEE/ACM Trans Audio Speech Lang Process 2023; 31:1360-1370. PMID: 37899765; PMCID: PMC10602021; DOI: 10.1109/taslp.2023.3260711.
Abstract
Dealing with speech interference in a speech enhancement system requires either speaker separation or target speaker extraction. Speaker separation has multiple output streams with arbitrary assignments, while target speaker extraction requires additional cueing for speaker selection. Neither is suitable for a standalone speech enhancement system with one output stream. In this study, we propose a novel training framework, called attentive training, to extend speech enhancement to deal with speech interruptions. Attentive training is based on the observation that, in the real world, multiple talkers are very unlikely to start speaking at the same time; therefore, a deep neural network can be trained to create a representation of the first speaker and utilize it to attend to, or track, that speaker in a multitalker noisy mixture. We present experimental results and comparisons to demonstrate the effectiveness of attentive training for speech enhancement.
Affiliation(s)
- Ashutosh Pandey
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
9. Tan K, Mao W, Guo X, Lu H, Zhang C, Cao Z, Wang X. CST: Complex Sparse Transformer for Low-SNR Speech Enhancement. Sensors (Basel) 2023; 23:2376. PMID: 36904579; PMCID: PMC10007472; DOI: 10.3390/s23052376.
Abstract
Speech enhancement for audio with a low SNR is challenging. Existing speech enhancement methods are mainly designed for high-SNR audio and usually use RNNs to model audio sequence features, which prevents the model from learning long-distance dependencies and thus limits its performance in low-SNR speech enhancement tasks. We design a complex transformer module with sparse attention to overcome this problem. Unlike the traditional transformer model, ours is extended to effectively model complex-domain sequences: a sparse attention mask balances the model's attention to long-distance and nearby relations, a pre-layer positional embedding module enhances the model's perception of position information, and a channel attention module enables the model to dynamically adjust the weight distribution between channels according to the input audio. The experimental results show that, in low-SNR speech enhancement tests, our models achieve noticeable improvements in speech quality and intelligibility.
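A sparse attention mask of the kind described, balancing nearby and long-distance relations, can be built as a banded-plus-strided boolean pattern; the band width and stride below are illustrative, not the paper's values.

```python
import numpy as np

def sparse_attention_mask(seq_len, local_width=8, stride=32):
    """Boolean mask: True where attention is allowed. Each frame attends to a
    local band of neighbors (nearby relations) plus every stride-th frame
    (long-distance relations), instead of the full seq_len x seq_len grid."""
    idx = np.arange(seq_len)
    local = np.abs(idx[:, None] - idx[None, :]) <= local_width
    strided = (idx[None, :] % stride) == 0
    return local | strided

mask = sparse_attention_mask(128)
print(mask.mean())   # fraction of attention links retained
```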
Affiliation(s)
- Kaijun Tan
- Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
- University of Chinese Academy of Sciences, Beijing 100089, China
- Wenyu Mao
- Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
- Chinese Association of Artificial Intelligence, Beijing 100876, China
- Xiaozhou Guo
- Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
- University of Chinese Academy of Sciences, Beijing 100089, China
- Huaxiang Lu
- Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China
- University of Chinese Academy of Sciences, Beijing 100089, China
- Materials and Optoelectronics Research Center, University of Chinese Academy of Sciences, Beijing 100083, China
- College of Microelectronics, University of Chinese Academy of Sciences, Beijing 100083, China
- Semiconductor Neural Network Intelligent Perception and Computing Technology Beijing Key Laboratory, Beijing 100083, China
- Chi Zhang
- Nanjing Research Institute of Information Technology, Nanjing 210009, China
- Zhanzhong Cao
- Nanjing Research Institute of Information Technology, Nanjing 210009, China
- Xingang Wang
- Nanjing Research Institute of Information Technology, Nanjing 210009, China
10. Ye M, Wan H. Improved Transformer-Based Dual-Path Network with Amplitude and Complex Domain Feature Fusion for Speech Enhancement. Entropy (Basel) 2023; 25:228. PMID: 36832595; PMCID: PMC9955017; DOI: 10.3390/e25020228.
Abstract
Most previous speech enhancement methods predict only amplitude features, but more and more studies have shown that phase information is crucial for speech quality. Recently, some methods have adopted complex features, but complex masks are difficult to estimate. Removing noise while maintaining good speech quality at low signal-to-noise ratios remains a problem. This study proposes a dual-path network structure for speech enhancement that models complex spectra and amplitudes simultaneously, and introduces an attention-aware feature fusion module to fuse the two features and facilitate overall spectrum recovery. In addition, we improve a transformer-based feature extraction module that can efficiently extract local and global features. The proposed network achieves better performance than the baseline models in experiments on the Voice Bank + DEMAND dataset. We also conducted ablation experiments to verify the effectiveness of the dual-path structure, the improved transformer, and the fusion module, and investigated the effect of the input-mask multiplication strategy on the results.
11. Zheng C, Zhang H, Liu W, Luo X, Li A, Li X, Moore BCJ. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear 2023; 27:23312165231209913. PMID: 37956661; PMCID: PMC10658184; DOI: 10.1177/23312165231209913.
Abstract
Frequency-domain monaural speech enhancement has been extensively studied for over 60 years, and a great number of methods have been proposed and applied to many devices. In the last decade, monaural speech enhancement has made tremendous progress with the advent and development of deep learning, and performance using such methods has been greatly improved relative to traditional methods. This survey paper first provides a comprehensive overview of traditional and deep-learning methods for monaural speech enhancement in the frequency domain. The fundamental assumptions of each approach are then summarized and analyzed to clarify their limitations and advantages. A comprehensive evaluation of some typical methods was conducted using the WSJ + Deep Noise Suppression (DNS) challenge and Voice Bank + DEMAND datasets to give an intuitive and unified comparison. The benefits of monaural speech enhancement methods using objective metrics relevant for normal-hearing and hearing-impaired listeners were evaluated. The objective test results showed that compression of the input features was important for simulated normal-hearing listeners but not for simulated hearing-impaired listeners. Potential future research and development topics in monaural speech enhancement are suggested.
Affiliation(s)
- Chengshi Zheng
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Huiyong Zhang
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Wenzhe Liu
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiaoxue Luo
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Andong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiaodong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Brian C. J. Moore
- Cambridge Hearing Group, Department of Psychology, University of Cambridge, Cambridge, UK
12. Hao X, Zhu D, Wang X, Yang L, Zeng H. A Speech Enhancement Algorithm for Speech Reconstruction Based on Laser Speckle Images. Sensors (Basel) 2022; 23:330. PMID: 36616925; PMCID: PMC9823416; DOI: 10.3390/s23010330.
Abstract
In the optical system for reconstructing speech signals based on laser speckle images, resonance between the sound source and nearby objects leads to a frequency response problem, which seriously affects the accuracy of the reconstructed speech. In this paper, we propose a speech enhancement algorithm to reduce this frequency response. The results show that, after using the speech enhancement algorithm, the frequency-spectrum correlation coefficient between the reconstructed and original sinusoidal signals improves by up to 82.45%, and that for real speech signals improves by up to 56.40%. This proves that the speech enhancement algorithm is a valuable tool for solving the frequency response problem and improving the accuracy of reconstructed speech.
Affiliation(s)
- Xueying Hao
- Graduate Department, Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China
- Dali Zhu
- Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
- School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
- Xianlan Wang
- Graduate Department, Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China
- Long Yang
- Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
- School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
- Hualin Zeng
- Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
- School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
13. Liu TH, Chi JZ, Wu BL, Chen YS, Huang CH, Chu YS. Design and Implementation of Machine Tool Life Inspection System Based on Sound Sensing. Sensors (Basel) 2022; 23:284. PMID: 36616882; PMCID: PMC9823646; DOI: 10.3390/s23010284.
Abstract
The main causes of damage to industrial machinery are aging, corrosion, and the wear of parts, which affect the accuracy of machinery and product precision. Identifying problems early and predicting a machine's life cycle for early maintenance can avoid costly plant failures. Compared with other sensing and monitoring instruments, sound sensors are inexpensive, portable, and require less computational data. This paper proposes a machine tool life cycle model with noise reduction. The life cycle model uses Mel-Frequency Cepstral Coefficients (MFCC) to extract audio features and a Deep Neural Network (DNN) to learn the relationship between audio features and life cycle, determining the audio signal corresponding to the degree of aging. The noise reduction model simulates the actual environment by adding noise, extracts features using Power-Normalized Cepstral Coefficients (PNCC), and designs a mask as the DNN's learning target to eliminate the effect of noise. The noise reduction model improves Short-Time Objective Intelligibility (STOI) by 6.8% and Perceptual Evaluation of Speech Quality (PESQ) by 3.9%. The life cycle model's accuracy before denoising is 76%; after adding the noise reduction system, it increases to 80%.
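A minimal sketch of the feature side of such a life-cycle model is shown below: clip-level MFCC statistics feeding a small dense network. The layer sizes, the number of aging stages, and the use of librosa are assumptions made for illustration, not the authors' configuration.

```python
import numpy as np
import librosa
import torch.nn as nn

sr = 16000
y = np.random.randn(2 * sr).astype(np.float32)          # stand-in machine-sound clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # (13, n_frames)
feat = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 26-dim clip vector

# small DNN mapping the clip-level feature to, e.g., four hypothetical aging stages
dnn = nn.Sequential(nn.Linear(26, 64), nn.ReLU(), nn.Linear(64, 4))
```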
Affiliation(s)
- Tsung-Hsien Liu
- Communications Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
- Jun-Zhe Chi
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
- Bo-Lin Wu
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
- Yee-Shao Chen
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
- Chung-Hsun Huang
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
- Yuan-Sun Chu
- Electrical Engineering Department, National Chung Cheng University, Chiayi 62102, Taiwan
14. Wang H, Zhang X, Wang D. Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement. IEEE/ACM Trans Audio Speech Lang Process 2022; 30:3134-3143. PMID: 37124143; PMCID: PMC10147322; DOI: 10.1109/taslp.2022.3209943.
Abstract
Speech enhancement aims to improve the listening quality and intelligibility of noisy speech in adverse environments. It proves to be challenging to perform speech enhancement in very low signal-to-noise ratio (SNR) conditions. Conventional speech enhancement utilizes air-conduction (AC) microphones, which are sensitive to background noise but capable of capturing full-band signals. On the other hand, bone-conduction (BC) sensors are unaffected by acoustic noise, but the recorded speech has limited bandwidth. This study proposes an attention-based fusion method to combine the strengths of AC and BC signals and perform complex spectral mapping for speech enhancement. Experiments on the EMSB dataset demonstrate that the proposed approach effectively leverages the advantages of AC and BC sensors and outperforms a recent time-domain baseline in all conditions. We also show that the sensor fusion method is superior to its single-sensor counterparts, especially in low-SNR conditions. As the amount of BC data is very limited, we additionally propose a semi-supervised technique that utilizes both parallel and non-parallel recordings of AC and BC speech. With additional AC speech from the AISHELL-1 dataset, we achieve performance similar to supervised learning with only 50% of the parallel data.
Affiliation(s)
- Heming Wang
- Department of Computer Science and Engineering, The Ohio State University, OH 43210 USA
- Xueliang Zhang
- Department of Computer Science, Inner Mongolia University, Hohhot 010021, China
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA
15. Lee GW, Kim HK. Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition. Sensors (Basel) 2022; 22:5381. PMID: 35891070; PMCID: PMC9324918; DOI: 10.3390/s22145381.
Abstract
In this paper, a new two-step joint optimization approach based on the asynchronous subregion optimization method is proposed for training a pipeline model composed of two different models. The first step of the proposed joint optimization approach trains the front-end model only, and the second step trains all the parameters of the combined model together. In the asynchronous subregion optimization method, the first step supports only the goal of the front-end model; in contrast, the first step of the proposed approach works with a new loss function that makes the front-end model support the goal of the back-end model. The proposed optimization approach was applied here to a pipeline composed of a deep complex convolutional recurrent network (DCCRN)-based speech enhancement model and a conformer-transducer-based ASR model as the front-end and back-end, respectively. The performance of the proposed two-step joint optimization approach was then evaluated on the LibriSpeech automatic speech recognition (ASR) corpus in noisy environments by measuring the character error rate (CER) and word error rate (WER). In addition, an ablation study was carried out to examine the effectiveness of the proposed optimization approach on each of the processing blocks in the conformer-transducer ASR model. The ablation study showed that the conformer-transducer-based ASR model with the joint network trained only by the proposed optimization approach achieved the lowest average CER and WER. Moreover, the proposed optimization approach reduced the average CER and WER on the Test-Noisy dataset under matched noise conditions by 0.30% and 0.48%, respectively, compared to the separate optimization of speech enhancement and ASR. Compared to the conventional two-step joint optimization approach, the proposed approach provided average CER and WER reductions of 0.22% and 0.31%, respectively. Under mismatched noise conditions, the proposed approach achieved an average CER and WER lower than those of the conventional optimization approach by 0.32% and 0.43%, respectively.
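Schematically, the two-step schedule can be written as below: step one updates only the speech enhancement (SE) front-end with a loss that also reflects the back-end's goal, and step two fine-tunes the whole pipeline. The models, loss callables, and epoch counts are stand-ins, not the paper's configuration.

```python
import torch

def two_step_joint_training(se, asr, loader, aux_loss, asr_loss, epochs=(5, 10)):
    """Step 1: train the SE front-end alone with an ASR-aware auxiliary loss.
    Step 2: unfreeze everything and train the pipeline jointly."""
    opt1 = torch.optim.Adam(se.parameters(), lr=1e-3)
    for _ in range(epochs[0]):                           # step 1: front-end only
        for noisy, clean, text in loader:
            enhanced = se(noisy)
            loss = aux_loss(enhanced, clean, asr, text)  # SE loss + ASR-aware term
            opt1.zero_grad(); loss.backward(); opt1.step()

    params = list(se.parameters()) + list(asr.parameters())
    opt2 = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs[1]):                           # step 2: joint fine-tuning
        for noisy, clean, text in loader:
            loss = asr_loss(asr(se(noisy)), text)
            opt2.zero_grad(); loss.backward(); opt2.step()
```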
Affiliation(s)
- Geon Woo Lee
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Korea
- Hong Kook Kim
- AI Graduate School, Gwangju Institute of Science and Technology, Gwangju 61005, Korea
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea
16. Tao T, Zheng H, Yang J, Guo Z, Zhang Y, Ao J, Chen Y, Lin W, Tan X. Sound Localization and Speech Enhancement Algorithm Based on Dual-Microphone. Sensors (Basel) 2022; 22:715. PMID: 35161469; PMCID: PMC8840739; DOI: 10.3390/s22030715.
Abstract
To reduce the complexity and cost of the microphone array, this paper proposes a dual-microphone-based sound localization and speech enhancement algorithm. Based on the time delay estimation of the signals received by the two microphones, the paper combines energy-difference estimation and controllable beam response power to compute the 3D coordinates of the acoustic source and realize dual-microphone sound localization. Based on the azimuth angle of the acoustic source and the analysis of the independent quantity of the speech signal, separation of the speaker's signal is realized. On this basis, post-Wiener filtering is used to amplify and suppress the voice signal of the speaker, which helps to achieve speech enhancement. Experimental results show that the dual-microphone sound localization algorithm proposed in this paper can accurately identify the sound location, and the speech enhancement algorithm is more robust and adaptable than the original algorithm.
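The abstract does not spell out the time-delay estimator; a common choice for dual-microphone delay estimation, shown purely for illustration, is GCC-PHAT:

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Time-delay estimate between two microphone signals via GCC-PHAT: the
    cross-spectrum is whitened to unit magnitude, and the peak of its inverse
    FFT gives the delay."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs    # delay in seconds

tau = gcc_phat(np.random.randn(4096), np.random.randn(4096), fs=16000)
```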
17. Ali MN, Falavigna D, Brutti A. Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models. Sensors (Basel) 2022; 22:374. PMID: 35009917; DOI: 10.3390/s22010374.
Abstract
Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases speech quality from an intelligibility perspective, it tends to introduce distortions that deteriorate the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, optimizing a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Differently from typical state-of-the-art approaches operating on spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As the application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net, while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, providing insight into the most promising training approaches.
18. Kang Y, Zheng N, Meng Q. Deep Learning-Based Speech Enhancement With a Loss Trading Off the Speech Distortion and the Noise Residue for Cochlear Implants. Front Med (Lausanne) 2021; 8:740123. PMID: 34820392; PMCID: PMC8606413; DOI: 10.3389/fmed.2021.740123.
Abstract
The cochlea plays a key role in the transmission from acoustic vibration to the neural stimulation upon which the brain perceives sound. A cochlear implant (CI) is an auditory prosthesis that replaces damaged cochlear hair cells to achieve acoustic-to-neural conversion. However, the CI is a very coarse bionic imitation of the normal cochlea. The highly resolved time-frequency-intensity information transmitted by the normal cochlea, which is vital to high-quality auditory perception such as speech perception in challenging environments, cannot be guaranteed by CIs. Although CI recipients with state-of-the-art commercial CI devices achieve good speech perception in quiet backgrounds, they usually suffer from poor speech perception in noisy environments. Therefore, noise suppression or speech enhancement (SE) is one of the most important technologies for CIs. In this study, we introduce recent progress in deep learning (DL), mostly neural network (NN)-based SE front ends for CIs, and discuss how the hearing properties of CI recipients can be utilized to optimize DL-based SE. In particular, different loss functions are introduced to supervise the NN training, and a set of objective and subjective experiments is presented. The results verify that CI recipients are more sensitive to residual noise than to SE-induced speech distortion, which has been common knowledge in CI research. Furthermore, speech reception threshold (SRT) in noise tests demonstrate that the intelligibility of the denoised speech can be significantly improved when the NN is trained with a loss function biased toward more noise suppression rather than one paying equal attention to noise residue and speech distortion.
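The distortion/residue trade-off described can be expressed as a weighted sum over a time-frequency mask; in the sketch below, alpha > 0.5 biases training toward noise suppression, in the spirit of the paper's finding. The exact loss used by the authors may differ.

```python
import torch

def tradeoff_loss(mask, mag_clean, mag_noise, alpha=0.7):
    """Weighted loss over a T-F mask: speech distortion is the clean energy the
    mask removes; noise residue is the noise energy it lets through. alpha > 0.5
    weights noise suppression more heavily than distortion."""
    speech_distortion = ((1 - mask) * mag_clean).pow(2).mean()
    noise_residue = (mask * mag_noise).pow(2).mean()
    return (1 - alpha) * speech_distortion + alpha * noise_residue

loss = tradeoff_loss(torch.rand(4, 100, 257), torch.rand(4, 100, 257), torch.rand(4, 100, 257))
```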
Affiliation(s)
- Yuyong Kang
- Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Nengheng Zheng
- Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Pengcheng Laboratory, Shenzhen, China
- Qinglin Meng
- Acoustics Laboratory, School of Physics and Optoelectronics, South China University of Technology, Guangzhou, China
19. Gnanamanickam J, Natarajan Y, K R SP. A Hybrid Speech Enhancement Algorithm for Voice Assistance Application. Sensors (Basel) 2021; 21:7025. PMID: 34770332; DOI: 10.3390/s21217025.
Abstract
In recent years, speech recognition technology has become increasingly common. Speech quality and intelligibility are critical for the convenience and accuracy of information transmission in speech recognition. The speech processing systems used to converse or store speech are usually designed for an environment without any background noise. However, in a real-world atmosphere, background interference in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and exhausting the listener. When a communication system's input or output signals are affected by noise, speech enhancement techniques try to improve its performance. To ensure the correctness of the text produced from speech, it is necessary to reduce the external noise in the speech audio. Reducing the external noise in audio is difficult, as the speech can consist of single, continuous, or spontaneous words. In automatic speech recognition, various typical speech enhancement algorithms are available and have gained considerable attention; however, these algorithms work well only on simple and continuous audio signals. Thus, in this study, a hybridized speech recognition algorithm to enhance speech recognition accuracy is proposed. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the Hidden Markov Model and tested with 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with those of various typical speech enhancement algorithms, such as the iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction. The proposed cascaded hybrid algorithm achieves a minimum word error rate of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. Cascading the speech enhancement and speech-to-text conversion architectures results in higher accuracy for enhanced speech recognition. The evaluation results support incorporating the proposed method into real-time medical automatic speech recognition applications, where the complexity of the terms involved is high.
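For reference, the spectral-subtraction family at the core of the hybrid works frame by frame roughly as below; the over-subtraction factor, the spectral floor, and the omission of the HMM stage are simplifications, not the authors' exact algorithm.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_mag, over_sub=2.0, floor=0.02):
    """Magnitude-domain spectral subtraction on one complex STFT frame:
    subtract an over-estimated noise magnitude, clamp to a spectral floor
    (the nonlinear step), and keep the noisy phase."""
    mag, phase = np.abs(noisy_frame), np.angle(noisy_frame)
    cleaned = mag - over_sub * noise_mag
    cleaned = np.maximum(cleaned, floor * mag)   # floor limits musical noise
    return cleaned * np.exp(1j * phase)

frame = np.fft.rfft(np.random.randn(512))
out = spectral_subtraction(frame, noise_mag=0.1 * np.ones(len(frame)))
```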
20. Kuruvila I, Muncke J, Fischer E, Hoppe U. Extracting the Auditory Attention in a Dual-Speaker Scenario From EEG Using a Joint CNN-LSTM Model. Front Physiol 2021; 12:700655. PMID: 34408661; PMCID: PMC8365753; DOI: 10.3389/fphys.2021.700655.
Abstract
The human brain performs remarkably well in segregating a particular speaker from interfering ones in a multispeaker scenario. We can quantitatively evaluate this segregation capability by modeling the relationship between the speech signals present in an auditory scene and the listener's cortical signals measured using electroencephalography (EEG). This has opened up avenues to integrate neuro-feedback into hearing aids, where the device can infer the user's attention and enhance the attended speaker. Commonly used algorithms to infer auditory attention are based on linear systems theory, where cues such as speech envelopes are mapped onto the EEG signals. Here, we present a joint convolutional neural network (CNN)-long short-term memory (LSTM) model to infer auditory attention. Our joint CNN-LSTM model takes the EEG signals and the spectrogram of the multiple speakers as inputs and classifies the attention to one of the speakers. We evaluated the reliability of our network using three different datasets comprising 61 subjects, where each subject undertook a dual-speaker experiment. The three datasets analyzed corresponded to speech stimuli presented in three different languages, namely German, Danish, and Dutch. Using the proposed joint CNN-LSTM model, we obtained a median decoding accuracy of 77.2% at a trial duration of 3 s. Furthermore, we evaluated the amount of sparsity that the model can tolerate by means of magnitude pruning and found a tolerance of up to 50% sparsity without substantial loss of decoding accuracy.
Affiliation(s)
- Ivine Kuruvila
- Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Jan Muncke
- Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
- Ulrich Hoppe
- Department of Audiology, ENT-Clinic, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
21. Chu K, Collins L, Mainsah B. A Causal Deep Learning Framework for Classifying Phonemes in Cochlear Implants. Proc IEEE Int Conf Acoust Speech Signal Process 2021; 2021:6498-6502. PMID: 34512195; PMCID: PMC8425961; DOI: 10.1109/icassp39728.2021.9413986.
Abstract
Speech intelligibility in cochlear implant (CI) users degrades considerably in listening environments with reverberation and noise. Previous research in automatic speech recognition (ASR) has shown that phoneme-based speech enhancement algorithms improve ASR system performance in reverberant environments as compared to a global model. However, phoneme-specific speech processing has not yet been implemented in CIs. In this paper, we propose a causal deep learning framework for classifying phonemes using features extracted at the time-frequency resolution of a CI processor. We trained and tested long short-term memory networks to classify phonemes and manner of articulation in anechoic and reverberant conditions. The results showed that CI-inspired features provide slightly higher levels of performance than traditional ASR features. To the best of our knowledge, this study is the first to provide a classification framework with the potential to categorize phonetic units in real-time in a CI.
Affiliation(s)
- Kevin Chu
- Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
- Leslie Collins
- Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
- Boyla Mainsah
- Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA
22. Tan K, Wang D.
Abstract
The use of deep neural networks (DNNs) has dramatically elevated the performance of speech enhancement over the last decade. However, achieving strong enhancement performance typically requires a large DNN, which is both memory- and computation-intensive, making it difficult to deploy such speech enhancement systems on devices with limited hardware resources or in applications with strict latency requirements. In this study, we propose two compression pipelines to reduce the model size for DNN-based speech enhancement, which incorporate three different techniques: sparse regularization, iterative pruning, and clustering-based quantization. We systematically investigate these techniques and evaluate the proposed compression pipelines. Experimental results demonstrate that our approach reduces the sizes of four different models by large margins without significantly sacrificing their enhancement performance. In addition, we find that the proposed approach performs well on speaker separation, which further demonstrates its effectiveness for compressing speech separation models.
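The pruning and clustering-based quantization steps can be sketched in a few lines of NumPy; the threshold, cluster count, and the retraining that normally accompanies iterative pruning are omitted here.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Zero the smallest-magnitude weights (iterated with retraining in practice)."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def kmeans_quantize(w, n_clusters=16, iters=20):
    """Clustering-based quantization: nonzero weights share k centroid values,
    so each weight is stored as a small cluster index plus a codebook."""
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(iters):
        assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centroids[k] = nz[assign == k].mean()
    wq = w.copy()
    wq[w != 0] = centroids[assign]
    return wq

wq = kmeans_quantize(magnitude_prune(np.random.randn(1024)))
```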
Affiliation(s)
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210-1277 USA
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA
23. Zhou Y, Wang H, Chu Y, Liu H. A Robust Dual-Microphone Generalized Sidelobe Canceller Using a Bone-Conduction Sensor for Speech Enhancement. Sensors (Basel) 2021; 21:1878. PMID: 33800201; PMCID: PMC7962448; DOI: 10.3390/s21051878.
Abstract
The use of multiple spatially distributed microphones allows spatial filtering to be performed along with conventional temporal filtering, which can better reject interference signals, leading to an overall improvement in speech quality. In this paper, we propose a novel dual-microphone generalized sidelobe canceller (GSC) algorithm assisted by a bone-conduction (BC) sensor for speech enhancement, named the BC-assisted GSC (BCA-GSC) algorithm. The BC sensor is relatively insensitive to ambient noise compared to a conventional air-conduction (AC) microphone. Hence, BC speech can be analyzed to generate very accurate voice activity detection (VAD), even in a high-noise environment. The proposed algorithm incorporates the VAD information obtained from the BC speech into the adaptive blocking matrix (ABM) and the adaptive noise canceller (ANC) in the GSC. By using the VAD to control the ABM and combining the VAD with the signal-to-interference ratio (SIR) to control the ANC, the proposed method suppresses interference and improves the overall performance of the GSC significantly. Experiments verify that the proposed GSC system not only improves speech quality remarkably but also boosts speech intelligibility.
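In schematic form, a VAD-gated dual-microphone GSC update looks as follows, with a fixed beamformer, a blocking branch, and an NLMS noise canceller adapted only when the BC-derived VAD flags speech absence; this scalar sketch ignores the SIR-based control and the filter lengths of the actual BCA-GSC.

```python
import numpy as np

def gsc_step(x_main, x_ref, w_anc, vad, mu=0.01):
    """One simplified dual-mic GSC update: fixed beam = mean of the two mics,
    blocking branch = their difference, NLMS canceller adapted only when the
    bone-conduction VAD reports no target speech (avoiding target leakage)."""
    fixed = 0.5 * (x_main + x_ref)        # fixed beamformer output
    blocked = x_main - x_ref              # blocking branch (target suppressed)
    y = fixed - w_anc * blocked           # noise-cancelled output
    if not vad:                           # adapt only in noise-only periods
        w_anc += mu * y * blocked / (blocked**2 + 1e-8)
    return y, w_anc

w = 0.0
for xm, xr in zip(np.random.randn(100), np.random.randn(100)):
    y, w = gsc_step(xm, xr, w, vad=False)
```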
Affiliation(s)
- Yi Zhou
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Haiping Wang
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
- Yijing Chu
- State Key Laboratory of Subtropical Building Science, South China University of Technology, Guangzhou 510641, China
- Hongqing Liu
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
24. Li L, Rehr R, Bruns P, Gerkmann T, Röder B. A Survey on Probabilistic Models in Human Perception and Machines. Front Robot AI 2021; 7:85. PMID: 33501252; PMCID: PMC7805657; DOI: 10.3389/frobt.2020.00085.
Abstract
Extracting information from noisy signals is of fundamental importance for both biological and artificial perceptual systems. To provide tractable solutions to this challenge, the fields of human perception and machine signal processing (SP) have developed powerful computational models, including Bayesian probabilistic models. However, little true integration between these fields exists in their applications of the probabilistic models for solving analogous problems, such as noise reduction, signal enhancement, and source separation. In this mini review, we briefly introduce and compare selective applications of probabilistic models in machine SP and human psychophysics. We focus on audio and audio-visual processing, using examples of speech enhancement, automatic speech recognition, audio-visual cue integration, source separation, and causal inference to illustrate the basic principles of the probabilistic approach. Our goal is to identify commonalities between probabilistic models addressing brain processes and those aiming at building intelligent machines. These commonalities could constitute the closest points for interdisciplinary convergence.
Affiliation(s)
- Lux Li
- Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
- Robert Rehr
- Signal Processing (SP), Department of Informatics, University of Hamburg, Hamburg, Germany
| | - Patrick Bruns
- Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
| | - Timo Gerkmann
- Signal Processing (SP), Department of Informatics, University of Hamburg, Hamburg, Germany
| | - Brigitte Röder
- Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
| |
Collapse
|
25
|
Gößling N, Marquardt D, Doclo S. Perceptual Evaluation of Binaural MVDR-Based Algorithms to Preserve the Interaural Coherence of Diffuse Noise Fields. Trends Hear 2020; 24:2331216520919573. [PMID: 32339061 PMCID: PMC7225838 DOI: 10.1177/2331216520919573] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022] Open
Abstract
Besides improving speech intelligibility in background noise, another important objective of noise reduction algorithms for binaural hearing devices is preserving the spatial impression for the listener. In this study, we evaluate the performance of several recently proposed noise reduction algorithms based on the binaural minimum-variance-distortionless-response (MVDR) beamformer, which trade off noise reduction performance against preservation of the interaural coherence (IC) of diffuse noise fields. Aiming at a perceptually optimized result, this trade-off is determined from the IC discrimination ability of the human auditory system. The algorithms are evaluated with normal-hearing participants in an anechoic scenario and a reverberant cafeteria scenario, in terms of both speech intelligibility, using a matrix sentence test, and spatial quality, using a MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA). The results show that all the binaural noise reduction algorithms improve speech intelligibility compared with the unprocessed microphone signals, and that partially preserving the IC of the diffuse noise field leads to a significant improvement in perceived spatial quality over the binaural MVDR beamformer while hardly affecting speech intelligibility.
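As a rough illustration of the trade-off being evaluated, the sketch below mixes a scaled reference-microphone signal back into a binaural MVDR output per frequency bin, in the spirit of "partial noise estimation" approaches. The mixing parameter eta and the reference-channel convention are assumptions for illustration; the evaluated algorithms and their perceptually tuned trade-off parameters differ in detail.

```python
import numpy as np

def mvdr_with_ic_tradeoff(X, R_noise, d, ref=0, eta=0.2):
    """One frequency bin and one ear (sketch).

    X       : (M,) noisy microphone spectra
    R_noise : (M, M) noise covariance matrix
    d       : (M,) relative transfer function of the target
    eta     : 0 = full noise reduction; toward 1 = more of the
              unprocessed signal, preserving the noise IC
    """
    Rinv_d = np.linalg.solve(R_noise, d)
    w = Rinv_d / (d.conj() @ Rinv_d)      # MVDR: distortionless in d
    return (1.0 - eta) * (w.conj() @ X) + eta * X[ref]
```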
Collapse
Affiliation(s)
- Nico Gößling
- Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, University of Oldenburg
| | - Daniel Marquardt
- Starkey Hearing Technologies, Eden Prairie, Minnesota, United States
| | - Simon Doclo
- Department of Medical Physics and Acoustics and Cluster of Excellence Hearing4all, University of Oldenburg
| |
Collapse
|
26
|
Kim SM. Wearable Hearing Device Spectral Enhancement Driven by Non-Negative Sparse Coding-Based Residual Noise Reduction. Sensors (Basel) 2020; 20:E5751. [PMID: 33050447 DOI: 10.3390/s20205751] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/02/2020] [Accepted: 10/06/2020] [Indexed: 11/17/2022]
Abstract
This paper proposes a novel technique to improve a spectral statistical filter for speech enhancement, to be applied in wearable hearing devices such as hearing aids. The proposed method is implemented considering a 32-channel uniform polyphase discrete Fourier transform filter bank, for which the overall algorithm processing delay is 8 ms in accordance with the hearing device requirements. The proposed speech enhancement technique, which exploits the concepts of both non-negative sparse coding (NNSC) and spectral statistical filtering, provides an online unified framework to overcome the problem of residual noise in spectral statistical filters under noisy environments. First, the spectral gain attenuator of the statistical Wiener filter is obtained using the a priori signal-to-noise ratio (SNR) estimated through a decision-directed approach. Next, the spectrum estimated using the Wiener spectral gain attenuator is decomposed by applying the NNSC technique to the target speech and residual noise components. These components are used to develop an NNSC-based Wiener spectral gain attenuator to achieve enhanced speech. The performance of the proposed NNSC-Wiener filter was evaluated through a perceptual evaluation of the speech quality scores under various noise conditions with SNRs ranging from -5 to 20 dB. The results indicated that the proposed NNSC-Wiener filter can outperform the conventional Wiener filter and NNSC-based speech enhancement methods at all SNRs.
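The decision-directed a priori SNR estimate and the resulting Wiener gain mentioned above follow a standard recursion; a minimal per-frame sketch is shown below. The NNSC decomposition of the filtered spectrum, which is the paper's contribution, is not reproduced here, and the smoothing constant and floor are typical values rather than the paper's.

```python
import numpy as np

def wiener_dd_frame(noisy_psd, noise_psd, xi_carry, alpha=0.98, xi_min=1e-3):
    """One frame of decision-directed Wiener filtering (per-bin arrays).

    xi_carry : G_prev**2 * gamma_prev carried over from the last frame
    """
    gamma = noisy_psd / np.maximum(noise_psd, 1e-12)     # a posteriori SNR
    xi = alpha * xi_carry + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    xi = np.maximum(xi, xi_min)                          # a priori SNR
    G = xi / (1.0 + xi)                                  # Wiener gain
    return G, G**2 * gamma                               # gain, next carry
```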
Collapse
|
27
|
Shankar N, Bhat GS, Panahi IMS. Real-time single-channel deep neural network-based speech enhancement on edge devices. Interspeech 2020; 2020:3281-3285. [PMID: 33898608 PMCID: PMC8064406 DOI: 10.21437/interspeech.2020-1901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
In this paper, we present a deep neural network architecture comprising both convolutional neural network (CNN) and recurrent neural network (RNN) layers for real-time single-channel speech enhancement (SE). The proposed neural network model enhances the noisy speech magnitude spectrum frame by frame. The developed model is implemented on a smartphone (edge device) to demonstrate the real-time usability of the proposed method. Perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) test results are used to compare the proposed algorithm with previously published conventional and deep learning-based SE methods. Subjective ratings show the performance improvement of the proposed model over the other baseline SE methods.
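A minimal PyTorch sketch of this kind of CNN+RNN mask estimator follows. It is frame-causal (frequency-only convolutions plus a unidirectional GRU), matching the frame-by-frame constraint, but the layer sizes and the mask formulation are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class CRNNMask(nn.Module):
    """Frame-causal CNN+GRU magnitude-mask estimator (sketch)."""
    def __init__(self, n_bins=161, conv_ch=64, hidden=128):
        super().__init__()
        # convolutions act along frequency only, so no future frames
        # are touched and the model stays frame-by-frame causal
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_ch, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(),
            nn.Conv2d(conv_ch, 1, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, mag):                    # mag: (batch, time, bins)
        h = self.conv(mag.unsqueeze(1)).squeeze(1)
        h, _ = self.rnn(h)                     # unidirectional GRU
        mask = torch.sigmoid(self.out(h))      # bounded per-bin gain
        return mask * mag                      # enhanced magnitude
```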
Collapse
Affiliation(s)
- Nikhil Shankar
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX-75080, USA
| | - Gautam Shreedhar Bhat
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX-75080, USA
| | - Issa M S Panahi
- Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX-75080, USA
| |
Collapse
|
28
|
Zhou Y, Chen Y, Ma Y, Liu H. A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor. Sensors (Basel) 2020; 20:E5050. [PMID: 32899533 PMCID: PMC7571026 DOI: 10.3390/s20185050] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Revised: 09/02/2020] [Accepted: 09/03/2020] [Indexed: 11/16/2022]
Abstract
Speech quality and intelligibility are usually impaired by interfering background noise during internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement. Assisted by the BC sensor, which is less sensitive to environmental noise than the regular air-conduction (AC) microphone, accurate voice activity detection (VAD) can be obtained from the BC signal and incorporated into the adaptive noise canceller (ANC) and the adaptive blocking matrix (ABM). The SRU-based postfilter consists of a recurrent neural network with a small number of parameters, which improves computational efficiency. Sub-band signal processing is designed to compress the input features of the neural network, and the scale-invariant signal-to-distortion ratio (SI-SDR) is adopted as the loss function to minimize distortion of the desired speech signal. Experimental results demonstrate that the proposed real-time speech enhancement system provides significant improvements in speech sound quality and intelligibility for all noise types and levels, compared with an AC-only beamformer with a postfiltering algorithm.
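The SI-SDR objective named above has a simple closed form; a numpy sketch follows. The negative of this value would be minimized during training.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a time-domain estimate and target."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # orthogonal projection of the estimate onto the target direction
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))
```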
Collapse
Affiliation(s)
- Yi Zhou
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (Y.Z.); (Y.C.)
| | - Yufan Chen
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (Y.Z.); (Y.C.)
| | - Yongbao Ma
- Suresense Technology, Chongqing 400065, China;
| | - Hongqing Liu
- School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; (Y.Z.); (Y.C.)
| |
Collapse
|
29
|
Wang ZQ, Wang P, Wang D. Complex Spectral Mapping for Single- and Multi-Channel Speech Enhancement and Robust ASR. IEEE/ACM Trans Audio Speech Lang Process 2020; 28:1778-1787. [PMID: 33748326 PMCID: PMC7971156 DOI: 10.1109/taslp.2020.2998279] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortionless response (MVDR) beamformer. The RI components of the beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With the estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19%, and 2.00% word error rates (WER) on the single-, two-, and six-microphone tasks of CHiME-4, respectively, significantly surpassing the current best results of 9.15%, 3.91%, and 2.24% WER.
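A sketch of the beamforming step is given below: DNN-estimated speech spectra define speech and noise covariances at each frequency bin, from which a time-invariant MVDR is computed. The eigenvector-based steering choice and the diagonal loading are common conventions assumed here, not necessarily the paper's exact derivation.

```python
import numpy as np

def mvdr_from_estimates(Y, S_hat, ref=0):
    """One frequency bin. Y, S_hat: (frames, mics) noisy and
    DNN-estimated speech STFT coefficients."""
    N_hat = Y - S_hat                                    # noise estimate
    T, M = Y.shape
    Phi_s = np.einsum('tm,tn->mn', S_hat, S_hat.conj()) / T
    Phi_n = np.einsum('tm,tn->mn', N_hat, N_hat.conj()) / T
    # steering vector: principal eigenvector of the speech covariance,
    # normalized to the reference microphone (a common convention)
    _, vecs = np.linalg.eigh(Phi_s)
    d = vecs[:, -1]
    d = d / d[ref]
    Rinv_d = np.linalg.solve(Phi_n + 1e-6 * np.eye(M), d)
    w = Rinv_d / (d.conj() @ Rinv_d)
    return Y @ w.conj()                                  # output per frame
```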
Collapse
Affiliation(s)
- Zhong-Qiu Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA
| | - Peidong Wang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277 USA
| | - DeLiang Wang
- Department of Computer Science and Engineering & the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277 USA
| |
Collapse
|
30
|
Abstract
Two signal-processing procedures for separating the continuously-voiced speech of competing talkers are described and evaluated. With competing sentences, each spoken on a monotone, the procedures improved the intelligibility of the target talker both for listeners with normal hearing and for listeners with moderate-to-severe hearing losses of cochlear origin. However, with intoned sentences, benefits were smaller for normal-hearing listeners and were inconsistent for impaired listeners. It is argued that smaller benefits arise with intoned sentences because harmonics of the two voices are blurred together during spectral analysis, limiting the extent to which spectral contrast can be recovered in the processed signal. This is particularly disadvantageous to impaired listeners who have reduced spectro-temporal resolution. This paper discusses other substantial problems to be overcome before the feasibility of the procedures as components of a speech-enhancement system for hearing-impaired listeners could be demonstrated.
Collapse
Affiliation(s)
- Quentin Summerfield
- MRC Institute of Hearing Research, University of Nottingham, Nottingham, England
| | - Richard J. Stubbs
- MRC Institute of Hearing Research, University of Nottingham, Nottingham, England
| |
Collapse
|
31
|
Chen Y, Chen W, Zhang P, Chen P. [Research progress of microphone array based front-end speech enhancement technology for cochlear implant]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2019; 36:696-704. [PMID: 31441274 PMCID: PMC10319500 DOI: 10.7507/1001-5515.201805050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Subscribe] [Scholar Register] [Received: 05/21/2018] [Indexed: 11/03/2022]
Abstract
Microphone array based methods have gradually been applied to front-end speech enhancement and speech recognition improvement for cochlear implants in recent years. By placing several microphones at different locations in space, this approach collects multi-channel signals that carry rich spatial position and orientation information. A microphone array can also form specific beamforming patterns to enhance the desired signal and suppress ambient noise, which makes it particularly suitable for face-to-face conversation by cochlear implant users, and its application value has attracted increasing attention from researchers. In this paper, we describe the principle of the microphone array method, analyze the microphone-array-based speech enhancement technologies in the present literature, and further present the technical difficulties and development trends.
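The simplest instance of the beamforming idea surveyed in this review is the frequency-domain delay-and-sum beamformer sketched below; the geometry and steering conventions are generic textbook assumptions rather than anything specific to the reviewed systems.

```python
import numpy as np

def delay_and_sum(X, freqs, mic_pos, look_dir, c=343.0):
    """Frequency-domain delay-and-sum beamformer (sketch).

    X        : (mics, bins, frames) STFT of the array signals
    freqs    : (bins,) bin centre frequencies in Hz
    mic_pos  : (mics, 3) microphone coordinates in metres
    look_dir : (3,) unit vector toward the desired talker
    """
    delays = mic_pos @ look_dir / c                        # per-mic delays (s)
    align = np.exp(2j * np.pi * np.outer(delays, freqs))   # undo the delays
    return (align[:, :, None] * X).mean(axis=0)            # align and average
```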
Collapse
Affiliation(s)
- Yousheng Chen
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| | - Weifang Chen
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| | - Pu Zhang
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| | - Peipei Chen
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| |
Collapse
|
32
|
Chen Y, Chen Y. [Research of front-end speech enhancement and beamforming algorithm based on dual microphone for cochlear implant]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2019; 36:468-477. [PMID: 31232551 DOI: 10.7507/1001-5515.201810025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 11/03/2022]
Abstract
Speech enhancement methods based on microphone arrays use multiple microphones to record the speech signal simultaneously. Because spatial information is added, these methods can improve speech recognition for cochlear implants in noisy environments. Owing to size constraints, the number of microphones used in a cochlear implant cannot be large, which limits the design of microphone array beamforming. To balance the size limitation of the cochlear implant against the spatial orientation information available during signal acquisition, we propose a speech enhancement and beamforming algorithm based on dual thin uni-directional/omni-directional microphone pairs (TP) in this paper. Each TP microphone contains two sound tubes for signal acquisition, which increases the overall spatial orientation information. In this paper, we discuss the beamforming characteristics obtained with different gain vectors and the influence of the inter-microphone distance on beamforming, which provides valuable theoretical analysis and engineering parameters for applying dual-microphone speech enhancement technology in cochlear implants.
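To illustrate the kind of design analysis described (beamforming characteristics as a function of the gain vector and inter-microphone distance), the sketch below evaluates the directivity of a two-element endfire pair over arrival angle. The frequency, spacing, and the delay-and-subtract weights, which steer a null to the rear and yield a cardioid-like pattern, are illustrative values only.

```python
import numpy as np

def pair_response(f, d, weights, angles, c=343.0):
    """Magnitude response of a two-mic endfire pair at frequency f,
    for plane waves arriving from `angles` (0 rad = look direction)."""
    tau = d * np.cos(angles) / c                   # inter-mic delay
    v = np.stack([np.ones_like(tau),               # mic 0: reference
                  np.exp(-2j * np.pi * f * tau)])  # mic 1: delayed arrival
    return np.abs(weights @ v)

angles = np.linspace(0.0, np.pi, 181)
f, d = 1000.0, 0.012                               # Hz, metres (assumed)
# delay-and-subtract weights -> rear-facing null (cardioid-like)
w = np.array([1.0, -np.exp(-2j * np.pi * f * d / 343.0)])
pattern = pair_response(f, d, w, angles)           # response vs. angle
```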
Collapse
Affiliation(s)
- Yousheng Chen
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| | - Yan Chen
- Shenzhen Institute of Information Technology, Shenzhen, Guangdong 518000, P.R.China
| |
Collapse
|
33
|
Flanagan S, Zorilă TC, Stylianou Y, Moore BCJ. Speech Processing to Improve the Perception of Speech in Background Noise for Children With Auditory Processing Disorder and Typically Developing Peers. Trends Hear 2019; 22:2331216518756533. [PMID: 29441834 PMCID: PMC5815419 DOI: 10.1177/2331216518756533] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Auditory processing disorder (APD) may be diagnosed when a child has listening difficulties but has normal audiometric thresholds. For adults with normal hearing and with mild-to-moderate hearing impairment, an algorithm called spectral shaping with dynamic range compression (SSDRC) has been shown to increase the intelligibility of speech when background noise is added after the processing. Here, we assessed the effect of such processing using 8 children with APD and 10 age-matched control children. The loudness of the processed and unprocessed sentences was matched using a loudness model. The task was to repeat back sentences produced by a female speaker when presented with either speech-shaped noise (SSN) or a male competing speaker (CS) at two signal-to-background ratios (SBRs). Speech identification was significantly better with SSDRC processing than without, for both groups. The benefit of SSDRC processing was greater for the SSN than for the CS background. For the SSN, scores were similar for the two groups at both SBRs. For the CS, the APD group performed significantly more poorly than the control group. The overall improvement produced by SSDRC processing could be useful for enhancing communication in a classroom where the teacher's voice is broadcast using a wireless system.
Collapse
Affiliation(s)
- Sheila Flanagan
- Department of Experimental Psychology, University of Cambridge, UK
| | | | - Yannis Stylianou
- Toshiba Research Europe Ltd., Cambridge Research Laboratory, UK; Department of Computer Science, University of Crete, Heraklion, Greece
| | - Brian C J Moore
- Department of Experimental Psychology, University of Cambridge, UK
| |
Collapse
|
34
|
Abstract
For supervised speech enhancement, contextual information is important for accurate mask estimation or spectral mapping. However, commonly used deep neural networks (DNNs) are limited in capturing temporal contexts. To leverage long-term contexts for tracking a target speaker, we treat speech enhancement as a sequence-to-sequence mapping, and present a novel convolutional neural network (CNN) architecture for monaural speech enhancement. The key idea is to systematically aggregate contexts through dilated convolutions, which significantly expand receptive fields. The CNN model additionally incorporates gating mechanisms and residual learning. Our experimental results suggest that the proposed model generalizes well to untrained noises and untrained speakers. It consistently outperforms a DNN, a unidirectional long short-term memory (LSTM) model and a bidirectional LSTM model in terms of objective speech intelligibility and quality metrics. Moreover, the proposed model has far fewer parameters than DNN and LSTM models.
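A minimal PyTorch sketch of one dilated, gated, residual convolution block of the kind described follows; channel counts, the kernel width, and the number of stacked blocks are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class GatedResBlock(nn.Module):
    """One dilated, gated, residual 1-D convolution block (sketch)."""
    def __init__(self, ch=64, dilation=1):
        super().__init__()
        self.filt = nn.Conv1d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.gate = nn.Conv1d(ch, ch, 3, padding=dilation, dilation=dilation)

    def forward(self, x):                     # x: (batch, ch, time)
        h = torch.tanh(self.filt(x)) * torch.sigmoid(self.gate(x))
        return x + h                          # residual connection

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field
# exponentially with depth, which is how long-term context is aggregated
# at modest parameter cost.
net = nn.Sequential(*[GatedResBlock(dilation=2**i) for i in range(6)])
```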
Collapse
Affiliation(s)
- Ke Tan
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, 43210-1277 USA
| | - Jitong Chen
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, USA . He is now with Silicon Valley AI Lab at Baidu Research, 1195 Bordeaux Drive, Sunnyvale, CA 94089, USA
| | - DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, USA
| |
Collapse
|
35
|
Abstract
Speech separation is the task of separating target speech from background interference. Traditionally, speech separation is studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then, we discuss three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is on separation algorithms where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances are made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
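Among the training targets this overview covers, the ideal ratio mask (IRM) is a representative example; a short sketch of its standard form follows (the exponent beta = 0.5 is a common choice).

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5):
    """Ideal ratio mask: per-TF-unit fraction of speech energy,
    computed from the premixed speech and noise magnitudes."""
    s2, n2 = speech_mag**2, noise_mag**2
    return (s2 / (s2 + n2 + 1e-12)) ** beta
```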
Collapse
Affiliation(s)
- DeLiang Wang
- Department of Computer Science and Engineering and the Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210 USA, and also with the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an 710072, China
| | - Jitong Chen
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210 USA. He is now with Silicon Valley AI Lab, Baidu Research, Sunnyvale, CA 94089 USA
| |
Collapse
|
36
|
Abstract
State-of-the-art noise power spectral density (PSD) estimation techniques for speech enhancement utilize the so-called speech presence probability (SPP). However, in highly non-stationary environments, SPP-based techniques can still suffer from inaccurate estimation, leading to a significant amount of residual noise or speech distortion. In this paper, we propose to improve speech enhancement by deploying a bone-conduction (BC) sensor, which is known to be relatively insensitive to environmental noise compared with the regular air-conduction (AC) microphone. A strategy is suggested to utilize the BC sensor characteristics to assist the AC microphone in better SPP-based noise estimation. To our knowledge, no previous work has incorporated the BC sensor in this noise-estimation role; consequently, the proposed strategy can be combined with other BC sensor assisted speech enhancement techniques. We show the feasibility and potential of the proposed method for improving enhanced speech quality through both objective and subjective tests.
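For orientation, a generic SPP-weighted noise PSD update is sketched below, together with one simplified, assumed way a BC-derived VAD could bias the AC-based speech presence probability. Both functions are illustrative; the second in particular is a hypothetical fusion rule, not the paper's published strategy.

```python
import numpy as np

def spp_noise_update(noisy_psd, noise_psd, p_speech, alpha=0.8):
    """Recursive noise PSD update weighted by speech presence probability:
    track the periodogram where speech is likely absent, hold the old
    estimate where it is likely present."""
    target = p_speech * noise_psd + (1.0 - p_speech) * noisy_psd
    return alpha * noise_psd + (1.0 - alpha) * target

def bc_biased_spp(p_speech_ac, bc_vad, lo=0.1, hi=0.5):
    """Assumed fusion rule: let the noise-robust BC VAD veto or support
    the AC-microphone SPP (thresholds lo and hi are hypothetical)."""
    return np.where(bc_vad, np.maximum(p_speech_ac, hi),
                    np.minimum(p_speech_ac, lo))
```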
Collapse
Affiliation(s)
- Ching-Hua Lee
- Department of Electrical and Computer Engineering, University of California, San Diego
| | - Bhaskar D Rao
- Department of Electrical and Computer Engineering, University of California, San Diego
| | - Harinath Garudadri
- Department of Electrical and Computer Engineering, University of California, San Diego
| |
Collapse
|
37
|
Shankar N, Kucuk A, Reddy CKA, Bhat GS, Panahi IMS. Influence of MVDR beamformer on a Speech Enhancement based Smartphone application for Hearing Aids. Annu Int Conf IEEE Eng Med Biol Soc 2018; 2018:417-420. [PMID: 30440422 PMCID: PMC7398114 DOI: 10.1109/embc.2018.8512369] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
This paper presents the minimum variance distortionless response (MVDR) beamformer combined with a speech enhancement (SE) gain function as a real-time smartphone application that serves as an assistive device for hearing aids. Beamforming techniques have been shown to improve the signal-to-noise ratio (SNR) in noisy conditions. In the proposed algorithm, the MVDR beamformer is used as an SNR booster for the SE method. The proposed SE gain is based on the log-spectral amplitude estimator to improve speech quality in the presence of different background noises. Objective evaluation and intelligibility measures support the theoretical analysis and show significant improvements of the proposed method in comparison with existing methods. Subjective test results show the effectiveness of the application in real-world noisy conditions at SNR levels of -5 dB, 0 dB, and 5 dB.
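The log-spectral amplitude gain used as the SE stage has the classical Ephraim-Malah closed form; a sketch follows, applied per time-frequency bin to the beamformer output.

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """Log-spectral amplitude gain: G = xi/(1+xi) * exp(E1(v)/2),
    with v = xi * gamma / (1 + xi); xi is the a priori SNR and
    gamma the a posteriori SNR (per-bin arrays)."""
    v = np.maximum(xi * gamma / (1.0 + xi), 1e-10)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

# usage: enhanced_spectrum = lsa_gain(xi, gamma) * beamformer_output
```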
Collapse
|
38
|
Abstract
This study examined the perceptual consequences of three speech enhancement schemes based on multiband nonlinear expansion of temporal envelope fluctuations between 10 and 20 Hz: (a) "idealized" envelope expansion of the speech before the addition of stationary background noise, (b) envelope expansion of the noisy speech, and (c) envelope expansion of only those time-frequency segments of the noisy speech that exhibited signal-to-noise ratios (SNRs) above -10 dB. Linear processing was considered as a reference condition. The performance was evaluated by measuring consonant recognition and consonant confusions in normal-hearing and hearing-impaired listeners using consonant-vowel nonsense syllables presented in background noise. Envelope expansion of the noisy speech showed no significant effect on the overall consonant recognition performance relative to linear processing. In contrast, SNR-based envelope expansion of the noisy speech improved the overall consonant recognition performance equivalent to a 1- to 2-dB improvement in SNR, mainly by improving the recognition of some of the stop consonants. The effect of the SNR-based envelope expansion was similar to the effect of envelope-expanding the clean speech before the addition of noise.
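As a rough sketch of envelope expansion in one analysis band, the code below boosts the 10 to 20 Hz envelope fluctuations extracted with a Hilbert transform and re-imposes the modified envelope on the band signal. The filter orders, the additive (rather than strictly nonlinear) expansion rule, and the gain are assumptions for illustration, not the study's processing.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def expand_band(x, fs, band, mod_band=(10.0, 20.0), gain=2.0):
    """Expand 10-20 Hz envelope fluctuations of one frequency band."""
    sos = butter(4, band, btype='bandpass', fs=fs, output='sos')
    xb = sosfilt(sos, x)                      # band-limited signal
    env = np.abs(hilbert(xb))                 # Hilbert envelope
    sos_m = butter(2, mod_band, btype='bandpass', fs=fs, output='sos')
    fluct = sosfilt(sos_m, env)               # 10-20 Hz envelope component
    env_new = np.maximum(env + (gain - 1.0) * fluct, 0.0)
    return xb * env_new / np.maximum(env, 1e-8)   # re-impose envelope

# usage: y = expand_band(x, 16000, (1000.0, 2000.0))
```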
Collapse
Affiliation(s)
- Alan Wiinberg
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Lyngby, Denmark
| | - Johannes Zaar
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Lyngby, Denmark
| | - Torsten Dau
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Lyngby, Denmark
| |
Collapse
|
39
|
Chen YY. Speech Enhancement of Mobile Devices Based on the Integration of a Dual Microphone Array and a Background Noise Elimination Algorithm. Sensors (Basel) 2018; 18:E1467. [PMID: 29738481 DOI: 10.3390/s18051467] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/19/2018] [Revised: 04/29/2018] [Accepted: 05/03/2018] [Indexed: 11/21/2022]
Abstract
Mobile devices are often used in our daily lives for speech and communication. Their speech quality is frequently degraded by the environmental noise surrounding users, and an effective background noise reduction solution for this speech enhancement problem is not easily developed. For these reasons, a methodology is systematically proposed to eliminate the effects of background noise on the speech communication of mobile devices. This methodology integrates a dual microphone array with a background noise elimination algorithm comprising a whitening process, a speech modelling method, and an H2 estimator. Because a dual microphone array is adopted, a low-cost design can be obtained for the speech enhancement of mobile devices. Practical tests demonstrate that the proposed method is immune to random background noise and that noiseless speech can be obtained after executing the denoising process.
Collapse
|
40
|
Abstract
In this paper, we propose a new speech enhancement algorithm based on wavelet packet decomposition and mask filtering. In traditional mask filtering, such as the ideal binary mask (IBM), the basic idea is to classify speech components as the target signal and non-speech components as background noise. However, speech and non-speech components cannot be separated cleanly into target signal and background noise, so the IBM suffers from residual noise and signal loss. To overcome this problem, the proposed algorithm uses a semi-soft mask filter combined with an exponential filter: the semi-soft mask minimizes signal loss, while the exponential filter removes residual noise. We performed experiments using various types of speech and noise signals, and the results show that the proposed algorithm outperforms traditional speech enhancement algorithms.
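The contrast between the hard IBM decision and a semi-soft mask can be sketched as below. The exponential decay below the local criterion is an assumed form consistent with the description, with an illustrative slope, not the paper's exact filter.

```python
import numpy as np

def ibm(snr_db, thresh=0.0):
    """Ideal binary mask: keep TF units above the local SNR criterion."""
    return (snr_db > thresh).astype(float)

def semi_soft_mask(snr_db, thresh=0.0, slope=0.5):
    """Semi-soft mask: unity above the criterion, smoothly decaying
    below it instead of hard zeroing, trading residual noise against
    signal loss (decay slope is an assumed parameter)."""
    return np.where(snr_db > thresh, 1.0,
                    np.exp(slope * (snr_db - thresh)))
```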
Collapse
Affiliation(s)
- Gihyoun Lee
- Department of Medical & Biological Engineering, Graduate School, Kyungpook National University, Daegu, Korea
| | - Sung Dae Na
- Department of Medical & Biological Engineering, Graduate School, Kyungpook National University, Daegu, Korea
| | - KiWoong Seong
- Department of Biomedical Engineering, Kyungpook National University Hospital, Daegu, Korea
| | - Jin-Ho Cho
- School of Electronics Engineering, College of IT Engineering, Kyungpook National University, Daegu, Korea
| | - Myoung Nam Kim
- Department of Biomedical Engineering, School of Medicine, Kyungpook National University, Daegu, Korea
| |
Collapse
|
41
|
Chen F, Li S, Li C, Liu M, Li Z, Xue H, Jing X, Wang J. A Novel Method for Speech Acquisition and Enhancement by 94 GHz Millimeter-Wave Sensor. Sensors (Basel) 2015; 16:E50. [PMID: 26729126 DOI: 10.3390/s16010050] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/09/2015] [Revised: 12/10/2015] [Accepted: 12/23/2015] [Indexed: 12/02/2022]
Abstract
In order to improve the speech acquisition ability of a non-contact method, a 94 GHz millimeter wave (MMW) radar sensor was employed to detect speech signals. This novel non-contact speech acquisition method was shown to have high directional sensitivity, and to be immune to strong acoustical disturbance. However, MMW radar speech is often degraded by combined sources of noise, which mainly include harmonic, electrical circuit and channel noise. In this paper, an algorithm combining empirical mode decomposition (EMD) and mutual information entropy (MIE) was proposed for enhancing the perceptibility and intelligibility of radar speech. Firstly, the radar speech signal was adaptively decomposed into oscillatory components called intrinsic mode functions (IMFs) by EMD. Secondly, MIE was used to determine the number of reconstructive components, and then an adaptive threshold was employed to remove the noise from the radar speech. The experimental results show that human speech can be effectively acquired by a 94 GHz MMW radar sensor when the detection distance is 20 m. Moreover, the noise of the radar speech is greatly suppressed and the speech sounds become more pleasant to human listeners after being enhanced by the proposed algorithm, suggesting that this novel speech acquisition and enhancement method will provide a promising alternative for various applications associated with speech detection.
Collapse
|
42
|
Abstract
Current cochlear implant (CI) strategies carry speech information via the waveform envelope in frequency subbands. CIs require efficient speech processing to maximize information transfer to the brain, especially in background noise, where the speech envelope is not robust to noise interference. In such conditions, the envelope, after decomposition into frequency bands, may be enhanced by sparse transformations, such as nonnegative matrix factorization (NMF). Here, a novel CI processing algorithm is described, which works by applying NMF to the envelope matrix (envelopogram) of 22 frequency channels in order to improve performance in noisy environments. It is evaluated for speech in eight-talker babble noise. The critical sparsity constraint parameter was first tuned using objective measures and then evaluated with subjective speech perception experiments for both normal hearing and CI subjects. Results from vocoder simulations with 10 normal hearing subjects showed that the algorithm significantly enhances speech intelligibility with the selected sparsity constraints. Results from eight CI subjects showed no significant overall improvement compared with the standard advanced combination encoder algorithm, but a trend toward improved word identification of about 10 percentage points at +15 dB signal-to-noise ratio (SNR) was observed. Additionally, the spread of speech perception performance narrowed considerably, from 40%-93% with the advanced combination encoder to 80%-100% with the suggested NMF coding strategy.
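Conceptually, the strategy factorizes the 22-channel envelopogram and resynthesizes it from a small number of nonnegative components; a sketch with scikit-learn follows. The component count and initialization are illustrative, and the paper's specific sparsity constraint is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_envelope_denoise(envelopogram, n_components=10):
    """Reconstruct a (channels x frames) nonnegative envelope matrix
    from a low-rank NMF fit; the product W @ H serves as the denoised
    envelopogram sent on to the CI channels."""
    model = NMF(n_components=n_components, init='nndsvda', max_iter=500)
    W = model.fit_transform(envelopogram)     # channel basis vectors
    H = model.components_                     # time activations
    return W @ H
```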
Collapse
Affiliation(s)
- Hongmei Hu
- Institute of Sound and Vibration Research, University of Southampton, UK; Medizinische Physik, Universität Oldenburg and Cluster of Excellence "Hearing4all", Oldenburg, Germany
| | - Mark E Lutman
- Institute of Sound and Vibration Research, University of Southampton, UK
| | - Stephan D Ewert
- Medizinische Physik, Universität Oldenburg and Cluster of Excellence "Hearing4all", Oldenburg, Germany
| | - Guoping Li
- Institute of Sound and Vibration Research, University of Southampton, UK; The Ear Institute, Faculty of Brain Sciences, University College London, UK
| | - Stefan Bleeck
- Institute of Sound and Vibration Research, University of Southampton, UK
| |
Collapse
|
43
|
Goldsworthy RL. Two-microphone spatial filtering improves speech reception for cochlear-implant users in reverberant conditions with multiple noise sources. Trends Hear 2014; 18:18/0/2331216514555489. [PMID: 25330772 PMCID: PMC4227667 DOI: 10.1177/2331216514555489] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
This study evaluates a spatial-filtering algorithm as a method to improve speech reception for cochlear-implant (CI) users in reverberant environments with multiple noise sources. The algorithm was designed to filter sounds using phase differences between two microphones situated 1 cm apart in a behind-the-ear hearing-aid capsule. Speech reception thresholds (SRTs) were measured using a Coordinate Response Measure for six CI users in 27 listening conditions including each combination of reverberation level (T60=0, 270, and 540 ms), number of noise sources (1, 4, and 11), and signal-processing algorithm (omnidirectional response, dipole-directional response, and spatial-filtering algorithm). Noise sources were time-reversed speech segments randomly drawn from the Institute of Electrical and Electronics Engineers sentence recordings. Target speech and noise sources were processed using a room simulation method allowing precise control over reverberation times and sound-source locations. The spatial-filtering algorithm was found to provide improvements in SRTs on the order of 6.5 to 11.0 dB across listening conditions compared with the omnidirectional response. This result indicates that such phase-based spatial filtering can improve speech reception for CI users even in highly reverberant conditions with multiple noise sources.
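A simplified version of such phase-based two-microphone filtering is sketched below: time-frequency units whose inter-microphone phase difference is consistent with a frontal (near zero-delay) target are kept, and the rest are attenuated. The binary decision rule and tolerance are assumptions; the evaluated algorithm's actual gain rule is more refined.

```python
import numpy as np

def phase_mask(X1, X2, freqs, d=0.01, c=343.0, tol=0.5):
    """Phase-difference spatial filter for a frontal target (sketch).

    X1, X2 : (bins, frames) STFTs of the two closely spaced mics
    freqs  : (bins,) bin frequencies in Hz; d: mic spacing in metres
    """
    dphi = np.angle(X1 * np.conj(X2))               # observed phase diff
    max_phi = 2 * np.pi * freqs[:, None] * d / c    # endfire upper bound
    keep = np.abs(dphi) <= tol * np.maximum(max_phi, 1e-6)
    return keep.astype(float) * X1                  # masked reference mic
```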
Collapse
|
44
|
Abstract
This paper presents a single-channel speech enhancement system using subband Kalman filtering, in which optimal autoregressive (AR) coefficients and variances for speech and noise are estimated via weighted linear prediction (WLP) and a noise weighting function (NWF). The system is applied to normal and oesophageal speech signals. The method is evaluated by the perceptual evaluation of speech quality (PESQ) score and signal-to-noise ratio (SNR) improvement for normal speech, and by the harmonic-to-noise ratio (HNR) for oesophageal speech (OES). Compared with previous systems, normal speech shows a 30% increase in PESQ score and a 4 dB SNR improvement, while OES shows a 3 dB HNR improvement.
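For reference, a scalar-observation Kalman filter driven by an AR(p) speech model, of the kind applied per subband above, can be sketched as follows; the AR coefficients and noise variances are assumed to be supplied by the WLP/NWF estimation stage.

```python
import numpy as np

def kalman_ar_enhance(y, a, q, r):
    """Kalman filter with an AR(p) speech model (sketch).

    y : noisy observations; a : AR coefficients (p,)
    q : process (excitation) variance; r : observation noise variance
    """
    p = len(a)
    F = np.zeros((p, p)); F[0] = a; F[1:, :-1] = np.eye(p - 1)  # companion
    H = np.zeros(p); H[0] = 1.0          # observe the current sample
    Q = np.zeros((p, p)); Q[0, 0] = q
    x = np.zeros(p); P = np.eye(p)
    out = np.zeros_like(y, dtype=float)
    for t, yt in enumerate(y):
        x = F @ x; P = F @ P @ F.T + Q                 # predict
        k = P @ H / (H @ P @ H + r)                    # Kalman gain
        x = x + k * (yt - H @ x)                       # update state
        P = P - np.outer(k, H) @ P
        out[t] = x[0]                                  # filtered speech
    return out
```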
Collapse
Affiliation(s)
- Rizwan Ishaq
- Deustotech-LIFE, University of Deusto, Bilbao, Spain
| | | |
Collapse
|
45
|
Abstract
Most noise reduction algorithms rely on obtaining reliable estimates of the SNR of each frequency bin. For that reason, much work has been done in analyzing the behavior and performance of SNR estimation algorithms in the context of improving speech quality and reducing speech distortions (e.g., musical noise). Comparatively little work has been reported, however, regarding the analysis and investigation of the effect of errors in SNR estimation on speech intelligibility. It is not known, for instance, whether it is the errors in SNR overestimation, errors in SNR underestimation, or both that are harmful to speech intelligibility. Errors in SNR estimation produce concomitant errors in the computation of the gain (suppression) function, and the impact of gain estimation errors on speech intelligibility is unclear. The present study assesses the effect of SNR estimation errors on gain function estimation via sensitivity analysis. Intelligibility listening studies were conducted to validate the sensitivity analysis. Results indicated that speech intelligibility is severely compromised when SNR and gain over-estimation errors are introduced in spectral components with negative SNR. A theoretical upper bound on the gain function is derived that can be used to constrain the values of the gain function so as to ensure that SNR overestimation errors are minimized. Speech enhancement algorithms that can limit the values of the gain function to fall within this upper bound can improve speech intelligibility.
Collapse
Affiliation(s)
| | - Philipos C. Loizou
- Address correspondence to: Philipos C. Loizou, Ph.D., Department of Electrical Engineering, University of Texas at Dallas, 800 West Campbell Road (EC33), Richardson, TX 75080-0688, Phone: (972) 883-4617, Fax: (972) 883-2710
| |
Collapse
|
46
|
Abstract
Making meaningful comparisons between the performance of the various speech enhancement algorithms proposed over the years has been elusive due to the lack of a common speech database, differences in the types of noise used, and differences in testing methodology. To facilitate such comparisons, we report on the development of a noisy speech corpus suitable for the evaluation of speech enhancement algorithms. This corpus is subsequently used for the subjective evaluation of 13 speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based, and Wiener-type algorithms. The subjective evaluation was performed by Dynastat, Inc. using the ITU-T P.835 methodology, which is designed to evaluate speech quality along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports the results of the subjective tests.
Collapse
Affiliation(s)
- Yi Hu
- Department of Electrical Engineering, The University of Texas at Dallas, Richardson, Texas 75083-0688, USA
| | | |
Collapse
|
47
|
Abstract
This paper focuses on optimal estimators of the magnitude spectrum for speech enhancement. We present an analytical solution for estimating in the MMSE sense the magnitude spectrum when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients are modeled by a Gaussian distribution. Furthermore, we derive the MMSE estimator under speech presence uncertainty and a Laplacian statistical model. Results indicated that the Laplacian-based MMSE estimator yielded less residual noise in the enhanced speech than the traditional Gaussian-based MMSE estimator. Overall, the present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech.
Collapse
Affiliation(s)
| | - Philipos C. Loizou
- * Corresponding author: Department of Electrical Engineering, University of Texas at Dallas, Richardson, Texas 75083-0688. Phone: (972) 883-4617, Fax: (972) 883-2710
| |
Collapse
|