1
Zheng C, Zhang H, Liu W, Luo X, Li A, Li X, Moore BCJ. Sixty Years of Frequency-Domain Monaural Speech Enhancement: From Traditional to Deep Learning Methods. Trends Hear 2023; 27:23312165231209913. [PMID: 37956661 PMCID: PMC10658184 DOI: 10.1177/23312165231209913] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/02/2022] [Accepted: 10/09/2023] [Indexed: 11/15/2023]
Abstract
Frequency-domain monaural speech enhancement has been extensively studied for over 60 years, and a great number of methods have been proposed and applied to many devices. In the last decade, monaural speech enhancement has made tremendous progress with the advent and development of deep learning, and performance using such methods has been greatly improved relative to traditional methods. This survey paper first provides a comprehensive overview of traditional and deep-learning methods for monaural speech enhancement in the frequency domain. The fundamental assumptions of each approach are then summarized and analyzed to clarify their limitations and advantages. A comprehensive evaluation of some typical methods was conducted using the WSJ + Deep Noise Suppression (DNS) challenge and Voice Bank + DEMAND datasets to give an intuitive and unified comparison. The benefits of monaural speech enhancement methods using objective metrics relevant for normal-hearing and hearing-impaired listeners were evaluated. The objective test results showed that compression of the input features was important for simulated normal-hearing listeners but not for simulated hearing-impaired listeners. Potential future research and development topics in monaural speech enhancement are suggested.
Affiliation(s)
- Chengshi Zheng
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Huiyong Zhang
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Wenzhe Liu
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiaoxue Luo
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Andong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiaodong Li
- Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Brian C. J. Moore
- Cambridge Hearing Group, Department of Psychology, University of Cambridge, Cambridge, UK
2
Mahmmod BM, Ramli AR, Baker T, Al-Obeidat F, Abdulhussain SH, Jassim WA. Speech Enhancement Algorithm Based on Super-Gaussian Modeling and Orthogonal Polynomials. IEEE Access 2019; 7:103485-103504. [DOI: 10.1109/access.2019.2929864] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 09/02/2023]
3
Mahmmod BM, Ramli AR, Abdulhussain SH, Al-Haddad SAR, Jassim WA. Low-Distortion MMSE Speech Enhancement Estimator Based on Laplacian Prior. IEEE Access 2017; 5:9866-9881. [DOI: 10.1109/access.2017.2699782] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 09/02/2023]
4
Heart Rate Monitoring Using a Slow–Fast Adaptive Comb Filter to Eliminate Motion Artifacts. J Med Biol Eng 2016. [DOI: 10.1007/s40846-016-0183-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/25/2022]
5
Wei Z, Xueyun W, Jian jian Z, Hongxing L. Noninvasive fetal ECG estimation using adaptive comb filter. Computer Methods and Programs in Biomedicine 2013; 112:125-134. [PMID: 23942332 DOI: 10.1016/j.cmpb.2013.07.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 10/16/2012] [Revised: 06/06/2013] [Accepted: 07/21/2013] [Indexed: 06/02/2023]
Abstract
This paper describes a robust and simple algorithm for fetal electrocardiogram (FECG) estimation from abdominal signals using an adaptive comb filter (ACF). The ACF can adjust itself to temporal variations in the fundamental frequency, which makes it suitable for estimating quasi-periodic components of physiological signals, such as the ECG. The validity and performance of the described method are confirmed through experiments on real fetal ECG data. A comparison with the well-known independent component analysis (ICA) method is also presented.
Affiliation(s)
- Zheng Wei
- School of Electronic Information, Jiangsu University of Science and Technology, ZhenJiang, China.
6
Abe T, Matsumoto M, Hashimoto S. Noise reduction combining time-frequency epsilon-filter and M-transform. The Journal of the Acoustical Society of America 2008; 124:994-1005. [PMID: 18681591 DOI: 10.1121/1.2940584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2023]
Abstract
This paper introduces noise reduction combining a time-frequency epsilon-filter (TF epsilon-filter) and a time-frequency M-transform (TF M-transform). Musical noise is an objectionable artifact generated by noise reduction in the time-frequency domain, such as spectral subtraction and the TF epsilon-filter. It has a deleterious effect on speech recognition. To solve this problem, the M-transform is introduced. The M-transform is a linear transform based on the M-sequence. The method combining the time-domain epsilon-filter (TD epsilon-filter) and the time-domain M-transform (TD M-transform) can reduce not only white noise but also impulse noise. Musical noise is isolated in the time-frequency domain, much as impulse noise is isolated in the time domain. On this basis, this paper aims to reduce musical noise by adapting the M-transform to the time-frequency domain. Noise reduction using the TD M-transform and the TD epsilon-filter is first explained to clarify its features. Then, an improved method applying the M-transform to the time-frequency domain, namely the TF M-transform, is described. Noise reduction combining the TF epsilon-filter and the TF M-transform is also proposed. The proposed method can reduce not only high-level nonstationary noise but also musical noise. Experimental results are also given to demonstrate the performance of the proposed method.
Affiliation(s)
- Tomomi Abe
- Major in Pure and Applied Physics, Waseda University, Tokyo, Japan.
8
Abe T, Matsumoto M, Hashimoto S. Noise reduction combining time-domain epsilon-filter and time-frequency epsilon-filter. The Journal of the Acoustical Society of America 2007; 122:2697-2705. [PMID: 18189562 DOI: 10.1121/1.2785038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
A time-domain epsilon-filter (TD epsilon-filter) is a nonlinear filter that can reduce noise while preserving a signal that varies drastically, such as a speech signal. Although the filter design is simple, it can effectively reduce noise. It is applicable not only to stationary noise but also to nonstationary noise. It cannot, however, be applied when the amplitude of noise is relatively large. This paper introduces an advanced method for noise reduction that applies an epsilon-filter to complex spectra, namely a time-frequency epsilon-filter (TF epsilon-filter). This paper also introduces noise reduction combining a TD epsilon-filter and a TF epsilon-filter. An advanced method called a variable time-frequency epsilon-filter is also proposed. First, the algorithm of the TD epsilon-filter is explained to clarify the problem. Then, the algorithms of the proposed methods are explained. By utilizing an epsilon-filter in the frequency domain, the proposed method can reduce not only noise that has a relatively small amplitude but also noise that has a relatively large amplitude. Experimental results are also given to demonstrate the performance of the proposed methods in comparison to the results of some conventional methods.
Affiliation(s)
- Tomomi Abe
- Major in Pure and Applied Physics, Waseda University, 55N-4F-10A, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan.
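The abstract above does not give the filter equations. As a rough illustration only, the sketch below assumes one common formulation of a time-domain epsilon-filter, y(n) = x(n) + Σ a_k F(x(n+k) − x(n)), with uniform weights a_k and a function F that passes differences no larger than epsilon and zeroes the rest. The function name `epsilon_filter`, the window size, and the edge handling are assumptions made for illustration and are not taken from the paper.

```python
def epsilon_filter(x, epsilon, half_width=2):
    """Smooth x while ignoring neighbours that differ from the centre
    sample by more than epsilon (assumed textbook form, not the paper's)."""
    n = len(x)
    taps = 2 * half_width + 1
    y = []
    for i in range(n):
        acc = x[i]
        for k in range(-half_width, half_width + 1):
            j = min(max(i + k, 0), n - 1)  # clamp indices at the signal edges
            diff = x[j] - x[i]
            if abs(diff) <= epsilon:       # F(diff) = diff if small, else 0
                acc += diff / taps         # uniform weights a_k = 1 / taps
        y.append(acc)
    return y


# A large step is left untouched, while a small ripple is smoothed.
step = [0.0] * 4 + [10.0] * 4
ripple = [0.0, 0.4, 0.0, 0.4, 0.0, 0.4]
print(epsilon_filter(step, epsilon=1.0))
print(epsilon_filter(ripple, epsilon=1.0))
```

Because differences larger than epsilon contribute nothing, abrupt signal changes survive while small fluctuations are averaged. The same property explains the limitation noted in the abstract: noise whose amplitude exceeds epsilon is also left untouched.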
9
Irino T, Patterson RD, Kawahara H. Speech Segregation Using an Auditory Vocoder With Event-Synchronous Enhancements. IEEE Transactions on Audio, Speech, and Language Processing 2006; 14:2212-2221. [PMID: 20191101 PMCID: PMC2828642 DOI: 10.1109/tasl.2006.872611] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Abstract
We propose a new method to segregate concurrent speech sounds using an auditory version of a channel vocoder. The auditory representation of sound, referred to as an "auditory image," preserves fine temporal information, unlike conventional window-based processing systems. This makes it possible to segregate speech sources with an event synchronous procedure. Fundamental frequency information is used to estimate the sequence of glottal pulse times for a target speaker, and to repress the glottal events of other speakers. The procedure leads to robust extraction of the target speech and effective segregation even when the signal-to-noise ratio is as low as 0 dB. Moreover, the segregation performance remains high when the speech contains jitter, or when the estimate of the fundamental frequency F0 is inaccurate. This contrasts with conventional comb-filter methods where errors in F0 estimation produce a marked reduction in performance. We compared the new method to a comb-filter method using a cross-correlation measure and perceptual recognition experiments. The results suggest that the new method has the potential to supplant comb-filter and harmonic-selection methods for speech enhancement.
Affiliation(s)
- Toshio Irino
- Faculty of Systems Engineering, Wakayama University, Wakayama 640-8510, Japan
- Roy D. Patterson
- Centre for Neural Basis of Hearing, Department of Physiology, Development, and Neuroscience, University of Cambridge, Cambridge CB2 3EG, U.K.
- Hideki Kawahara
- Faculty of Systems Engineering, Wakayama University, Wakayama 640-8510, Japan
10
Deshmukh O, Espy-Wilson C, Salomon A, Singh J. Use of temporal information: detection of periodicity, aperiodicity, and pitch in speech. IEEE Transactions on Speech and Audio Processing 2005. [DOI: 10.1109/tsa.2005.851910] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.3] [Indexed: 11/09/2022]
11
Lazaris AM, Vasdekis SN, Gougoulakis AG, Liakakos TD, Galanis GD, Giannakakis SG, Sechas MN. Assessment of voice quality after carotid endarterectomy. Eur J Vasc Endovasc Surg 2002; 24:344-8. [PMID: 12323178 DOI: 10.1053/ejvs.2002.1725] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
OBJECTIVES: Vocal cord paralysis is considered a rare complication of carotid endarterectomy (CEA), but alteration in voice quality may be more common. The aim of this prospective study was to evaluate the effect of CEA on voice quality and to correlate any changes with the extent of the dissection. DESIGN-MATERIAL-METHODS: Thirty-five patients who underwent CEA were divided into two groups, according to the level of surgical dissection performed. The high-level dissection group comprised those patients who required mobilisation of the hypoglossal nerve and division of the posterior belly of the digastric muscle. The low-level dissection group included the rest. All the patients' voices were recorded and analysed digitally before CEA and at one and three months after the operation. Voice data were measured for standard deviation of fundamental frequency, jitter, shimmer and normalised noise energy (NNE). All patients underwent a laryngeal examination pre- and post-operation. RESULTS: None of the patients had any vocal cord dysfunction on laryngoscopy. Significant changes of voice quality (jitter, shimmer, NNE) were noticed in the high-level dissection group (p<0.05) one month after the operation. Two months later, the voice changes had subsided, but significant disturbances still remained (jitter, shimmer). CONCLUSIONS: Voice-related disturbances are far more common following CEA than is generally believed and, although they seem to be for the most part temporary, they deserve attention. Specifically, high-level surgical dissection seems to be a risk factor for postoperative vocal impairment.
Affiliation(s)
- A M Lazaris
- Vascular Laboratory, 3rd Surgical Department, Athens University, Sotiria Hospital, 72 Sevastopoulou str, 11524 Athens, Greece
12
Sameti H, Sheikhzadeh H, Li Deng, Brennan R. HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Transactions on Speech and Audio Processing 1998. [DOI: 10.1109/89.709670] [Citation(s) in RCA: 131] [Impact Index Per Article: 4.9] [Indexed: 11/09/2022]
13
Morgan D, George E, Lee L, Kay S. Cochannel speaker separation by harmonic enhancement and suppression. IEEE Transactions on Speech and Audio Processing 1997. [DOI: 10.1109/89.622561] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
14
Kates JM. Speech enhancement based on a sinusoidal model. Journal of Speech and Hearing Research 1994; 37:449-464. [PMID: 8028327 DOI: 10.1044/jshr.3702.449] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 05/22/2023]
Abstract
Sinusoidal modeling is a new procedure for representing the speech signal. In this approach, the signal is divided into overlapping segments, the Fourier transform is computed for each segment, and a set of desired spectral peaks is identified. The speech is then resynthesized using sinusoids that have the frequency, amplitude, and phase of the selected peaks, with the remaining spectral information being discarded. Using a limited number of sinusoids to reproduce speech in a background of multi-talker speech babble results in a speech signal that has an improved signal-to-noise ratio and enhanced spectral contrast. The more intense spectral components, assumed to be primarily the desired speech, are reproduced, whereas the less intense components, assumed to be primarily background noise, are not. To test the effectiveness of this processing approach as a noise suppression technique, both consonant recognition and perceived speech intelligibility were determined in quiet and in noise for a group of subjects with normal hearing as the number of sinusoids used to represent isolated speech tokens was varied. The results show that reducing the number of sinusoids used to represent the speech causes reduced consonant recognition and perceived intelligibility both in quiet and in noise, and suggest that similar results would be expected for listeners with hearing impairments.
Affiliation(s)
- J M Kates
- Center for Research in Speech and Hearing Sciences City University of New York
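The peak-picking and resynthesis steps described in the abstract above can be sketched for a single segment as follows. This is a simplified illustration under stated assumptions, not the author's implementation: it processes one unwindowed frame with a naive DFT, omits the overlapping segmentation, and uses hypothetical function names (`dft`, `pick_peaks`, `resynthesize`).

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT over non-negative frequency bins (demo only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def pick_peaks(spectrum, num_peaks):
    """Bin indices of the num_peaks largest local maxima in magnitude."""
    mags = [abs(c) for c in spectrum]
    peaks = [
        k for k in range(1, len(mags) - 1)
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    peaks.sort(key=lambda k: mags[k], reverse=True)
    return sorted(peaks[:num_peaks])


def resynthesize(frame, num_peaks):
    """Rebuild one frame from its strongest spectral peaks, discarding the rest."""
    n = len(frame)
    spectrum = dft(frame)
    out = [0.0] * n
    for k in pick_peaks(spectrum, num_peaks):
        amp = 2.0 * abs(spectrum[k]) / n      # amplitude of the sinusoid at bin k
        phase = cmath.phase(spectrum[k])      # phase of the sinusoid at bin k
        for t in range(n):
            out[t] += amp * math.cos(2 * math.pi * k * t / n + phase)
    return out


# A strong component at bin 5 plus a weak one at bin 12; keeping one
# sinusoid retains the strong component and drops the weak one.
n = 64
frame = [
    math.cos(2 * math.pi * 5 * t / n + 0.3) + 0.1 * math.cos(2 * math.pi * 12 * t / n)
    for t in range(n)
]
print(pick_peaks(dft(frame), 1))
```

Lowering `num_peaks` discards more of the weaker components, which is the trade-off the study examines: fewer sinusoids remove more background but also more of the speech.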
15
Tanyel M, Lee KY, Chey WY, Chitrapu PR. Multistage enhancement of surface recordings of canine gastric electrical signals. Ann Biomed Eng 1993; 21:337-50. [PMID: 8214818 DOI: 10.1007/bf02368626] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/29/2023]
Abstract
This article describes a multistage signal processing scheme to enhance the quality of canine gastric signals recorded from the abdominal surface. The scheme involves a cascade application of linear prediction followed by a nonlinear processing known as alpha-TM filtering. The linear prediction is used to separate, in the minimum mean square error sense, the slow wave from other uncorrelated interference signals. We make novel use of the order versus frequency response characteristics of linear predictors to achieve this separation. The nonlinear filtering is used to suppress the residual wide band impulsive noise. Our studies have indicated that such an optimized signal enhancement scheme produces a clean time domain signal, which is easy to interpret visually. It not only preserves the periodicity of the slow wave, but also seems to track any irregularities in the periods. We believe that this last feature, namely the potential to track nonstationarities in the signal, is the main contribution of our approach.
Affiliation(s)
- M Tanyel
- Department of ECE, Drexel University, Philadelphia, PA 19104
17
Jae Lim. Evaluation of a correlation subtraction method for enhancing speech degraded by additive white noise. IEEE Transactions on Acoustics, Speech, and Signal Processing 1978. [DOI: 10.1109/tassp.1978.1163129] [Citation(s) in RCA: 69] [Impact Index Per Article: 1.5] [Indexed: 11/08/2022]