1. Qu L, Weber C, Wermter S. LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2772-2782. PMID: 35867361. DOI: 10.1109/tnnls.2022.3191677.
Abstract
The aim of this work is to investigate the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The LipSound2 model is first pre-trained on ~2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction, achieving a significant improvement in speech quality and intelligibility over previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify transferability. Finally, we train a cascaded lip-reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio, achieving state-of-the-art performance on both English and Chinese benchmark datasets.
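LipSound2 predicts mel-scale spectrograms as its acoustic target. As a minimal illustration of the mel scale itself (a standard Hz-to-mel formula, not the authors' model; the function names are ours):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Center frequencies for an 8-band mel filterbank over 0-8 kHz,
# spaced uniformly in mel and mapped back to Hz.
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
centers_hz = [mel_to_hz(lo + i * (hi - lo) / 9) for i in range(1, 9)]
```

Spacing the bands uniformly in mel rather than Hz is what gives the spectrogram its perceptually motivated, finer-at-low-frequencies resolution.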
2. Gao T, Pan Q, Zhou J, Wang H, Tao L, Kwan HK. A Novel Attention-Guided Generative Adversarial Network for Whisper-to-Normal Speech Conversion. Cognit Comput 2023. DOI: 10.1007/s12559-023-10108-9.
3. Serrano García L, Raman S, Hernáez Rioja I, Navas Cordón E, Sanchez J, Saratxaga I. A Spanish multispeaker database of esophageal speech. Comput Speech Lang 2021. DOI: 10.1016/j.csl.2020.101168.
4. McLoughlin IV, Perrotin O, Sharifzadeh H, Allen J, Song Y. Automated Assessment of Glottal Dysfunction Through Unified Acoustic Voice Analysis. J Voice 2020; 36:743-754. DOI: 10.1016/j.jvoice.2020.08.032.
5. [Intraoral voice recording: towards a new smartphone-based method for vocal rehabilitation. German version]. HNO 2018; 66:760-768. PMID: 30203388. DOI: 10.1007/s00106-018-0548-8.
Abstract
After laryngectomy, a new voice is needed. We present the first steps in the development of a smartphone-based method. A microphone is placed in the mouth to record the pseudo-whispering voice of laryngectomized patients. The recording is analyzed by voice recognition software, followed by voice synthesis; eventually, this will be performed on a smartphone. We placed a microphone at 10 different positions inside and outside the mouth (two in front of the mouth, at 2 and 20 cm; five on the palate; and three on the lower jaw) and made voice recordings in eight healthy men. These recordings were analyzed by voice recognition software, and the text generated by the software was compared with the original text. Over all positions, correct word detection was 19.3% for recordings inside the mouth vs. 75.2% outside (p = 0.01). Within the mouth, recordings taken on the maxilla (22.8%) performed much better than those on the mandible (13.5%) (p = 0.01). The optimum position on the maxilla was the highest point of the palate, with 31.9% correct word identification (p = 0.028). Further investigation is needed as smartphone processing power grows and a smartphone-based voice recognition application is developed.
6. Schuldt T, Kramp B, Ovari A, Timmermann D, Dommerich S, Mlynski R, Ottl P. Intraoral voice recording: towards a new smartphone-based method for vocal rehabilitation. HNO 2018; 66:63-70. PMID: 30105524. DOI: 10.1007/s00106-018-0549-7.
Abstract identical to the German-language version in entry 5.
Affiliations
- T Schuldt, B Kramp, R Mlynski: Department of Otorhinolaryngology, Head and Neck Surgery "Otto Körner", Rostock University Medical Center, Doberaner Straße 137/139, 18057 Rostock, Germany
- A Ovari: Department of Otorhinolaryngology, Asklepios Klinik St. Georg, Hamburg, Germany
- D Timmermann: Faculty of Computer Science and Electrical Engineering, Rostock University, 18059 Rostock, Germany
- S Dommerich: Department of Otorhinolaryngology, Head and Neck Surgery, Charité University Medical Center, 10117 Berlin, Germany
- P Ottl: Department of Dentistry, Rostock University Medical Center, 18057 Rostock, Germany
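The microphone-placement study above scores each position by the percentage of correctly recognized words. A minimal sketch of such a word-level accuracy measure, using only the Python standard library (our own illustration, not the authors' exact scoring protocol):

```python
from difflib import SequenceMatcher

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words matched, in order, by the recognizer output."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0
    # Sum the sizes of all in-order matching word blocks.
    matched = sum(block.size
                  for block in SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return matched / len(ref)

print(word_accuracy("a new voice is needed", "a voice is required"))  # → 0.6
```

Matching on whole words in order (rather than characters) keeps the score comparable to a manual "words correctly detected" count.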
7
|
G NM, Ghosh PK. Reconstruction of articulatory movements during neutral speech from those during whispered speech. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 143:3352. [PMID: 29960421 DOI: 10.1121/1.5039750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
A transformation function (TF) that reconstructs neutral speech articulatory trajectories (NATs) from whispered speech articulatory trajectories (WATs) is investigated, such that the dynamic time warped (DTW) distance between the transformed whispered and the original neutral articulatory movements is minimized. Three candidate TFs are considered: an affine function with a diagonal matrix ( Ad) which reconstructs one NAT from the corresponding WAT, an affine function with a full matrix ( Af) and a deep neural network (DNN) based nonlinear function which reconstruct each NAT from all WATs. Experiments reveal that the transformation could be approximated well by Af, since it generalizes better across subjects and achieves the least DTW distance of 5.20 (±1.27) mm (on average), with an improvement of 7.47%, 4.76%, and 7.64% (relative) compared to that with Ad, DNN, and the best baseline scheme, respectively. Further analysis to understand the differences in neutral and whispered articulation reveals that the whispered articulators exhibit exaggerated movements in order to reconstruct the lip movements during neutral speech. It is also observed that among the articulators considered in the study, the tongue exhibits a higher precision and stability while whispering, implying that subjects control their tongue movements carefully in order to render an intelligible whispered speech.
Collapse
Affiliation(s)
- Nisha Meenakshi G
- Electrical Engineering, Indian Institute of Science, Bangalore-560012, India
| | | |
Collapse
|
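The study above scores each candidate transformation by the DTW distance between the transformed whispered and original neutral trajectories. A minimal dynamic-time-warping sketch for one-dimensional trajectories (an illustration under our own simplifying assumptions, not the paper's multi-channel setup):

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute-difference local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # a[i-1] matched again
                                  dp[i][j - 1],      # b[j-1] matched again
                                  dp[i - 1][j - 1])  # one-to-one step
    return dp[n][m]

print(dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0]))  # → 0.0
```

Because DTW allows a sample to be matched more than once, trajectories that differ only in local timing (as whispered and neutral articulation do) can still achieve zero or near-zero distance.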
8. McLoughlin I, Li J, Song Y, Sharifzadeh HR. Speech reconstruction using a deep partially supervised neural network. Healthc Technol Lett 2017; 4:129-133. PMID: 28868149. PMCID: PMC5569940. DOI: 10.1049/htl.2016.0103.
Abstract
Statistical speech reconstruction for larynx-related dysphonia has achieved good performance using Gaussian mixture models and, more recently, restricted Boltzmann machine arrays; however, deep neural network (DNN)-based systems have been hampered by the limited amount of training data available from individual voice-loss patients. The authors propose a novel DNN structure that allows a partially supervised training approach on spectral features from smaller data sets, yielding very good results compared with the current state of the art.
Affiliations
- Ian McLoughlin: School of Computing, The University of Kent, Medway, UK; National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Jingjie Li, Yan Song: National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, China
- Hamid R Sharifzadeh: Signal Processing Laboratory, Unitec Institute of Technology, Auckland, New Zealand
9. Fusion of auditory inspired amplitude modulation spectrum and cepstral features for whispered and normal speech speaker verification. Comput Speech Lang 2017. DOI: 10.1016/j.csl.2017.04.004.
10. Lachhab O, Di Martino J, Elhaj EI, Hammouch A. A preliminary study on improving the recognition of esophageal speech using a hybrid system based on statistical voice conversion. SpringerPlus 2015; 4:644. PMID: 26543778. PMCID: PMC4627987. DOI: 10.1186/s40064-015-1428-2.
Abstract
In this paper, we propose a hybrid system based on a modified statistical GMM voice conversion algorithm for improving the recognition of esophageal speech. The hybrid system aims to compensate for the distorted information in the esophageal acoustic features by means of voice conversion. The esophageal speech is converted into "target" laryngeal speech using an iterative statistical estimation of a transformation function. We do not apply a speech synthesizer to reconstruct the converted speech signal, given that the converted Mel-cepstral vectors are used directly as input to our speech recognition system. Furthermore, the feature vectors are linearly transformed by heteroscedastic linear discriminant analysis (HLDA), projecting them into a smaller space with good discriminative properties. The experimental results demonstrate that the proposed system improves phone recognition accuracy by an absolute 3.40% compared with the accuracy obtained with neither HLDA nor voice conversion.
Affiliations
- Othman Lachhab, Ahmed Hammouch: LRGE Laboratory, ENSET, Mohammed 5 University, Madinat Al Irfane, Rabat, Morocco
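The dimensionality-reduction step above applies a learned linear transform to the converted cepstral vectors. As a shape-level sketch of such a projection (a plain matrix multiply with a made-up projection matrix, not an actual HLDA estimate):

```python
def project(features, matrix):
    """Linearly project each d-dim feature vector to k dims (k = rows of matrix)."""
    return [[sum(m_j * x_j for m_j, x_j in zip(row, vec)) for row in matrix]
            for vec in features]

# Hypothetical 2x3 projection matrix reducing 3-dim vectors to 2 dims:
# keep the first coefficient, average the other two.
P = [[1.0, 0.0, 0.0],
     [0.0, 0.5, 0.5]]
reduced = project([[2.0, 4.0, 6.0], [1.0, 1.0, 1.0]], P)
print(reduced)  # → [[2.0, 5.0], [1.0, 1.0]]
```

HLDA differs from an arbitrary projection in how the matrix is estimated (maximum likelihood under per-class heteroscedastic Gaussians), but the runtime application to each feature vector is exactly this matrix multiply.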
11. McLoughlin IV, Sharifzadeh HR, Tan SL, Li J, Song Y. Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation. ACM Trans Access Comput 2015. DOI: 10.1145/2737724.
Abstract
Whispering is a natural, unphonated, secondary aspect of speech communication for most people. However, it is the primary mechanism of communication for some speakers with impaired voice production, such as partial laryngectomees, as well as for those prescribed voice rest, which often follows surgery or damage to the larynx. Unlike most people, who choose when to whisper and when not to, these speakers may have little choice but to rely on whispers for much of their daily vocal interaction.
Even though most speakers will whisper at times, and some speakers can only whisper, the majority of today's computational speech technology systems assume or require phonated speech. This article considers the conversion of whispers into natural-sounding phonated speech as a noninvasive prosthetic aid for people with voice impairments who can only whisper. As a by-product, the technique is also useful for unimpaired speakers who choose to whisper.
Speech reconstruction systems can be classified into those requiring training and those that do not. Among the latter, a recent parametric reconstruction framework is explored and then enhanced through a refined estimation of plausible pitch from weighted formant differences. The improved reconstruction framework, with the proposed formant-derived artificial pitch modulation, is validated through subjective and objective comparison tests against state-of-the-art alternatives.
Affiliations
- Ian V. McLoughlin, Jingjie Li, Yan Song: University of Science and Technology of China, Hefei, Anhui, China
- Su Lim Tan: Singapore Institute of Technology, Singapore
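The reconstruction above derives an artificial pitch contour from weighted formant differences. A deliberately simplified illustration of that general idea; the weights and constants here are our own, not the paper's:

```python
def plausible_pitch(f1_track, f2_track, base_hz=120.0, scale=0.02):
    """Derive a pitch contour that rises and falls with frame-to-frame formant movement."""
    pitch = []
    for i in range(len(f1_track)):
        if i == 0:
            delta = 0.0
        else:
            # weighted formant differences between consecutive frames
            delta = 0.7 * (f1_track[i] - f1_track[i - 1]) \
                  + 0.3 * (f2_track[i] - f2_track[i - 1])
        pitch.append(base_hz + scale * delta)
    return pitch

# Rising F1 nudges pitch above the base; falling formants nudge it below.
contour = plausible_pitch([500.0, 520.0, 510.0], [1500.0, 1500.0, 1480.0])
```

The appeal of such a scheme is that it needs no training data: the pitch is synthesized entirely from formant trajectories, which remain measurable in whispers even though true F0 is absent.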
12. Drugman T, Alku P, Alwan A, Yegnanarayana B. Glottal source processing: From analysis to applications. Comput Speech Lang 2014. DOI: 10.1016/j.csl.2014.03.003.
13. A Comprehensive Vowel Space for Whispered Speech. J Voice 2012; 26:e49-56. DOI: 10.1016/j.jvoice.2010.12.002.