1
Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12094338] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/27/2023]
Abstract
Speech emotion recognition (SER) is an important component of emotion computation and signal processing. Recently, many works have applied abundant acoustic features and complex model architectures to enhance the model’s performance, but these works sacrifice the portability of the model. To address this problem, we propose a model that uses only the fundamental frequency from electroglottograph (EGG) signals. EGG signals are a type of physiological signal that directly reflects the movement of the vocal cords. Under the assumption that different acoustic features share similar representations of the internal emotional state, we propose cross-modal emotion distillation (CMED) to train the EGG-based SER model by transferring robust speech emotion representations from the log-Mel-spectrogram-based model. With cross-modal emotion distillation, recognition accuracy increases from 58.98% to 66.80% on the S70 subset of the Chinese Dual-mode Emotional Speech Database (CDESD, 7 classes) and from 32.29% to 42.71% on the EMO-DB dataset (7 classes). This shows that the proposed method achieves results comparable to those of the human subjective experiment and realizes a trade-off between model complexity and performance.
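The abstract does not give the CMED objective, so the following is only a minimal sketch of a generic temperature-softened distillation loss, in which a student (here standing in for the EGG-based model) is trained to match the soft predictions of a teacher (the log-Mel-spectrogram-based model). The function names, `temperature`, and `alpha` are illustrative assumptions, not values or identifiers from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic distillation loss (assumed form, not the paper's exact CMED loss).

    alpha weights the KL term between softened teacher and student outputs;
    (1 - alpha) weights the ordinary cross-entropy with the hard labels.
    """
    eps = 1e-12
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = np.sum(p_teacher * (np.log(p_teacher + eps) - np.log(p_student_soft + eps)), axis=-1)
    # Cross-entropy of the unsoftened student output against hard labels.
    p_student = softmax(student_logits)
    labels = np.asarray(labels)
    ce = -np.log(p_student[np.arange(len(labels)), labels] + eps)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return float(np.mean(alpha * temperature ** 2 * kl + (1.0 - alpha) * ce))
```

In this form, a student that already reproduces the teacher's logits incurs no distillation penalty, and only the hard-label term remains.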
2
Chen L, Ren J, Chen P, Mao X, Zhao Q. Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03075-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
This paper proposes a framework that uses only the EGG signal for speech synthesis in scenarios with a limited set of content categories. EGG is a type of physiological signal that reflects the trends of vocal cord movement. Because EGG is acquired differently from speech signals, we explore its application to speech synthesis in two scenarios: (1) synthesizing speech under high-noise conditions, where clean speech signals are unavailable; and (2) enabling people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two stages, EGG to text and text to speech. The first stage is a text content recognition model based on Bi-LSTM, which converts each EGG signal sample into the corresponding text from a limited set of content classes. This model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The second stage synthesizes speech from the corresponding text and the EGG signal. Based on a modified Tacotron-2, our model attains a Mel cepstral distortion (MCD) of 5.877 and a mean opinion score (MOS) of 3.87, which is comparable with state-of-the-art performance and achieves an improvement of 0.42 with a smaller model size than the original Tacotron-2. To introduce the speaker characteristics contained in EGG into the final synthesized speech, we put forward a fine-grained fundamental frequency modification method, which adjusts the fundamental frequency according to EGG signals and achieves a lower MCD of 5.781 and a higher MOS of 3.94 than without modification.
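For readers unfamiliar with the MCD figures quoted above, the following is a minimal sketch of the commonly used definition of Mel cepstral distortion. It assumes the reference and synthesized Mel-cepstral frames are already time-aligned (for example by dynamic time warping done elsewhere) and that the 0th (energy) coefficient is excluded; the abstract does not say which conventions the authors used, and the function name is hypothetical.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB over aligned frames.

    ref_mcep, syn_mcep: arrays of shape (frames, coefficients), already aligned.
    Column 0 is treated as the energy term and skipped (an assumption).
    """
    ref = np.asarray(ref_mcep, dtype=float)
    syn = np.asarray(syn_mcep, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("inputs must be time-aligned and have the same shape")
    diff = ref[:, 1:] - syn[:, 1:]
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Lower values indicate synthesized spectra closer to the reference, which is why the drop from 5.877 to 5.781 is reported as an improvement.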
5
Ternström S. Normalized time-domain parameters for electroglottographic waveforms. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 146:EL65. [PMID: 31370590 DOI: 10.1121/1.5117174] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/16/2019] [Accepted: 06/28/2019] [Indexed: 06/10/2023]
Abstract
The electroglottographic waveform is of interest for characterizing phonation non-invasively. Existing parameterizations tend to give disparate results because they rely on somewhat arbitrary thresholds and/or contacting events. It is shown that neither is needed for formulating a normalized contact quotient and a normalized peak derivative. A heuristic combination of the two also resolves the ambiguity of a moderate contact quotient, with regard to vocal fold contacting being firm versus weak or absent. As preliminaries, schemes for electroglottography signal preconditioning and time-domain period detection are described that improve somewhat on similar methods. The algorithms are simple and compute quickly.
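The abstract does not state the formulas, so the following is only a sketch of how threshold-free, event-free parameters could be computed from a single EGG cycle: the cycle is scaled to the range 0 to 1, an area-style contact quotient is taken as its mean, and a peak derivative is expressed in normalized amplitude per cycle. Both definitions and all function names are assumptions for illustration and may differ from those in the paper.

```python
import numpy as np

def normalize_cycle(cycle):
    """Scale one EGG cycle so its minimum is 0 and its maximum is 1."""
    c = np.asarray(cycle, dtype=float)
    span = c.max() - c.min()
    if span == 0:
        raise ValueError("cycle has no amplitude variation")
    return (c - c.min()) / span

def area_contact_quotient(cycle):
    """Mean of the normalized cycle, i.e. the area under it over unit time.

    No contact threshold or contacting event is needed (assumed definition).
    """
    return float(normalize_cycle(cycle).mean())

def normalized_peak_derivative(cycle):
    """Largest sample-to-sample rise, in normalized amplitude per cycle.

    Multiplying by the number of samples makes the value independent of
    sampling rate and period length (assumed normalization).
    """
    u = normalize_cycle(cycle)
    return float(np.max(np.diff(u)) * len(u))
```

Under these assumed definitions, a pure sinusoid gives a quotient of 0.5 and a peak derivative close to pi.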
Affiliation(s)
- Sten Ternström
- Department of Speech, Music and Hearing, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm,
6
Nacci A, Romeo SO, Cavaliere MD, Macerata A, Bastiani L, Paludetti G, Galli J, Marchese MR, Barillari MR, Barillari U, Berrettini S, Laschi C, Cianchetti M, Manti M, Ursino F, Fattori B. Comparison of electroglottographic variability index in euphonic and pathological voice. ACTA OTORHINOLARYNGOLOGICA ITALICA 2019; 39:381-388. [PMID: 30745592 PMCID: PMC6966776 DOI: 10.14639/0392-100x-2127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/24/2018] [Accepted: 03/26/2018] [Indexed: 11/23/2022]
Abstract
In a recent study we introduced a new approach for analysis of the electroglottographic (EGG) signal. This method is based on the evaluation of variation of the EGG signal and its first derivative, through new software developed by the Pisan phoniatric school. This software is designed to extract quantitative indices related to the contacting and decontacting phases of the vocal folds during phonation. The software allows us to study the combined variability of vibration amplitude and velocity (i.e. the first derivative of the EGG signal). Pathological voices show a much more variable EGG signal compared to normal voices, since cordal vibration is made irregular by the presence of glottis plane pathologies. With the aim of demonstrating the differences between normal and pathological voices in combined vibration amplitude and velocity variability, we have introduced a new quantitative parameter named "variability index" (VI). We studied 95 subjects (35 normal and 60 with pathological voice); among pathological subjects, 15 showed functional dysphonia and 45 showed organic dysphonia. Subjects affected by organic dysphonia presented: 15 bilateral vocal nodules, 15 unilateral polyps and 15 unilateral cysts. All subjects were studied with videolaryngostroboscopy; electro-acoustic parameters of the voice were analysed with the KayPENTAX CSL (Model 4500) system. The EGG signal was recorded using KAY Model 6103 connected to the CSL system. The new software for the analysis of the EGG signal allows us to obtain not only a total VI value for variability over the whole recording, but also partial VI values for the different glottis cycle phases.
In fact, by plotting the amplitude variation and its first derivative on a Lissajous graph, it is possible to divide the whole glottis cycle into four phases (each represented by one of the four quadrants of the graph): the initial vocal folds contacting activity (VI-Q1), the last phase of vocal folds contacting (VI-Q2), the first phase of vocal folds decontacting (VI-Q3) and the last phase, up to the complete decontacting of vocal folds (VI-Q4). For each quadrant, it is also possible to work out the percent variability index. By comparing the variability indices in the normal and pathological groups, we obtained the following results: the total VI was significantly higher in the pathological subjects (0.25 vs 0.18; p = 0.01); the absolute per-quadrant VI values were higher in pathological subjects, although the differences were not significant (VI-Q2, 0.041 vs 0.029; VI-Q3, 0.065 vs 0.058; VI-Q4, 0.054 vs 0.052). The percent variability in the Q2 quadrant (VI-Q2%) was significantly higher in pathological subjects compared to normal subjects (0.22 vs 0.16; p = 0.01). The results of this study confirm that our new software for analysis of the EGG signal can distinguish normal voice from pathological voice based on the new quantitative parameter VI. Moreover, this study emphasises that the final contact phase of the vocal folds is the most representative of the difference between the normal and pathological voice and shows a wider variability in terms of amplitude and vibration velocity. Further studies on larger groups of subjects will be required to confirm these results and assess differences in the EGG signal among the various vocal fold pathologies.
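The quadrant split described above can be illustrated with a short sketch that places each EGG sample in a quadrant of the amplitude-versus-derivative plane. The abstract gives neither the VI formula nor the mapping from plot quadrants to VI-Q1 through VI-Q4, so the sketch only reports the fraction of samples per quadrant; the function name, the mean-centering of the signal, and the sign-based quadrant labels are all assumptions.

```python
import numpy as np

def phase_plane_quadrant_fractions(egg, sample_rate_hz):
    """Fraction of samples in each quadrant of the (amplitude, derivative) plane.

    The signal is mean-centered so that the amplitude axis crosses zero
    (an assumption; the original software may center differently).
    """
    x = np.asarray(egg, dtype=float)
    x = x - x.mean()
    # First derivative by finite differences, scaled to units per second.
    dx = np.gradient(x) * sample_rate_hz
    high = x >= 0
    rising = dx >= 0
    masks = {
        "amplitude>=0, derivative>=0": high & rising,
        "amplitude>=0, derivative<0": high & ~rising,
        "amplitude<0, derivative<0": ~high & ~rising,
        "amplitude<0, derivative>=0": ~high & rising,
    }
    return {name: float(mask.mean()) for name, mask in masks.items()}
```

For a pure sinusoid the trajectory is an ellipse and each quadrant holds about a quarter of the samples; irregular vibration would spread the trajectory and change these proportions from cycle to cycle.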
Affiliation(s)
- A Nacci
- ENT, Audiology and Phoniatrics Unit, University Hospital of Pisa, Italy
- S O Romeo
- ENT, Audiology and Phoniatrics Unit, University Hospital of Pisa, Italy
- M D Cavaliere
- ENT, Audiology and Phoniatrics Unit, University Hospital of Pisa, Italy
- A Macerata
- Department of Clinical and Experimental Medicine, University of Pisa, Italy
- L Bastiani
- Institute of Clinical Physiology of the Italian National Research Council (IFC-CNR), Pisa, Italy
- G Paludetti
- Institute of Otorhinolaryngology, Department of Head and Neck Surgery, Fondazione Policlinico Universitario A. Gemelli IRCCS, Roma - Università Cattolica del Sacro Cuore, Rome, Italy
- J Galli
- Institute of Otorhinolaryngology, Department of Head and Neck Surgery, Fondazione Policlinico Universitario A. Gemelli IRCCS, Roma - Università Cattolica del Sacro Cuore, Rome, Italy
- M R Marchese
- Institute of Otorhinolaryngology, Department of Head and Neck Surgery, Fondazione Policlinico Universitario A. Gemelli IRCCS, Rome, Italy
- M R Barillari
- Division of Phoniatrics and Audiology, Department of Mental and Physical Health and Preventive Medicine, University of Campania "Luigi Vanvitelli", Naples, Italy
- U Barillari
- Division of Phoniatrics and Audiology, Department of Mental and Physical Health and Preventive Medicine, University of Campania "Luigi Vanvitelli", Naples, Italy
- S Berrettini
- ENT, Audiology and Phoniatrics Unit, University Hospital of Pisa, Italy; Division of ENT Diseases, Karolinska Institutet, Stockholm, Sweden
- C Laschi
- The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy
- M Cianchetti
- The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy
- M Manti
- The BioRobotics Institute, Scuola Superiore Sant'Anna, Pisa, Italy
- F Ursino
- National Institute for Research in Phoniatrics, University of Pisa, Italy
- B Fattori
- ENT, Audiology and Phoniatrics Unit, University Hospital of Pisa, Italy