1. Chong CS, Davis C, Kim J. A Cantonese Audio-Visual Emotional Speech (CAVES) dataset. Behav Res Methods 2024; 56:5264-5278. [PMID: 38017201 PMCID: PMC11289252 DOI: 10.3758/s13428-023-02270-7]
Abstract
We present a Cantonese emotional speech dataset that is suitable for use in research investigating the auditory and visual expression of emotion in tonal languages. This unique dataset consists of auditory and visual recordings of ten native speakers of Cantonese uttering 50 sentences each in the six basic emotions plus neutral (angry, happy, sad, surprise, fear, and disgust). The visual recordings have a full HD resolution of 1920 × 1080 pixels and were recorded at 50 fps. The important features of the dataset are outlined along with the factors considered when compiling the dataset. A validation study of the recorded emotion expressions was conducted in which 15 native Cantonese perceivers completed a forced-choice emotion identification task. The variability of the speakers and the sentences was examined by testing the degree of concordance between the intended and the perceived emotion. We compared these results with those of other emotion perception and evaluation studies that have tested spoken emotions in languages other than Cantonese. The dataset is freely available for research purposes.
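
To make the validation procedure concrete: a forced-choice identification study like this reduces to tallying, for each intended emotion, how often perceivers chose that emotion. The following sketch is a hypothetical Python illustration (the file name and column names are assumptions, not part of the released dataset) of deriving a confusion matrix, per-emotion hit rates, and speaker-level accuracy with pandas.

```python
import pandas as pd

# Hypothetical long-format response file: one row per perceiver judgment.
# Assumed columns: speaker, sentence, intended, response.
df = pd.read_csv("caves_validation_responses.csv")

# Confusion matrix: intended emotion (rows) vs. perceived emotion (columns),
# normalized so each row sums to 1.
confusion = pd.crosstab(df["intended"], df["response"], normalize="index")
print(confusion.round(2))

# Per-emotion hit rate: the diagonal of the normalized confusion matrix.
hit_rates = {e: confusion.loc[e, e] for e in confusion.index if e in confusion.columns}
print(hit_rates)

# Speaker variability: mean accuracy per speaker for each intended emotion.
df["correct"] = (df["intended"] == df["response"]).astype(int)
print(df.groupby(["speaker", "intended"])["correct"].mean().unstack().round(2))
```
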

Affiliation(s)
- Chee Seng Chong
  - The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Locked Bag 1797, Penrith, NSW, 2751, Australia
- Chris Davis
  - The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Locked Bag 1797, Penrith, NSW, 2751, Australia
- Jeesun Kim
  - The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Locked Bag 1797, Penrith, NSW, 2751, Australia

2. von Eiff CI, Kauk J, Schweinberger SR. The Jena Audiovisual Stimuli of Morphed Emotional Pseudospeech (JAVMEPS): A database for emotional auditory-only, visual-only, and congruent and incongruent audiovisual voice and dynamic face stimuli with varying voice intensities. Behav Res Methods 2024; 56:5103-5115. [PMID: 37821750 PMCID: PMC11289065 DOI: 10.3758/s13428-023-02249-4]
Abstract
We describe JAVMEPS, an audiovisual (AV) database for emotional voice and dynamic face stimuli, with voices varying in emotional intensity. JAVMEPS includes 2256 stimulus files comprising (A) recordings of 12 speakers, speaking four bisyllabic pseudowords with six naturalistic induced basic emotions plus neutral, in auditory-only, visual-only, and congruent AV conditions. It furthermore comprises (B) caricatures (140%), original voices (100%), and anti-caricatures (60%) for happy, fearful, angry, sad, disgusted, and surprised voices for eight speakers and two pseudowords. Crucially, JAVMEPS contains (C) precisely time-synchronized congruent and incongruent AV (and corresponding auditory-only) stimuli with two emotions (anger, surprise), (C1) with original intensity (ten speakers, four pseudowords), (C2) and with graded AV congruence (implemented via five voice morph levels, from caricatures to anti-caricatures; eight speakers, two pseudowords). We collected classification data for Stimulus Set A from 22 normal-hearing listeners and four cochlear implant users, for two pseudowords, in auditory-only, visual-only, and AV conditions. Normal-hearing individuals showed good classification performance (McorrAV = .59 to .92), with classification rates in the auditory-only condition ≥ .38 correct (surprise: .67, anger: .51). Despite compromised vocal emotion perception, CI users performed above chance levels of .14 for auditory-only stimuli, with best rates for surprise (.31) and anger (.30). We anticipate JAVMEPS to become a useful open resource for researchers into auditory emotion perception, especially when adaptive testing or calibration of task difficulty is desirable. With its time-synchronized congruent and incongruent stimuli, JAVMEPS can also contribute to filling a gap in research regarding dynamic audiovisual integration of emotion perception via behavioral or neurophysiological recordings.

Affiliation(s)
- Celina I von Eiff
  - Department for General Psychology and Cognitive Neuroscience, Institute of Psychology, Friedrich Schiller University Jena, Am Steiger 3, 07743, Jena, Germany
  - Voice Research Unit, Institute of Psychology, Friedrich Schiller University Jena, Leutragraben 1, 07743, Jena, Germany
  - DFG SPP 2392 Visual Communication (ViCom), Frankfurt am Main, Germany
  - Jena University Hospital, 07747, Jena, Germany
- Julian Kauk
  - Department for General Psychology and Cognitive Neuroscience, Institute of Psychology, Friedrich Schiller University Jena, Am Steiger 3, 07743, Jena, Germany
- Stefan R Schweinberger
  - Department for General Psychology and Cognitive Neuroscience, Institute of Psychology, Friedrich Schiller University Jena, Am Steiger 3, 07743, Jena, Germany
  - Voice Research Unit, Institute of Psychology, Friedrich Schiller University Jena, Leutragraben 1, 07743, Jena, Germany
  - DFG SPP 2392 Visual Communication (ViCom), Frankfurt am Main, Germany
  - Jena University Hospital, 07747, Jena, Germany

3. Yue L, Hu P, Zhu J. Advanced differential evolution for gender-aware English speech emotion recognition. Sci Rep 2024; 14:17696. [PMID: 39085418 PMCID: PMC11291894 DOI: 10.1038/s41598-024-68864-z]
Abstract
Speech emotion recognition (SER) technology involves feature extraction and prediction models. However, recognition performance tends to decrease because of gender differences and the large number of extracted features. Consequently, this paper introduces a gender-aware SER system. First, gender and emotion features are extracted from speech signals to develop gender recognition and emotion classification models. Second, according to gender differences, distinct emotion recognition models are established for male and female speakers; the speaker's gender is determined before the corresponding emotion model is executed. Third, the accuracy of these emotion models is enhanced by using an advanced differential evolution algorithm (ADE) to select optimal features. ADE incorporates new difference vectors, mutation operators, and position learning, which effectively balance global and local search, and a new position-repairing method is proposed to address gender differences. Finally, experiments on four English datasets demonstrate that ADE outperforms comparison algorithms in recognition accuracy, recall, precision, F1-score, number of selected features, and execution time. The findings highlight the significance of gender in refining emotion models, with mel-frequency cepstral coefficients emerging as important factors in gender differences.
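
The feature-selection step described above is a wrapper approach driven by differential evolution. The sketch below is a generic binary DE feature-selection loop in Python, offered as an illustration only: it does not reproduce the paper's ADE (its new difference vectors, mutation operators, position learning, or gender-specific position repair), and the k-NN fitness function and parameter values are assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of a simple classifier on the selected features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

def binary_de_select(X, y, pop_size=20, gens=30, F=0.8, CR=0.9):
    n_feat = X.shape[1]
    pop = rng.random((pop_size, n_feat))                 # continuous positions in [0, 1]
    scores = np.array([fitness(p > 0.5, X, y) for p in pop])
    for _ in range(gens):
        for i in range(pop_size):
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), 0.0, 1.0)  # classic DE/rand/1 mutation
            trial = np.where(rng.random(n_feat) < CR, mutant, pop[i])
            s = fitness(trial > 0.5, X, y)
            if s >= scores[i]:                           # greedy selection
                pop[i], scores[i] = trial, s
    best = pop[scores.argmax()] > 0.5
    return best, scores.max()
```

A gender-aware variant in the spirit of the paper would run one such selection per gender after a gender-classification step, so that male and female emotion models each get their own feature subset.
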

Affiliation(s)
- Liya Yue
  - Fanli Business School, Nanyang Institute of Technology, Nanyang, 473004, China
- Pei Hu
  - School of Computer and Software, Nanyang Institute of Technology, Nanyang, 473004, China
- Jiulong Zhu
  - Fanli Business School, Nanyang Institute of Technology, Nanyang, 473004, China

4. Alroobaea R. Cross-corpus speech emotion recognition with transformers: Leveraging handcrafted features and data augmentation. Comput Biol Med 2024; 179:108841. [PMID: 39002317 DOI: 10.1016/j.compbiomed.2024.108841]
Abstract
Speech emotion recognition (SER) stands as a prominent and dynamic research field in data science due to its extensive application in domains such as psychological assessment, mobile services, and computer games. In previous research, numerous studies utilized manually engineered features for emotion classification, achieving commendable accuracy. However, these features tend to underperform in complex scenarios, leading to reduced classification accuracy. Such scenarios include: (1) datasets containing diverse speech patterns, dialects, accents, or variations in emotional expression; (2) data with background noise; (3) cases where the distribution of emotions varies significantly across datasets; and (4) combinations of datasets from different sources, which introduce complexities due to variations in recording conditions, data quality, and emotional expression. Consequently, there is a need to improve the classification performance of SER techniques. To address this, a novel SER framework was introduced in this study. Prior to feature extraction, signal preprocessing and data augmentation methods were applied to augment the available data, and 18 informative features were then derived from each signal. A discriminative feature set was obtained using feature selection techniques and used as input for emotion recognition on the SAVEE, RAVDESS, and EMO-DB datasets. Furthermore, this research also implemented a cross-corpus model that incorporated all speech files related to common emotions from the three datasets. The experimental outcomes demonstrated the superior performance of the proposed SER framework compared with existing frameworks in the field. Notably, the framework achieved remarkable accuracy rates across datasets: 95%, 94%, 97%, and 97% on SAVEE, RAVDESS, EMO-DB, and the cross-corpus dataset, respectively. These results underscore the significant contribution of the proposed framework to the field of SER.
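
As a rough illustration of the handcrafted-features-plus-augmentation idea (not the paper's exact 18-feature set or pipeline), the sketch below adds white-noise augmentation and extracts a few features commonly used in SER with librosa; the file path, sampling rate, and SNR are assumptions.

```python
import numpy as np
import librosa

def augment_with_noise(y, snr_db=20.0):
    """Additive white-noise augmentation at a target signal-to-noise ratio."""
    noise_power = np.mean(y ** 2) / (10 ** (snr_db / 10))
    return y + np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)

def extract_features(y, sr):
    """A small, illustrative feature vector: MFCC means plus a few spectral statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    rms = librosa.feature.rms(y=y).mean()
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    return np.hstack([mfcc, zcr, rms, centroid])

y, sr = librosa.load("savee_example.wav", sr=16000)            # hypothetical clip
X = np.vstack([extract_features(y, sr),                        # original
               extract_features(augment_with_noise(y), sr)])   # augmented copy
```

A feature-selection step and a classifier (or, as in the paper, a transformer-based model) would then operate on such feature matrices pooled across corpora.
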

Affiliation(s)
- Roobaea Alroobaea
  - Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia

5. Munsif M, Sajjad M, Ullah M, Tarekegn AN, Cheikh FA, Tsakanikas P, Muhammad K. Optimized efficient attention-based network for facial expressions analysis in neurological health care. Comput Biol Med 2024; 179:108822. [PMID: 38986286 DOI: 10.1016/j.compbiomed.2024.108822]
Abstract
Facial Expression Analysis (FEA) plays a vital role in diagnosing and treating early-stage neurological disorders (NDs) like Alzheimer's and Parkinson's. Manual FEA is hindered by expertise, time, and training requirements, while automatic methods confront difficulties with the unavailability of real patient data, high computational cost, and irrelevant feature extraction. To address these challenges, this paper proposes a novel approach: an efficient, lightweight convolutional block attention module (CBAM) based deep learning network (DLN) to aid doctors in diagnosing ND patients. The method comprises two stages: collection of real ND patient data, and pre-processing involving face detection and an attention-enhanced DLN for feature extraction and refinement. Extensive experiments with validation on real patient data showcase compelling performance, achieving an accuracy of up to 73.2%. Despite its efficacy, the proposed model is lightweight, occupying only 3 MB, making it suitable for deployment on resource-constrained mobile healthcare devices. Moreover, the method exhibits significant advancements over existing FEA approaches, holding tremendous promise for effectively diagnosing and treating ND patients. By accurately recognizing emotions and extracting relevant features, this approach empowers medical professionals in early ND detection and management, overcoming the challenges of manual analysis and heavy models. In conclusion, this research presents a significant step forward in FEA, promising to enhance ND diagnosis and care. The code and data used in this work are available at: https://github.com/munsif200/Neurological-Health-Care.
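
The convolutional block attention module (CBAM) named in the abstract is a standard building block that refines feature maps with channel attention followed by spatial attention. The PyTorch sketch below shows a generic CBAM block, not the authors' full lightweight network; the layer sizes and example tensor shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Generic CBAM: channel attention, then spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP applied to average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        channel_att = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) +
                                    self.mlp(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * channel_att
        spatial_in = torch.cat([x.mean(dim=1, keepdim=True),
                                x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(spatial_in))

# Example: refine 64-channel feature maps from a lightweight backbone.
refined = CBAM(64)(torch.randn(8, 64, 28, 28))
```
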

Affiliation(s)
- Muhammad Sajjad
  - Digital Image Processing Lab, Department of Computer Science, Islamia College, Peshawar, 25000, Pakistan; Department of Computer Science, Norwegian University for Science and Technology, 2815, Gjøvik, Norway
- Mohib Ullah
  - Intelligent Systems and Analytics Research Group (ISA), Department of Computer Science, Norwegian University for Science and Technology, 2815, Gjøvik, Norway
- Adane Nega Tarekegn
  - Department of Computer Science, Norwegian University for Science and Technology, 2815, Gjøvik, Norway
- Faouzi Alaya Cheikh
  - Department of Computer Science, Norwegian University for Science and Technology, 2815, Gjøvik, Norway
- Panagiotis Tsakanikas
  - Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Khan Muhammad
  - Visual Analytics for Knowledge Laboratory (VIS2KNOW Lab), Department of Applied Artificial Intelligence, School of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul 03063, Republic of Korea

6. Becker C, Conduit R, Chouinard PA, Laycock R. Can deepfakes be used to study emotion perception? A comparison of dynamic face stimuli. Behav Res Methods 2024. [PMID: 38834812 DOI: 10.3758/s13428-024-02443-y]
Abstract
Video recordings accurately capture facial expression movements; however, they are difficult for face perception researchers to standardise and manipulate. For this reason, dynamic morphs of photographs are often used, despite their lack of naturalistic facial motion. This study aimed to investigate how humans perceive emotions from faces using real videos and two different approaches to artificially generating dynamic expressions: dynamic morphs and AI-synthesised deepfakes. Our participants perceived dynamic morphed expressions as less intense when compared with videos (all emotions) and deepfakes (fearful, happy, sad). Videos and deepfakes were perceived similarly. Additionally, they perceived morphed happiness and sadness, but not morphed anger or fear, as less genuine than other formats. Our findings support previous research indicating that social responses to morphed emotions are not representative of those to video recordings. The findings also suggest that deepfakes may offer a more suitable standardized stimulus type than morphs. Additionally, qualitative data were collected from participants and analysed using ChatGPT, a large language model. ChatGPT successfully identified themes in the data consistent with those identified by an independent human researcher. According to this analysis, our participants perceived dynamic morphs as less natural compared with videos and deepfakes. That participants perceived deepfakes and videos similarly suggests that deepfakes effectively replicate natural facial movements, making them a promising alternative for face perception research. The study contributes to the growing body of research exploring the usefulness of generative artificial intelligence for advancing the study of human perception.

7. Thomas AL, Assmann PF. Speech production and perception data collection in R: A tutorial for web-based methods using speechcollectr. Behav Res Methods 2024. [PMID: 38829553 DOI: 10.3758/s13428-024-02399-z]
Abstract
This tutorial is designed for speech scientists familiar with the R programming language who wish to construct experiment interfaces in R. We begin by discussing some of the benefits of building experiment interfaces in R, including R's existing tools for speech data analysis, platform independence, suitability for web-based testing, and the fact that R is open source. We explain basic concepts of reactive programming in R, and we apply these principles by detailing the development of two sample experiments. The first of these experiments comprises a speech production task in which participants are asked to read words with different emotions. The second sample experiment involves a speech perception task, in which participants listen to recorded speech and identify the emotion the talker expressed with forced-choice questions and confidence ratings. Throughout this tutorial, we introduce the new R package speechcollectr, which provides functions uniquely suited to web-based speech data collection. The package streamlines the code required for speech experiments by providing functions for common tasks like documenting participant consent, collecting participant demographic information, recording audio, checking the adequacy of a participant's microphone or headphones, and presenting audio stimuli. Finally, we describe some of the difficulties of remote speech data collection, along with the solutions we have incorporated into speechcollectr to meet these challenges.

Affiliation(s)
- Abbey L Thomas
  - School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX, USA
- Peter F Assmann
  - School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, TX, USA

8. Cooper A, Eitel M, Fecher N, Johnson E, Cirelli LK. Who is singing? Voice recognition from spoken versus sung speech. JASA Express Lett 2024; 4:065203. [PMID: 38888432 DOI: 10.1121/10.0026385]
Abstract
Singing is socially important but constrains voice acoustics, potentially masking certain aspects of vocal identity. Little is known about how well listeners extract talker details from sung speech or identify talkers across the sung and spoken modalities. Here, listeners (n = 149) were trained to recognize sung or spoken voices and then tested on their identification of these voices in both modalities. Learning vocal identities was initially easier through speech than song. At test, cross-modality voice recognition was above chance, but weaker than within-modality recognition. We conclude that talker information is accessible in sung speech, despite acoustic constraints in song.

Affiliation(s)
- Angela Cooper
  - Department of Psychology, University of Toronto Mississauga, Mississauga, Ontario, Canada
- Matthew Eitel
  - Department of Psychology, University of Toronto Scarborough, Toronto, Ontario, Canada
- Natalie Fecher
  - Department of Psychology, University of Toronto Mississauga, Mississauga, Ontario, Canada
- Elizabeth Johnson
  - Department of Psychology, University of Toronto Mississauga, Mississauga, Ontario, Canada
- Laura K Cirelli
  - Department of Psychology, University of Toronto Scarborough, Toronto, Ontario, Canada

9. Wurzberger F, Schwenker F. Learning in Deep Radial Basis Function Networks. Entropy (Basel) 2024; 26:368. [PMID: 38785617 PMCID: PMC11120405 DOI: 10.3390/e26050368]
Abstract
Learning in neural networks with locally tuned neuron models such as Radial Basis Function (RBF) networks is often seen as unstable, in particular when multi-layered architectures are used. Furthermore, universal approximation theorems for single-layered RBF networks are well established; therefore, deeper architectures are theoretically not required. Consequently, RBFs are mostly used in a single-layered manner. However, deep neural networks have proven their effectiveness on many different tasks. In this paper, we show that deeper RBF architectures with multiple radial basis function layers can be designed together with efficient learning schemes. We introduce an initialization scheme for deep RBF networks based on k-means clustering and covariance estimation. We further show how to use convolutions to speed up the calculation of the Mahalanobis distance in a partially connected way, similar to convolutional neural networks (CNNs). Finally, we evaluate our approach on image classification as well as speech emotion recognition tasks. Our results show that deep RBF networks perform very well, with results comparable to other deep neural network types, such as CNNs.
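
To make the described initialization and distance computation concrete, the sketch below implements a single RBF layer with k-means centres, per-centre covariance estimates, and Mahalanobis-based activations; stacking fitted layers greedily gives a toy "deep" variant. It is a simplified illustration under these assumptions, not the authors' convolutional, partially connected formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

class RBFLayer:
    """One RBF layer: k-means centres, per-centre covariances, Mahalanobis activations."""
    def __init__(self, n_centres=10, reg=1e-3):
        self.n_centres, self.reg = n_centres, reg

    def fit(self, X):
        km = KMeans(n_clusters=self.n_centres, n_init=10, random_state=0).fit(X)
        self.centres = km.cluster_centers_
        self.precisions = []
        for k in range(self.n_centres):
            members = X[km.labels_ == k]
            cov = (np.cov(members, rowvar=False) if len(members) > 1
                   else np.zeros((X.shape[1], X.shape[1])))
            self.precisions.append(np.linalg.inv(cov + self.reg * np.eye(X.shape[1])))
        return self

    def transform(self, X):
        acts = np.empty((X.shape[0], self.n_centres))
        for k in range(self.n_centres):
            diff = X - self.centres[k]
            d2 = np.einsum("ij,jk,ik->i", diff, self.precisions[k], diff)  # squared Mahalanobis
            acts[:, k] = np.exp(-0.5 * d2)
        return acts

# Greedy layer-by-layer stacking on random data, purely for illustration.
X = np.random.randn(200, 8)
h1 = RBFLayer(16).fit(X).transform(X)
h2 = RBFLayer(8).fit(h1).transform(h1)
```
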

Affiliation(s)
- Fabian Wurzberger
  - Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Germany
- Friedhelm Schwenker
  - Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Germany

10. Wu D, Jia X, Rao W, Dou W, Li Y, Li B. Construction of a Chinese traditional instrumental music dataset: A validated set of naturalistic affective music excerpts. Behav Res Methods 2024; 56:3757-3778. [PMID: 38702502 PMCID: PMC11133124 DOI: 10.3758/s13428-024-02411-6]
Abstract
Music is omnipresent among human cultures and moves us both physically and emotionally. The perception of emotions in music is influenced by both psychophysical and cultural factors. Chinese traditional instrumental music differs significantly from Western music in cultural origin and music elements. However, previous studies on music emotion perception are based almost exclusively on Western music. Therefore, the construction of a dataset of Chinese traditional instrumental music is important for exploring the perception of music emotions in the context of Chinese culture. The present dataset included 273 10-second naturalistic music excerpts. We provided rating data for each excerpt on ten variables: familiarity, dimensional emotions (valence and arousal), and discrete emotions (anger, gentleness, happiness, peacefulness, sadness, solemnness, and transcendence). The excerpts were rated by a total of 168 participants on a seven-point Likert scale for the ten variables. Three labels for the excerpts were obtained: familiarity, discrete emotion, and cluster. Our dataset demonstrates good reliability, and we believe it could contribute to cross-cultural studies on emotional responses to music.

Affiliation(s)
- Di Wu
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
  - Zhejiang Philosophy and Social Science Laboratory for Research in Early Development and Childcare, Hangzhou Normal University, Hangzhou, 311121, China
- Xi Jia
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
  - Zhejiang Philosophy and Social Science Laboratory for Research in Early Development and Childcare, Hangzhou Normal University, Hangzhou, 311121, China
- Wenxin Rao
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
- Wenjie Dou
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
  - Zhejiang Philosophy and Social Science Laboratory for Research in Early Development and Childcare, Hangzhou Normal University, Hangzhou, 311121, China
- Yangping Li
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
  - School of Foreign Studies, Xi'an Jiaotong University, Xi'an, 710049, China
- Baoming Li
  - Institute of Brain Science and Department of Physiology, School of Basic Medical Sciences, Hangzhou Normal University, Hangzhou, 311121, China
  - Zhejiang Philosophy and Social Science Laboratory for Research in Early Development and Childcare, Hangzhou Normal University, Hangzhou, 311121, China

11. Kim HN, Taylor S. Differences of people with visual disabilities in the perceived intensity of emotion inferred from speech of sighted people in online communication settings. Disabil Rehabil Assist Technol 2024; 19:633-640. [PMID: 35997772 DOI: 10.1080/17483107.2022.2114555]
Abstract
PURPOSE: As humans convey information about emotions by speech signals, emotion recognition via auditory information is often employed to assess one's affective states. There are numerous ways of applying knowledge of emotional vocal expressions to system designs that adequately accommodate users' needs. Yet, little is known about how people with visual disabilities infer emotions from speech stimuli, especially via online platforms (e.g., Zoom). This study focussed on examining the degree to which they perceive emotions strongly or weakly (i.e., perceived intensity), and on investigating the degree to which their sociodemographic backgrounds affect the intensity levels of emotions they perceive when exposed to a set of emotional speech stimuli via Zoom.
MATERIALS AND METHODS: A convenience sample of 30 individuals with visual disabilities participated in Zoom interviews. Participants were given a set of emotional speech stimuli and reported the intensity level of the perceived emotions on a rating scale from 1 (weak) to 8 (strong).
RESULTS: When the participants were exposed to the emotional speech stimuli (calm, happy, fearful, sad, and neutral), they reported that neutral was the dominant emotion perceived with the greatest intensity. Individual differences were also observed in the perceived intensity of emotions, associated with sociodemographic backgrounds such as health, vision, job, and age.
CONCLUSIONS: The results of this study are anticipated to contribute to the fundamental knowledge that will be helpful for many stakeholders such as voice technology engineers, user experience designers, health professionals, and social workers providing support to people with visual disabilities.
IMPLICATIONS FOR REHABILITATION:
- Technologies equipped with alternative user interfaces (e.g., Siri, Alexa, and Google Voice Assistant) that meet the needs of people with visual disabilities can promote independent living and quality of life.
- Such technologies can also be equipped with systems that recognize emotions from users' voices, such that users can obtain services customized to their emotional needs or that adequately address their emotional challenges (e.g., early detection of onset, provision of advice, and so on).
- The results of this study can benefit health professionals (e.g., social workers) who work closely with clients who have visual disabilities (e.g., in virtual telehealth sessions), as they could gain insight into the clients' emotional struggles by hearing their voices, contributing to enhanced emotional intelligence. They can thus provide better services to their clients, building strong bonds and trust between health professionals and clients with visual disabilities even when they meet virtually (e.g., over Zoom).

Affiliation(s)
- Hyung Nam Kim
  - North Carolina A&T State University, Greensboro, NC, USA
- Shaniah Taylor
  - North Carolina A&T State University, Greensboro, NC, USA

12. Leung FYN, Stojanovik V, Jiang C, Liu F. Investigating implicit emotion processing in autism spectrum disorder across age groups: A cross-modal emotional priming study. Autism Res 2024; 17:824-837. [PMID: 38488319 DOI: 10.1002/aur.3124]
Abstract
Cumulating evidence suggests that atypical emotion processing in autism may generalize across different stimulus domains. However, this evidence comes from studies examining explicit emotion recognition. It remains unclear whether domain-general atypicality also applies to implicit emotion processing in autism and its implication for real-world social communication. To investigate this, we employed a novel cross-modal emotional priming task to assess implicit emotion processing of spoken/sung words (primes) through their influence on subsequent emotional judgment of faces/face-like objects (targets). We assessed whether implicit emotional priming differed between 38 autistic and 38 neurotypical individuals across age groups as a function of prime and target type. Results indicated no overall group differences across age groups, prime types, and target types. However, differential, domain-specific developmental patterns emerged for the autism and neurotypical groups. For neurotypical individuals, speech but not song primed the emotional judgment of faces across ages. This speech-orienting tendency was not observed across ages in the autism group, as priming of speech on faces was not seen in autistic adults. These results outline the importance of the delicate weighting between speech- versus song-orientation in implicit emotion processing throughout development, providing more nuanced insights into the emotion processing profile of autistic individuals.

Affiliation(s)
- Florence Y N Leung
  - School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
  - Department of Psychology, University of Bath, Bath, UK
- Vesna Stojanovik
  - School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
- Cunmei Jiang
  - Music College, Shanghai Normal University, Shanghai, China
- Fang Liu
  - School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK

13. Krumpholz C, Quigley C, Fusani L, Leder H. Vienna Talking Faces (ViTaFa): A multimodal person database with synchronized videos, images, and voices. Behav Res Methods 2024; 56:2923-2940. [PMID: 37950115 PMCID: PMC11133183 DOI: 10.3758/s13428-023-02264-5]
Abstract
Social perception relies on different sensory channels, including vision and audition, which are specifically important for judgements of appearance. Therefore, to understand multimodal integration in person perception, it is important to study both face and voice in a synchronized form. We introduce the Vienna Talking Faces (ViTaFa) database, a high-quality audiovisual database focused on multimodal research of social perception. ViTaFa includes different stimulus modalities: audiovisual dynamic, visual dynamic, visual static, and auditory dynamic. Stimuli were recorded and edited under highly standardized conditions and were collected from 40 real individuals, and the sample matches typical student samples in psychological research (young individuals aged 18 to 45). Stimuli include sequences of various types of spoken content from each person, including German sentences, words, reading passages, vowels, and language-unrelated pseudo-words. Recordings were made with different emotional expressions (neutral, happy, angry, sad, and flirtatious). ViTaFa is freely accessible for academic non-profit research after signing a confidentiality agreement form via https://osf.io/9jtzx/ and stands out from other databases due to its multimodal format, high quality, and comprehensive quantification of stimulus features and human judgements related to attractiveness. Additionally, over 200 human raters validated emotion expression of the stimuli. In summary, ViTaFa provides a valuable resource for investigating audiovisual signals of social perception.

Affiliation(s)
- Christina Krumpholz
  - Department of Cognition, Emotion, and Methods in Psychology, Faculty of Psychology, University of Vienna, Liebiggasse 5, 1010, Vienna, Austria
  - Konrad Lorenz Institute of Ethology, University of Veterinary Medicine, Vienna, Austria
  - Department of Behavioural and Cognitive Biology, University of Vienna, Vienna, Austria
- Cliodhna Quigley
  - Department of Behavioural and Cognitive Biology, University of Vienna, Vienna, Austria
  - Vienna Cognitive Science Hub, University of Vienna, Vienna, Austria
- Leonida Fusani
  - Konrad Lorenz Institute of Ethology, University of Veterinary Medicine, Vienna, Austria
  - Department of Behavioural and Cognitive Biology, University of Vienna, Vienna, Austria
  - Vienna Cognitive Science Hub, University of Vienna, Vienna, Austria
- Helmut Leder
  - Department of Cognition, Emotion, and Methods in Psychology, Faculty of Psychology, University of Vienna, Liebiggasse 5, 1010, Vienna, Austria
  - Vienna Cognitive Science Hub, University of Vienna, Vienna, Austria

14. Sadok S, Leglaive S, Girin L, Alameda-Pineda X, Séguier R. A multimodal dynamical variational autoencoder for audiovisual speech representation learning. Neural Netw 2024; 172:106120. [PMID: 38266474 DOI: 10.1016/j.neunet.2024.106120]
Abstract
High-dimensional data such as natural images or speech signals exhibit some form of regularity, preventing their dimensions from varying independently. This suggests that there exists a lower dimensional latent representation from which the high-dimensional observed data were generated. Uncovering the hidden explanatory features of complex data is the goal of representation learning, and deep latent variable generative models have emerged as promising unsupervised approaches. In particular, the variational autoencoder (VAE) which is equipped with both a generative and an inference model allows for the analysis, transformation, and generation of various types of data. Over the past few years, the VAE has been extended to deal with data that are either multimodal or dynamical (i.e., sequential). In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
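
For orientation, every VAE variant, including the dynamical and multimodal extensions described here, is trained by maximizing an evidence lower bound (ELBO). The generic single-latent form is shown below as a reminder; it is not the MDVAE objective itself, whose latent variable is factored into static, dynamical, modality-specific, and modality-common parts:

$$
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right) \;\le\; \log p_\theta(x).
$$

In the multimodal, dynamical case the latent z is replaced by that structured set of variables and the expectation runs over a sequence, but the training principle remains the maximization of such a bound.
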

Affiliation(s)
- Laurent Girin
  - Univ. Grenoble Alpes CNRS, Grenoble-INP, GIPSA-lab, France

15. Hsu JH, Wu CH, Lin ECL, Chen PS. MoodSensing: A smartphone app for digital phenotyping and assessment of bipolar disorder. Psychiatry Res 2024; 334:115790. [PMID: 38401488 DOI: 10.1016/j.psychres.2024.115790]
Abstract
BACKGROUND: Daily life tracking has proven to be of great help in the assessment of patients with bipolar disorder. Although there are many smartphone apps for tracking bipolar disorder, most of them lack academic validation, a privacy policy, and long-term maintenance.
METHODS: Our developed app, MoodSensing, aims to collect users' digital phenotyping for the assessment of bipolar disorder. The data collection was approved by the Institutional Review Board. This study collaborated with professional clinicians to ensure that the app meets both clinical needs and user experience requirements. Based on the collected digital phenotyping, deep learning techniques were applied to forecast participants' weekly HAM-D and YMRS scale scores.
RESULTS: In experiments, the data collected by our app effectively predicted the scale scores, reaching mean absolute errors of 0.84 and 0.22 on the two scales. The statistical data also demonstrate an increase in user engagement.
CONCLUSIONS: Our analysis reveals that the developed MoodSensing app not only provides a good user experience, but the recorded data also have discriminative value for clinical assessment. The app provides relevant policies to protect user privacy and has been launched in the Apple App Store and Google Play Store.

Affiliation(s)
- Jia-Hao Hsu
  - Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
- Chung-Hsien Wu
  - Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
- Po-See Chen
  - Institute of Behavioral Medicine, College of Medicine, National Cheng Kung University, Taiwan

16. Diemerling H, Stresemann L, Braun T, von Oertzen T. Implementing machine learning techniques for continuous emotion prediction from uniformly segmented voice recordings. Front Psychol 2024; 15:1300996. [PMID: 38572198 PMCID: PMC10987695 DOI: 10.3389/fpsyg.2024.1300996]
Abstract
Introduction: Emotion recognition from audio recordings is a rapidly advancing field, with significant implications for artificial intelligence and human-computer interaction. This study introduces a novel method for detecting emotions from short, 1.5-second audio samples, aiming to improve accuracy and efficiency in emotion recognition technologies.
Methods: We utilized 1,510 unique audio samples from two databases in German and English to train our models. We extracted various features for emotion prediction, employing Deep Neural Networks (DNN) for general feature analysis, Convolutional Neural Networks (CNN) for spectrogram analysis, and a hybrid model combining both approaches (C-DNN). The study addressed challenges associated with dataset heterogeneity, language differences, and the complexities of audio sample trimming.
Results: Our models demonstrated accuracy significantly surpassing random guessing and aligning closely with human evaluative benchmarks, indicating the effectiveness of our approach in recognizing emotional states from brief audio clips.
Discussion: Despite the challenges of integrating diverse datasets and managing short audio samples, our findings suggest considerable potential for this methodology in real-time emotion detection from continuous speech, which could contribute to improving the emotional intelligence of AI and its applications in various areas.
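
As a generic illustration of the spectrogram branch (not the authors' C-DNN architecture), the sketch below converts a 1.5-second clip into a log-mel spectrogram and passes it through a small convolutional classifier; the sampling rate, mel parameters, and seven-class output are assumptions.

```python
import librosa
import torch
import torch.nn as nn

def log_mel(path, sr=16000, duration=1.5, n_mels=64):
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))   # pad/trim to exactly 1.5 s
    return librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))

class SmallCNN(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_mels, frames)
        return self.classifier(self.features(x).flatten(1))

spec = log_mel("clip.wav")                 # hypothetical 1.5-second file
logits = SmallCNN()(torch.tensor(spec, dtype=torch.float32)[None, None])
```

A hybrid model in the spirit of the C-DNN would concatenate such CNN embeddings with a DNN over hand-picked features before the final classification layer.
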

Affiliation(s)
- Hannes Diemerling
  - Center for Lifespan Psychology, Max Planck Institute for Human Development, Berlin, Germany
  - Thomas Bayes Institute, Berlin, Germany
  - Department of Psychology, Humboldt-Universität zu Berlin, Berlin, Germany
  - Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
- Leonie Stresemann
  - Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
- Tina Braun
  - Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
  - Department of Psychology, Charlotte-Fresenius University, Wiesbaden, Germany
- Timo von Oertzen
  - Center for Lifespan Psychology, Max Planck Institute for Human Development, Berlin, Germany
  - Thomas Bayes Institute, Berlin, Germany

17. Lingelbach K, Vukelić M, Rieger JW. GAUDIE: Development, validation, and exploration of a naturalistic German AUDItory Emotional database. Behav Res Methods 2024; 56:2049-2063. [PMID: 37221343 PMCID: PMC10991051 DOI: 10.3758/s13428-023-02135-z]
Abstract
Since thoroughly validated naturalistic affective German speech stimulus databases are rare, we present here a novel validated database of speech sequences assembled for the purpose of emotion induction. The database comprises 37 audio speech sequences with a total duration of 92 minutes for the induction of positive, neutral, and negative emotion: comedy shows intended to elicit humorous and amusing feelings, weather forecasts, and arguments between couples and relatives from movies or television series. Multiple continuous and discrete ratings are used to validate the database and to capture the time course and variability of valence and arousal. We analyse and quantify how well the audio sequences fulfil the quality criteria of differentiation, salience/strength, and generalizability across participants. Hence, we provide a validated speech database of naturalistic scenarios suitable for investigating emotion processing and its time course with German-speaking participants. Information on using the stimulus database for research purposes can be found at the OSF project repository GAUDIE: https://osf.io/xyr6j/.

Affiliation(s)
- Katharina Lingelbach
  - Fraunhofer Institute for Industrial Engineering IAO, Nobelstraße 12, 70569, Stuttgart, Germany
  - Department of Psychology, University of Oldenburg, Oldenburg, Germany
- Mathias Vukelić
  - Fraunhofer Institute for Industrial Engineering IAO, Nobelstraße 12, 70569, Stuttgart, Germany
- Jochem W Rieger
  - Department of Psychology, University of Oldenburg, Oldenburg, Germany

18. Cooper H, Jennings BJ, Kumari V, Willard AK, Bennetts RJ. The association between childhood trauma and emotion recognition is reduced or eliminated when controlling for alexithymia and psychopathy traits. Sci Rep 2024; 14:3413. [PMID: 38341493 PMCID: PMC10858958 DOI: 10.1038/s41598-024-53421-5]
Abstract
Emotion recognition shows large inter-individual variability and is substantially affected by childhood trauma as well as by modality, emotion portrayed, and intensity. While research suggests childhood trauma influences emotion recognition, it is unclear whether this effect is consistent when controlling for interrelated individual differences. Further, the universality of the effects has not been explored, as most studies have not examined differing modalities or intensities. This study examined childhood trauma's association with recognition accuracy when controlling for alexithymia and psychopathy traits, and whether this varied across modality, emotion portrayed, and intensity. An adult sample (N = 122) completed childhood trauma, alexithymia, and psychopathy questionnaires and three emotion tasks: faces, voices, and audio-visual. When investigating childhood trauma alone, there was a significant association with poorer accuracy when exploring modality, emotion portrayed, and intensity. When controlling for alexithymia and psychopathy, childhood trauma remained significant when exploring emotion portrayed; however, it was no longer significant when exploring modality and intensity. In fact, alexithymia was significant when exploring intensity. The effect sizes overall were small. Our findings suggest the importance of controlling for interrelated individual differences. Future research should explore more sensitive measures of emotion recognition, such as intensity ratings and sensitivity to intensity, to see if these follow the accuracy findings.

Affiliation(s)
- Holly Cooper
  - Division of Psychology, College of Health, Medicine, and Life Sciences, Brunel University London, Uxbridge, UB8 3PH, UK
- Ben J Jennings
  - Division of Psychology, College of Health, Medicine, and Life Sciences, Brunel University London, Uxbridge, UB8 3PH, UK
- Veena Kumari
  - Division of Psychology, College of Health, Medicine, and Life Sciences, Brunel University London, Uxbridge, UB8 3PH, UK
- Aiyana K Willard
  - Division of Psychology, College of Health, Medicine, and Life Sciences, Brunel University London, Uxbridge, UB8 3PH, UK
- Rachel J Bennetts
  - Division of Psychology, College of Health, Medicine, and Life Sciences, Brunel University London, Uxbridge, UB8 3PH, UK

19. Islam B, McElwain NL, Li J, Davila MI, Hu Y, Hu K, Bodway JM, Dhekne A, Roy Choudhury R, Hasegawa-Johnson M. Preliminary Technical Validation of LittleBeats™: A Multimodal Sensing Platform to Capture Cardiac Physiology, Motion, and Vocalizations. Sensors (Basel) 2024; 24:901. [PMID: 38339617 PMCID: PMC10857055 DOI: 10.3390/s24030901]
Abstract
Across five studies, we present the preliminary technical validation of an infant-wearable platform, LittleBeats™, that integrates electrocardiogram (ECG), inertial measurement unit (IMU), and audio sensors. Each sensor modality is validated against data from gold-standard equipment using established algorithms and laboratory tasks. Interbeat interval (IBI) data obtained from the LittleBeats™ ECG sensor indicate acceptable mean absolute percent error rates for both adults (Study 1, N = 16) and infants (Study 2, N = 5) across low- and high-challenge sessions and expected patterns of change in respiratory sinus arrhythmia (RSA). For automated activity recognition (upright vs. walk vs. glide vs. squat) using accelerometer data from the LittleBeats™ IMU (Study 3, N = 12 adults), performance was good to excellent, with smartphone (industry standard) data outperforming LittleBeats™ by less than 4 percentage points. Speech emotion recognition (Study 4, N = 8 adults) applied to LittleBeats™ versus smartphone audio data indicated comparable performance, with no significant difference in error rates. On an automatic speech recognition task (Study 5, N = 12 adults), the best performing algorithm yielded relatively low word error rates, although error rates were somewhat higher for LittleBeats™ (4.16%) than for the smartphone (2.73%). Together, these validation studies indicate that LittleBeats™ sensors yield data quality that is largely comparable to that obtained from gold-standard devices and established protocols used in prior research.
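
The interbeat-interval (IBI) criterion used in Studies 1 and 2 boils down to comparing device-derived IBIs against a reference recording. The numpy sketch below is a hypothetical illustration of deriving IBIs from R-peak times and computing the mean absolute percent error; the peak arrays are assumed to be already detected and beat-aligned.

```python
import numpy as np

def interbeat_intervals(r_peak_times_s):
    """IBIs in milliseconds from an array of R-peak timestamps in seconds."""
    return np.diff(np.asarray(r_peak_times_s)) * 1000.0

def mean_abs_percent_error(test_ibi, reference_ibi):
    test_ibi, reference_ibi = np.asarray(test_ibi), np.asarray(reference_ibi)
    return np.mean(np.abs(test_ibi - reference_ibi) / reference_ibi) * 100.0

# Hypothetical, already-aligned R-peak times (s) from the wearable and a reference ECG.
wearable_peaks  = [0.00, 0.82, 1.66, 2.47, 3.31]
reference_peaks = [0.00, 0.80, 1.64, 2.46, 3.30]
mape = mean_abs_percent_error(interbeat_intervals(wearable_peaks),
                              interbeat_intervals(reference_peaks))
print(f"IBI MAPE: {mape:.2f}%")
```
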

Affiliation(s)
- Bashima Islam
  - Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609, USA
- Nancy L. McElwain
  - Department of Human Development and Family Studies, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
  - Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Jialu Li
  - Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Maria I. Davila
  - Research Triangle Institute, Research Triangle Park, NC 27709, USA
- Yannan Hu
  - Department of Human Development and Family Studies, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Kexin Hu
  - Department of Human Development and Family Studies, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Jordan M. Bodway
  - Department of Human Development and Family Studies, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Ashutosh Dhekne
  - School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Romit Roy Choudhury
  - Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Mark Hasegawa-Johnson
  - Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
  - Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA

20. Ge Y, Tang C, Li H, Chen Z, Wang J, Li W, Cooper J, Chetty K, Faccio D, Imran M, Abbasi QH. A comprehensive multimodal dataset for contactless lip reading and acoustic analysis. Sci Data 2023; 10:895. [PMID: 38092796 PMCID: PMC10719268 DOI: 10.1038/s41597-023-02793-w]
Abstract
Small-scale motion detection using non-invasive remote sensing techniques has recently garnered significant interest in the field of speech recognition. Our dataset paper aims to facilitate the enhancement and restoration of speech information from diverse data sources for speakers. In this paper, we introduce a novel multimodal dataset based on Radio Frequency, visual, text, audio, laser and lip landmark information, also called RVTALL. Specifically, the dataset consists of 7.5 GHz Channel Impulse Response (CIR) data from ultra-wideband (UWB) radars, 77 GHz frequency modulated continuous wave (FMCW) data from millimeter wave (mmWave) radar, visual and audio information, lip landmarks and laser data, offering a unique multimodal approach to speech recognition research. Meanwhile, a depth camera is adopted to record the landmarks of the subject's lip and voice. Approximately 400 minutes of annotated speech profiles are provided, which are collected from 20 participants speaking 5 vowels, 15 words, and 16 sentences. The dataset has been validated and has potential for the investigation of lip reading and multimodal speech recognition.

Affiliation(s)
- Yao Ge
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
- Chong Tang
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
  - Department of Security and Crime Science, University College London, London, WC1E 6BT, UK
- Haobo Li
  - School of Physics & Astronomy, University of Glasgow, Glasgow, G12 8QQ, UK
- Zikang Chen
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
- Jingyan Wang
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
- Wenda Li
  - School of Science and Engineering, University of Dundee, Dundee, DD1 4HN, UK
- Jonathan Cooper
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
- Kevin Chetty
  - Department of Security and Crime Science, University College London, London, WC1E 6BT, UK
- Daniele Faccio
  - School of Physics & Astronomy, University of Glasgow, Glasgow, G12 8QQ, UK
- Muhammad Imran
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK
- Qammer H Abbasi
  - James Watt School of Engineering, University of Glasgow, Glasgow, G12 8QQ, UK

21. Billah MM, Sarker ML, Akhand M. KBES: A dataset for realistic Bangla speech emotion recognition with intensity level. Data Brief 2023; 51:109741. [PMID: 37965597 PMCID: PMC10641593 DOI: 10.1016/j.dib.2023.109741]
Abstract
Speech Emotion Recognition (SER) identifies and categorizes emotional states by analyzing speech signals. SER is an emerging research area using machine learning and deep learning techniques due to its socio-cultural and business importance. An appropriate dataset is an important resource for SER-related studies in a particular language. There is an apparent lack of SER datasets in the Bangla language, although it is one of the most spoken languages in the world. There are a few Bangla SER datasets, but those consist of only a few dialogs with a minimal number of actors, making them unsuitable for real-world applications. Moreover, the existing datasets do not consider the intensity level of emotions, even though the intensity of a specific emotional expression, such as anger or sadness, plays a crucial role in social behavior. Therefore, a realistic Bangla speech dataset, called the KUET Bangla Emotional Speech (KBES) dataset, is developed in this study. The dataset consists of 900 audio signals (i.e., speech dialogs) from 35 actors (20 females and 15 males) with diverse age ranges. The speech dialogs are sourced from Bangla telefilms, dramas, TV series, and web series. There are five emotional categories: Neutral, Happy, Sad, Angry, and Disgust. Except for Neutral, samples of each emotion are divided into two intensity levels: Low and High. A distinguishing feature of the dataset is that the speech dialogs are almost all unique and come from a relatively large number of actors, whereas existing datasets (such as SUBESCO and BanglaSER) contain a few pre-defined dialogs spoken repeatedly by a few actors/research volunteers in a laboratory environment. Finally, the KBES dataset is presented as a nine-class problem, classifying emotions into nine categories: Neutral, Happy (Low), Happy (High), Sad (Low), Sad (High), Angry (Low), Angry (High), Disgust (Low), and Disgust (High). The dataset is kept balanced, containing 100 samples for each of the nine classes; each class is also gender balanced, with 50 samples each from male and female actors. The developed dataset appears more realistic when compared with existing SER datasets.
Collapse
Affiliation(s)
- Md. Masum Billah
- Department of Computer Science and Engineering, Khulna University of Engineering & Technology (KUET), Bangladesh
| | - Md. Likhon Sarker
- Department of Computer Science and Engineering, Khulna University of Engineering & Technology (KUET), Bangladesh
| | - M. A. H. Akhand
- Department of Computer Science and Engineering, Khulna University of Engineering & Technology (KUET), Bangladesh
| |
Collapse
|
22
|
Won NR, Son YD, Kim SM, Bae S, Kim JH, Kim JH, Han DH. Attention Circuits Mediate the Connection between Emotional Experience and Expression within the Emotional Circuit. CLINICAL PSYCHOPHARMACOLOGY AND NEUROSCIENCE : THE OFFICIAL SCIENTIFIC JOURNAL OF THE KOREAN COLLEGE OF NEUROPSYCHOPHARMACOLOGY 2023; 21:715-723. [PMID: 37859444 PMCID: PMC10591168 DOI: 10.9758/cpn.22.1029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Revised: 10/24/2022] [Accepted: 10/25/2022] [Indexed: 10/21/2023]
Abstract
Objective : Most affective neuroscience studies use pictures from the International Affective Picture System or standard facial expressions to elicit emotional experiences. The attention system, including the prefrontal cortex, can mediate emotional regulation in response to stimulation with emotional faces. We hypothesized that emotional experience is associated with brain activity within the neocortex. In addition, modification within the neocortex may be associated with brain activity within the attention system. Methods : Thirty-one healthy adult participants were recruited and assessed for emotional expression using clinical scales of happiness, sadness, anxiety, and anger, and for emotional experience using brain activity in response to pictures of facial emotional expressions. The attention system was assessed using brain activity in response to the go-no-go task. Results : We found that emotional experience was associated with brain activity within the frontotemporal cortices, while emotional expression was associated with brain activity within the temporal and insular cortices. In addition, the association of brain activity between emotional experiences and expressions of sadness and anxiety was affected by brain activity within the anterior cingulate gyrus in response to the go-no-go task. Conclusion : Emotional expression may be associated with brain activity within the temporal cortex, whereas emotional experience may be associated with brain activity within the frontotemporal cortices. In addition, the attention system may interfere with the connection between emotional expression and experience.
Collapse
Affiliation(s)
- Na Rae Won
- Department of Psychiatry, Chung-Ang University Hospital, Seoul, Korea
| | - Young-Don Son
- Department of Biomedical Engineering, Gachon University, Seongnam, Korea
| | - Sun Mi Kim
- Department of Psychiatry, Chung-Ang University Hospital, Seoul, Korea
| | - Sujin Bae
- Department of Psychiatry, Chung-Ang University Hospital, Seoul, Korea
| | - Jeong Hee Kim
- Department of Biomedical Engineering, Gachon University, Seongnam, Korea
| | - Jong-Hoon Kim
- Department of Psychiatry, Gachon University Gil Medical Center, Gachon University College of Medicine, Incheon, Korea
| | - Doug Hyun Han
- Department of Psychiatry, Chung-Ang University Hospital, Seoul, Korea
| |
Collapse
|
23
|
Rezapour Mashhadi MM, Osei-Bonsu K. Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest. PLoS One 2023; 18:e0291500. [PMID: 37988352 PMCID: PMC10662716 DOI: 10.1371/journal.pone.0291500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Accepted: 08/31/2023] [Indexed: 11/23/2023] Open
Abstract
Speech is a direct and rich way of transmitting information and emotions from one point to another. In this study, we aimed to classify different emotions in speech using various audio features and machine learning models. We extracted various types of audio features, such as Mel-frequency cepstral coefficients, chromagram, Mel-scale spectrogram, spectral contrast, Tonnetz representation, and zero-crossing rate. We used a limited speech emotion recognition (SER) dataset and augmented it with additional audio recordings. In addition, in contrast to many previous studies, we combined all audio files before conducting our analysis. We compared the performance of two models: a one-dimensional convolutional neural network (conv1D) and a random forest (RF), with RF-based feature selection. Our results showed that RF with feature selection achieved higher average accuracy (69%) than conv1D and had the highest precision for fear (72%) and the highest recall for calm (84%). Our study demonstrates the effectiveness of RF with feature selection for speech emotion classification using a limited dataset. For both algorithms, we found that anger is misclassified mostly as happy, disgust as sad and neutral, and fear as sad. This could be due to the similarity of some acoustic features between these emotions, such as pitch, intensity, and tempo.
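The feature set named above maps closely onto standard librosa calls. As a rough illustration (not the authors' exact pipeline), the sketch below extracts the listed features, averages each over time, and fits a random forest; the synthetic signals, labels, and forest size are placeholders for a real SER dataset.

```python
# Rough sketch of the feature-extraction stage described above (not the authors' exact
# pipeline): extract the listed librosa features, average each over time, and feed the
# stacked vectors to a random forest. Signals and labels are synthetic stand-ins.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def extract_features(y, sr):
    """Return one fixed-length feature vector (time-averaged) for an utterance."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    return np.concatenate([f.mean(axis=1) for f in (mfcc, chroma, mel, contrast, tonnetz, zcr)])

# Synthetic stand-ins for two utterances; in practice use y, sr = librosa.load("utterance.wav").
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
X = np.vstack([extract_features(np.sin(2 * np.pi * f0 * t), sr) for f0 in (180.0, 260.0)])
labels = ["angry", "calm"]                          # placeholder labels
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, labels)
# clf.feature_importances_ can then drive RF-based feature selection, as in the study.
```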
Collapse
|
24
|
Li N, Ross R. Invoking and identifying task-oriented interlocutor confusion in human-robot interaction. Front Robot AI 2023; 10:1244381. [PMID: 38054199 PMCID: PMC10694506 DOI: 10.3389/frobt.2023.1244381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2023] [Accepted: 10/31/2023] [Indexed: 12/07/2023] Open
Abstract
Successful conversational interaction with a social robot requires not only an assessment of a user's contribution to an interaction, but also awareness of their emotional and attitudinal states as the interaction unfolds. To this end, our research aims to systematically trigger and then interpret human behaviors to track different states of potential user confusion in interaction, so that systems can be primed to adjust their policies when users enter confusion states. In this paper, we present a detailed human-robot interaction study to prompt, investigate, and eventually detect confusion states in users. The study itself employs a Wizard-of-Oz (WoZ) style design with a Pepper robot to prompt confusion states for task-oriented dialogues in a well-defined manner. The data collected from 81 participants include audio and visual data, from both the robot's perspective and the environment, as well as participant survey data. From these data, we evaluated the correlations of induced confusion conditions with multimodal data, including eye gaze estimation, head pose estimation, facial emotion detection, silence duration, and user speech analysis (including emotion and pitch analysis). Analysis shows significant differences in participants' behaviors in states of confusion based on these signals, as well as a strong correlation between confusion conditions and participants' own self-reported confusion scores. The paper establishes strong correlations between confusion levels and these observable features, and lays the groundwork for a more complete social- and affect-oriented strategy for task-oriented human-robot interaction. The contributions of this paper include the methodology applied, the dataset, and our systematic analysis.
Collapse
Affiliation(s)
- Na Li
- School of Computer Science, Technological University Dublin, Dublin, Ireland
| | | |
Collapse
|
25
|
Franca M, Bolognini N, Brysbaert M. Seeing emotions in the eyes: a validated test to study individual differences in the perception of basic emotions. Cogn Res Princ Implic 2023; 8:67. [PMID: 37919608 PMCID: PMC10622392 DOI: 10.1186/s41235-023-00521-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 10/20/2023] [Indexed: 11/04/2023] Open
Abstract
People are able to perceive emotions in the eyes of others and can therefore see emotions when individuals wear face masks. Research has been hampered by the lack of a good test to measure basic emotions in the eyes. In two studies with 358 and 200 participants, respectively, we developed a test measuring the ability to see anger, disgust, fear, happiness, sadness, and surprise in images of eyes. Each emotion is measured with 8 stimuli (4 male actors and 4 female actors), matched in terms of difficulty and item discrimination. Participants reliably differed in their performance on the Seeing Emotions in the Eyes test (SEE-48). The test correlated well not only with the Reading the Mind in the Eyes Test (RMET) but also with the Situational Test of Emotion Understanding (STEU), indicating that the SEE-48 measures not only low-level perceptual skills but also broader skills of emotion perception and emotional intelligence. The test is freely available for research and clinical purposes.
Collapse
Affiliation(s)
- Maria Franca
- Ph.D. Program in Neuroscience, School of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy
| | - Nadia Bolognini
- Department of Psychology and NeuroMI - Milan Centre for Neuroscience, University of Milano-Bicocca, Milan, Italy.
- Laboratory of Neuropsychology, Department of Neurorehabilitation Sciences, IRCCS Istituto Auxologico Italiano, Via Mercalli 32, 20122, Milan, Italy.
| | - Marc Brysbaert
- Department of Experimental Psychology, Ghent University, H. Dunantlaan 2, 9000, Ghent, Belgium.
| |
Collapse
|
26
|
Caulley D, Alemu Y, Burson S, Cárdenas Bautista E, Abebe Tadesse G, Kottmyer C, Aeschbach L, Cheungvivatpant B, Sezgin E. Objectively Quantifying Pediatric Psychiatric Severity Using Artificial Intelligence, Voice Recognition Technology, and Universal Emotions: Pilot Study for Artificial Intelligence-Enabled Innovation to Address Youth Mental Health Crisis. JMIR Res Protoc 2023; 12:e51912. [PMID: 37870890 PMCID: PMC10628686 DOI: 10.2196/51912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2023] [Revised: 09/14/2023] [Accepted: 09/18/2023] [Indexed: 10/24/2023] Open
Abstract
BACKGROUND Providing psychotherapy, particularly for youth, is a pressing challenge in the health care system. Traditional methods are resource-intensive, and there is a need for objective benchmarks to guide therapeutic interventions. Automated emotion detection from speech, using artificial intelligence, presents an emerging approach to address these challenges. Speech can carry vital information about emotional states, which can be used to improve mental health care services, especially when the person is suffering. OBJECTIVE This study aims to develop and evaluate automated methods for detecting the intensity of emotions (anger, fear, sadness, and happiness) in audio recordings of patients' speech. We also demonstrate the viability of deploying the models. Our model was validated in a previous publication by Alemu et al with limited voice samples. This follow-up study used significantly more voice samples to validate the previous model. METHODS We used audio recordings of patients, specifically children with high adverse childhood experience (ACE) scores; the average ACE score was 5 or higher, placing them at the highest risk for chronic disease and social or emotional problems (only 1 in 6 have a score of 4 or above). Structured voice samples were collected by having patients read a fixed script. In total, 4 highly trained therapists classified audio segments, scoring the intensity level of each of the 4 emotions. We experimented with various preprocessing methods, including denoising, voice-activity detection, and diarization. Additionally, we explored various model architectures, including convolutional neural networks (CNNs) and transformers. We trained emotion-specific transformer-based models and a generalized CNN-based model to predict emotion intensities. RESULTS The emotion-specific transformer-based model achieved a test-set precision and recall of 86% and 79%, respectively, for binary emotional intensity classification (high or low). In contrast, the CNN-based model, generalized to predict the intensity of 4 different emotions, achieved test-set precision and recall of 83% for each. CONCLUSIONS Automated emotion detection from patients' speech using artificial intelligence models is found to be feasible, leading to a high level of accuracy. The transformer-based model exhibited better performance in emotion-specific detection, while the CNN-based model showed promise in generalized emotion detection. These models can serve as valuable decision-support tools for pediatricians and mental health providers to triage youth to appropriate levels of mental health care services. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR1-10.2196/51912.
Collapse
Affiliation(s)
- Desmond Caulley
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Yared Alemu
- TQIntelligence, Inc, Atlanta, GA, United States
- Department of Psychiatry and Behavioral Sciences, Computational Psych Program, Morehouse School of Medicine, Atlanta, GA, United States
| | | | - Elizabeth Cárdenas Bautista
- TQIntelligence, Inc, Atlanta, GA, United States
- Department of Psychiatry and Behavioral Sciences, Computational Psych Program, Morehouse School of Medicine, Atlanta, GA, United States
| | | | - Christopher Kottmyer
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Laurent Aeschbach
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Bryan Cheungvivatpant
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Emre Sezgin
- Abigail Wexner Research Institute, Nationwide Children's Hospital, Columbus, OH, United States
| |
Collapse
|
27
|
Zhou D, Cheng Y, Wen L, Luo H, Liu Y. Drivers' Comprehensive Emotion Recognition Based on HAM. SENSORS (BASEL, SWITZERLAND) 2023; 23:8293. [PMID: 37837124 PMCID: PMC10574905 DOI: 10.3390/s23198293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Revised: 09/30/2023] [Accepted: 10/05/2023] [Indexed: 10/15/2023]
Abstract
Negative emotions in drivers may lead to dangerous driving behaviors, which in turn can cause serious traffic accidents. However, most current studies on driver emotion use a single modality, such as EEG, eye tracking, or driving data. In complex situations, a single modality may not fully capture a driver's emotional state and can provide poor robustness. In recent years, some studies have used multimodal approaches to monitor single emotions such as driver fatigue or anger, but in real driving environments negative emotions such as sadness, anger, fear, and fatigue all have a significant impact on driving safety, and very few studies have used multimodal data to accurately predict drivers' comprehensive emotions. Therefore, based on a multimodal approach, this paper aims to improve comprehensive driver emotion recognition. By combining three modalities (a driver's voice, facial image, and video sequence), drivers' emotions are classified into six categories: sadness, anger, fear, fatigue, happiness, and neutrality. To accurately identify drivers' negative emotions and improve driving safety, this paper proposes a multimodal fusion framework based on a CNN + Bi-LSTM + HAM architecture. The framework fuses feature vectors of driver audio, facial expressions, and video sequences for comprehensive driver emotion recognition. Experiments demonstrate the effectiveness of the proposed multimodal approach for driver emotion recognition, reaching a recognition accuracy of 85.52%. The validity of the method is further verified through comparative experiments and evaluation metrics such as accuracy and F1 score.
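The abstract describes feature-level fusion of audio, facial-image, and video-sequence vectors before classification. The PyTorch sketch below shows one minimal way such fusion could look; the feature dimensions, the simple gating layer standing in for an attention module, and the class count are assumptions, and the paper's CNN + Bi-LSTM + HAM architecture is not reproduced.

```python
# Minimal sketch of fusing per-modality feature vectors before a joint classifier.
# Dimensions, the gating layer, and the class count are illustrative assumptions.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, d_audio=128, d_face=256, d_video=256, n_classes=6):
        super().__init__()
        d = d_audio + d_face + d_video
        self.gate = nn.Sequential(nn.Linear(d, d), nn.Sigmoid())   # simple gating over fused features
        self.head = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, audio_feat, face_feat, video_feat):
        fused = torch.cat([audio_feat, face_feat, video_feat], dim=-1)
        return self.head(fused * self.gate(fused))

model = FusionClassifier()
logits = model(torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 256))
print(logits.shape)   # torch.Size([4, 6])
```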
Collapse
Affiliation(s)
- Dongmei Zhou
- School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China; (D.Z.); (L.W.); (H.L.)
| | - Yongjian Cheng
- School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China; (D.Z.); (L.W.); (H.L.)
| | - Luhan Wen
- School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China; (D.Z.); (L.W.); (H.L.)
| | - Hao Luo
- School of Mechanical and Electrical Engineering, Chengdu University of Technology, Chengdu 610059, China; (D.Z.); (L.W.); (H.L.)
| | - Ying Liu
- China Unicom Digital Technology Co., Ltd. Hubei Branch, Wuhan 430015, China;
| |
Collapse
|
28
|
Balel Y, Mercuri LG. Does Emotional State Improve Following Temporomandibular Joint Total Joint Replacement? J Oral Maxillofac Surg 2023; 81:1196-1203. [PMID: 37490998 DOI: 10.1016/j.joms.2023.06.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2023] [Revised: 06/23/2023] [Accepted: 06/26/2023] [Indexed: 07/27/2023]
Abstract
BACKGROUND Temporomandibular joint total joint replacement (TMJTJR) offers patients the opportunity for improved function and reduced pain. TMJTJR also has the potential to affect a patient's emotions in a positive or negative manner. PURPOSE The purpose of this study was to evaluate changes in emotional state for subjects undergoing TMJTJR. STUDY DESIGN, SETTING, SAMPLE The authors implemented a retrospective cohort study. Subjects who received TMJTJR were identified from the TMJ Inter Network, which is a study group comprising more than 130 temporomandibular joint surgeons. Subjects between the ages of 18 and 65 years with complete medical records and pre/post TMJTJR video/audio recordings were enrolled in the study. PREDICTOR VARIABLE The predictor variable was time (preoperative and postoperative). MAIN OUTCOME VARIABLES The primary outcome variable was change in emotional state. All subjects had a preoperative (T0) recorded interview as well as a postoperative (T1) interview at 3 to 6 months. The eight-category emotional state was classified as neutral, happy, sad, angry, fearful, disgusted, surprised, and bored. The three-category emotional state was classified as neutral, positive, and negative. The emotional state was measured using artificial intelligence at T0 and T1. The secondary outcome variables were pain score and maximal interincisal opening. COVARIATES The covariates were gender, age, diagnosis, prosthetic side, TMJTJR design, and TMJTJR type. ANALYSES The relationship between emotional state change and covariates was examined using both the χ2 test and the Kruskal-Wallis H test. The significance of the change in categorical data after surgery was examined using the McNemar-Bowker test. P values < .05 were considered statistically significant. RESULTS Thirty-three subjects were included in the study. The mean age was 30.09 ± 8.69 years, with 15 males (45%) and 18 females (55%). The percentage of subjects with preoperative neutral, happy, sad, angry, and fearful emotional states was 24, 15, 24, 9, and 27%, respectively. The percentage of subjects with postoperative neutral, happy, sad, angry, and fearful emotional states was 21, 39, 21, 12, and 6%, respectively. The change in emotional state was statistically significant (P = .037). There was no statistically significant relationship between covariates and emotional state changes (P > .05). CONCLUSION According to the assessment of artificial intelligence, TMJTJR improves the emotional state of patients.
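The McNemar-Bowker test used in the analysis is available in statsmodels as Bowker's test of symmetry on a square pre/post contingency table. The sketch below uses a made-up 3 x 3 table (preoperative vs. postoperative three-category state), not the study's data.

```python
# Bowker's test of symmetry (McNemar-Bowker) on a square pre/post contingency table.
# The 3 x 3 table below is invented for illustration; it is not the study's data.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# Rows: preoperative state (neutral, positive, negative); columns: postoperative state.
table = np.array([[5, 4, 1],
                  [1, 8, 0],
                  [2, 9, 3]])
result = SquareTable(table, shift_zeros=False).symmetry(method="bowker")
print(result.statistic, result.pvalue)
```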
Collapse
Affiliation(s)
- Yunus Balel
- Consultant, Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, Tokat Gaziosmanpaşa University, Tokat, Turkey; Consultant, Department of Oral and Maxillofacial Surgery, TR Ministry of Health, Oral and Dental Health Hospital, Sivas, Turkey.
| | - Louis G Mercuri
- Visiting Professor, Department of Orthopedic Surgery, Rush University Medical Center, Chicago, IL
| |
Collapse
|
29
|
K A, Prasad S, Chakrabarty M. Trait anxiety modulates the detection sensitivity of negative affect in speech: an online pilot study. Front Behav Neurosci 2023; 17:1240043. [PMID: 37744950 PMCID: PMC10512416 DOI: 10.3389/fnbeh.2023.1240043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2023] [Accepted: 08/21/2023] [Indexed: 09/26/2023] Open
Abstract
Acoustic perception of emotions in speech is relevant for humans to navigate the social environment optimally. While sensory perception is known to be influenced by ambient noise and internal bodily states (e.g., emotional arousal and anxiety), their relationship to human auditory perception is relatively less understood. In a supervised, online pilot experiment conducted outside the artificially controlled laboratory environment, we asked whether the detection sensitivity of emotions conveyed by human speech-in-noise (acoustic signals) varies between individuals with relatively lower and higher levels of subclinical trait anxiety. In the task, participants (n = 28) discriminated the target emotion conveyed by temporally unpredictable acoustic signals (signal-to-noise ratio = 10 dB), which were manipulated at four levels (Happy, Neutral, Fear, and Disgust). We calculated the empirical area under the curve (a measure of acoustic signal detection sensitivity) based on signal detection theory to answer our questions. Individuals with High trait anxiety, relative to those with Low trait anxiety, showed significantly lower detection sensitivities to acoustic signals of the negative emotions Disgust and Fear, and significantly lower detection sensitivities averaged across all emotions. The results from this pilot study with a small but statistically relevant sample size suggest that trait anxiety levels influence the overall acoustic detection of speech-in-noise, especially signals conveying threatening/negative affect. The findings are relevant for future research on acoustic perception anomalies underlying affective traits and disorders.
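The empirical area under the curve used here as a detection-sensitivity measure can be computed directly from trial-level data. The sketch below is a minimal illustration with invented signal-present labels and response ratings, not the study's data or exact procedure.

```python
# Minimal illustration of an empirical ROC area (detection sensitivity) from trial-level data.
# The labels and ratings below are invented stand-ins, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score

signal_present = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # 1 = target emotion present in the noise
ratings = np.array([4, 3, 2, 1, 4, 2, 3, 1])          # higher = more confident "present" response
print("empirical AUC:", roc_auc_score(signal_present, ratings))
```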
Collapse
Affiliation(s)
- Achyuthanand K
- Department of Computational Biology, Indraprastha Institute of Information Technology Delhi, New Delhi, India
| | - Saurabh Prasad
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi, New Delhi, India
| | - Mrinmoy Chakrabarty
- Department of Social Sciences and Humanities, Indraprastha Institute of Information Technology Delhi, New Delhi, India
- Centre for Design and New Media, Indraprastha Institute of Information Technology Delhi, New Delhi, India
| |
Collapse
|
30
|
Şentürk YD, Tavacioglu EE, Duymaz İ, Sayim B, Alp N. The Sabancı University Dynamic Face Database (SUDFace): Development and validation of an audiovisual stimulus set of recited and free speeches with neutral facial expressions. Behav Res Methods 2023; 55:3078-3099. [PMID: 36018484 DOI: 10.3758/s13428-022-01951-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/06/2022] [Indexed: 11/08/2022]
Abstract
Faces convey a wide range of information, including one's identity, and emotional and mental states. Face perception is a major research topic in many research fields, such as cognitive science, social psychology, and neuroscience. Frequently, stimuli are selected from a range of available face databases. However, even though faces are highly dynamic, most databases consist of static face stimuli. Here, we introduce the Sabancı University Dynamic Face (SUDFace) database. The SUDFace database consists of 150 high-resolution audiovisual videos acquired in a controlled lab environment and stored with a resolution of 1920 × 1080 pixels at a frame rate of 60 Hz. The multimodal database consists of three videos of each human model in frontal view in three different conditions: vocalizing two scripted texts (conditions 1 and 2) and one Free Speech (condition 3). The main focus of the SUDFace database is to provide a large set of dynamic faces with neutral facial expressions and natural speech articulation. Variables such as face orientation, illumination, and accessories (piercings, earrings, facial hair, etc.) were kept constant across all stimuli. We provide detailed stimulus information, including facial features (pixel-wise calculations of face length, eye width, etc.) and speeches (e.g., duration of speech and repetitions). In two validation experiments, a total number of 227 participants rated each video on several psychological dimensions (e.g., neutralness and naturalness of expressions, valence, and the perceived mental states of the models) using Likert scales. The database is freely accessible for research purposes.
Collapse
Affiliation(s)
| | | | - İlker Duymaz
- Psychology, Sabancı University, Orta Mahalle, Tuzla, İstanbul, 34956, Turkey
| | - Bilge Sayim
- SCALab - Sciences Cognitives et Sciences Affectives, Université de Lille, CNRS, Lille, France
- Institute of Psychology, University of Bern, Fabrikstrasse 8, 3012, Bern, Switzerland
| | - Nihan Alp
- Psychology, Sabancı University, Orta Mahalle, Tuzla, İstanbul, 34956, Turkey.
| |
Collapse
|
31
|
Alhinti L, Cunningham S, Christensen H. The Dysarthric Expressed Emotional Database (DEED): An audio-visual database in British English. PLoS One 2023; 18:e0287971. [PMID: 37549162 PMCID: PMC10406321 DOI: 10.1371/journal.pone.0287971] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Accepted: 06/19/2023] [Indexed: 08/09/2023] Open
Abstract
The Dysarthric Expressed Emotional Database (DEED) is a novel, parallel multimodal (audio-visual) database of dysarthric and typical emotional speech in British English, the first of its kind. It is an induced (elicited) emotional database that includes speech recorded in the six basic emotions: "happiness", "sadness", "anger", "surprise", "fear", and "disgust". A "neutral" state has also been recorded as a baseline condition. The dysarthric speech part includes recordings from 4 speakers: one female speaker with dysarthria due to cerebral palsy and 3 speakers with dysarthria due to Parkinson's disease (2 female and 1 male). The typical speech part includes recordings from 21 typical speakers (9 female and 12 male). This paper describes the collection of the database, covering its design, development, technical information related to the data capture, and a description of the data files, and presents the validation methodology. The database was validated subjectively (human performance) and objectively (automatic recognition). The results demonstrate that this database will be a valuable resource for understanding emotion communication by people with dysarthria and useful for research in dysarthric emotion classification. The database is freely available for research purposes under a Creative Commons licence at: https://sites.google.com/sheffield.ac.uk/deed.
Collapse
Affiliation(s)
- Lubna Alhinti
- Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
| | - Stuart Cunningham
- Health Sciences School, University of Sheffield, Sheffield, United Kingdom
- Centre for Assistive Technology and Connected Healthcare (CATCH), Sheffield, United Kingdom
| | - Heidi Christensen
- Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
- Centre for Assistive Technology and Connected Healthcare (CATCH), Sheffield, United Kingdom
| |
Collapse
|
32
|
Johnson KT, Narain J, Quatieri T, Maes P, Picard RW. ReCANVo: A database of real-world communicative and affective nonverbal vocalizations. Sci Data 2023; 10:523. [PMID: 37543663 PMCID: PMC10404278 DOI: 10.1038/s41597-023-02405-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2022] [Accepted: 07/24/2023] [Indexed: 08/07/2023] Open
Abstract
Nonverbal vocalizations, such as sighs, grunts, and yells, are informative expressions within typical verbal speech. Likewise, individuals who produce 0-10 spoken words or word approximations ("minimally speaking" individuals) convey rich affective and communicative information through nonverbal vocalizations even without verbal speech. Yet, despite their rich content, little to no data exists on the vocal expressions of this population. Here, we present ReCANVo: Real-World Communicative and Affective Nonverbal Vocalizations - a novel dataset of non-speech vocalizations labeled by function from minimally speaking individuals. The ReCANVo database contains over 7000 vocalizations spanning communicative and affective functions from eight minimally speaking individuals, along with communication profiles for each participant. Vocalizations were recorded in real-world settings and labeled in real-time by a close family member who knew the communicator well and had access to contextual information while labeling. ReCANVo is a novel database of nonverbal vocalizations from minimally speaking individuals, the largest available dataset of nonverbal vocalizations, and one of the only affective speech datasets collected amidst daily life across contexts.
Collapse
Affiliation(s)
- Kristina T Johnson
- Massachusetts Institute of Technology, MIT Media Lab, Cambridge, MA, USA.
| | - Jaya Narain
- Massachusetts Institute of Technology, MIT Media Lab, Cambridge, MA, USA.
| | - Thomas Quatieri
- Massachusetts Institute of Technology, Lincoln Laboratory, Lexington, MA, USA
| | - Pattie Maes
- Massachusetts Institute of Technology, MIT Media Lab, Cambridge, MA, USA
| | - Rosalind W Picard
- Massachusetts Institute of Technology, MIT Media Lab, Cambridge, MA, USA
| |
Collapse
|
33
|
Pulatov I, Oteniyazov R, Makhmudov F, Cho YI. Enhancing Speech Emotion Recognition Using Dual Feature Extraction Encoders. SENSORS (BASEL, SWITZERLAND) 2023; 23:6640. [PMID: 37514933 PMCID: PMC10383041 DOI: 10.3390/s23146640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/22/2023] [Revised: 07/21/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023]
Abstract
Understanding and identifying emotional cues in human speech is a crucial aspect of human-computer communication. Extracting relevant emotional characteristics from speech and interpreting them computationally form a significant part of this process. The objective of this study was to design a speech emotion recognition framework based on spectrograms and semantic feature encoders, aiming to improve accuracy by addressing shortcomings of existing methods. Two complementary strategies were used to obtain informative features. First, a fully convolutional neural network model was used to encode speech spectrograms. Second, Mel-frequency cepstral coefficient features were extracted and integrated with Speech2Vec for semantic feature encoding. These two types of features were processed separately before being fed into a long short-term memory network and a fully connected layer for further representation, with the aim of improving the model's ability to accurately recognize and interpret emotion from human speech. The proposed approach was evaluated on two databases, RAVDESS and EMO-DB, achieving accuracies of 94.8% and 94.0%, respectively, and outperforming established models on accuracy metrics.
Collapse
Affiliation(s)
- Ilkhomjon Pulatov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - Rashid Oteniyazov
- Department of Telecommunication Engineering, Nukus Branch of Tashkent University of Information Technologies Named after Muhammad Al-Khwarizmi, Nukus 230100, Uzbekistan
| | - Fazliddin Makhmudov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - Young-Im Cho
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| |
Collapse
|
34
|
Ullah R, Asif M, Shah WA, Anjam F, Ullah I, Khurshaid T, Wuttisittikulkij L, Shah S, Ali SM, Alibakhshikenari M. Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer. SENSORS (BASEL, SWITZERLAND) 2023; 23:6212. [PMID: 37448062 DOI: 10.3390/s23136212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/16/2023] [Revised: 05/26/2023] [Accepted: 06/04/2023] [Indexed: 07/15/2023]
Abstract
Speech emotion recognition (SER) is a challenging task in human-computer interaction (HCI) systems. One of the key challenges in speech emotion recognition is to extract the emotional features effectively from a speech utterance. Despite the promising results of recent studies, they generally do not leverage advanced fusion algorithms for the generation of effective representations of emotional features in speech utterances. To address this problem, we describe the fusion of spatial and temporal feature representations of speech emotion by parallelizing convolutional neural networks (CNNs) and a Transformer encoder for SER. We stack two parallel CNNs for spatial feature representation in parallel to a Transformer encoder for temporal feature representation, thereby simultaneously expanding the filter depth and reducing the feature map with an expressive hierarchical feature representation at a lower computational cost. We use the RAVDESS dataset to recognize eight different speech emotions. We augment and intensify the variations in the dataset to minimize model overfitting. Additive White Gaussian Noise (AWGN) is used to augment the RAVDESS dataset. With the spatial and sequential feature representations of CNNs and the Transformer, the SER model achieves 82.31% accuracy for eight emotions on a hold-out dataset. In addition, the SER system is evaluated with the IEMOCAP dataset and achieves 79.42% recognition accuracy for five emotions. Experimental results on the RAVDESS and IEMOCAP datasets show the success of the presented SER system and demonstrate an absolute performance improvement over the state-of-the-art (SOTA) models.
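Additive white Gaussian noise augmentation of the kind described can be implemented in a few lines. The sketch below mixes noise into a signal at a chosen signal-to-noise ratio; the SNR value and the synthetic stand-in signal are illustrative, and the paper's exact augmentation settings are not reproduced.

```python
# Minimal additive white Gaussian noise (AWGN) augmentation at a target SNR.
# The SNR value and the synthetic stand-in waveform are illustrative assumptions.
import numpy as np

def add_awgn(signal, snr_db=15.0, seed=0):
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = np.sin(2 * np.pi * 220.0 * t)        # stand-in waveform; replace with a loaded RAVDESS clip
y_augmented = add_awgn(y, snr_db=15.0)
```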
Collapse
Affiliation(s)
- Rizwan Ullah
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
| | - Muhammad Asif
- Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan
| | - Wahab Ali Shah
- Department of Electrical Engineering, Namal University, Mianwali 42250, Pakistan
| | - Fakhar Anjam
- Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan
| | - Ibrar Ullah
- Department of Electrical Engineering, Kohat Campus, University of Engineering and Technology Peshawar, Kohat 25000, Pakistan
| | - Tahir Khurshaid
- Department of Electrical Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
| | - Lunchakorn Wuttisittikulkij
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
| | - Shashi Shah
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
| | - Syed Mansoor Ali
- Department of Physics and Astronomy, College of Science, King Saud University, P.O. Box 2455, Riyadh 11451, Saudi Arabia
| | - Mohammad Alibakhshikenari
- Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, 28911 Madrid, Spain
| |
Collapse
|
35
|
John V, Kawanishi Y. Progressive Learning of a Multimodal Classifier Accounting for Different Modality Combinations. SENSORS (BASEL, SWITZERLAND) 2023; 23:4666. [PMID: 37430579 DOI: 10.3390/s23104666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Revised: 05/09/2023] [Accepted: 05/09/2023] [Indexed: 07/12/2023]
Abstract
In classification tasks, such as face recognition and emotion recognition, multimodal information is used for accurate classification. Once a multimodal classification model is trained with a set of modalities, it estimates the class label by using the entire modality set. A trained classifier is typically not formulated to perform classification for various subsets of modalities. Thus, the model would be useful and portable if it could be used for any subset of modalities. We refer to this problem as the multimodal portability problem. Moreover, in the multimodal model, classification accuracy is reduced when one or more modalities are missing. We term this problem the missing modality problem. This article proposes a novel deep learning model, termed KModNet, and a novel learning strategy, termed progressive learning, to simultaneously address missing modality and multimodal portability problems. KModNet, formulated with the transformer, contains multiple branches corresponding to different k-combinations of the modality set S. KModNet is trained using a multi-step progressive learning framework, where the k-th step uses a k-modal model to train different branches up to the k-th combination branch. To address the missing modality problem, the training multimodal data is randomly ablated. The proposed learning framework is formulated and validated using two multimodal classification problems: audio-video-thermal person classification and audio-video emotion classification. The two classification problems are validated using the Speaking Faces, RAVDESS, and SAVEE datasets. The results demonstrate that the progressive learning framework enhances the robustness of multimodal classification, even under the conditions of missing modalities, while being portable to different modality subsets.
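Random ablation of training modalities, as described for the missing-modality problem, can be sketched as follows; the drop probability, modality names, and feature shapes are placeholders rather than the paper's settings.

```python
# Sketch of random modality ablation during training, so a multimodal classifier learns to
# cope with missing inputs. Drop probability, modality names, and shapes are placeholders.
import torch

def ablate_modalities(batch, p_drop=0.3):
    """batch: dict of modality name -> tensor. Zero out whole modalities at random."""
    return {name: torch.zeros_like(x) if torch.rand(1).item() < p_drop else x
            for name, x in batch.items()}

batch = {"audio": torch.randn(8, 128), "video": torch.randn(8, 256), "thermal": torch.randn(8, 64)}
ablated = ablate_modalities(batch)
print({name: float(x.abs().sum()) for name, x in ablated.items()})   # zeroed modalities sum to 0
```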
Collapse
Affiliation(s)
- Vijay John
- Guardian Robot Project, RIKEN, Seika-cho, Kyoto 619-0288, Japan
| | | |
Collapse
|
36
|
Razzaq MA, Hussain J, Bang J, Hua CH, Satti FA, Rehman UU, Bilal HSM, Kim ST, Lee S. A Hybrid Multimodal Emotion Recognition Framework for UX Evaluation Using Generalized Mixture Functions. SENSORS (BASEL, SWITZERLAND) 2023; 23:s23094373. [PMID: 37177574 PMCID: PMC10181635 DOI: 10.3390/s23094373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Revised: 04/03/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Multimodal emotion recognition has gained much traction in the fields of affective computing, human-computer interaction (HCI), artificial intelligence (AI), and user experience (UX). There is growing demand to automate the analysis of user emotion for HCI, AI, and UX evaluation applications in order to provide affective services. Emotion information is increasingly being obtained from video, audio, text, or physiological signals, which has led to processing emotions from multiple modalities, usually combined through ensemble-based systems with static weights. Because of limitations such as missing modality data, inter-class variation, and intra-class similarity, an effective weighting scheme is required to improve discrimination between modalities. This article accounts for differences in the importance of individual modalities and assigns them dynamic weights through a more efficient combination process based on generalized mixture (GM) functions. We present a hybrid multimodal emotion recognition (H-MMER) framework that uses a multi-view learning approach for unimodal emotion recognition and introduces multimodal feature-level fusion and decision-level fusion using GM functions. In an experimental study, we evaluated the ability of the proposed framework to model four emotional states (Happiness, Neutral, Sadness, and Anger) and found that most of them can be modeled well with high accuracy using GM functions. The experiments show that the proposed framework can model emotional states with an average accuracy of 98.19%, a significant performance gain over traditional approaches. The overall evaluation results indicate that we can identify emotional states with high accuracy and increase the robustness of an emotion classification system required for UX measurement.
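As a rough illustration of decision-level fusion with dynamic weights, the sketch below weights each modality's class-probability vector by a confidence score derived from its entropy. This is only a stand-in: the paper's generalized mixture (GM) functions are not reproduced, and the probability vectors are invented.

```python
# Stand-in for decision-level fusion with dynamic weights: each modality's class-probability
# vector is weighted by a confidence score (negative-entropy based). Not the paper's GM functions.
import numpy as np

def dynamic_weighted_fusion(prob_vectors, eps=1e-12):
    probs = np.asarray(prob_vectors)                     # shape: (n_modalities, n_classes)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    confidence = np.exp(-entropy)                        # lower entropy -> larger weight
    weights = confidence / confidence.sum()
    return weights @ probs                               # dynamically weighted mixture

p_video = [0.70, 0.10, 0.10, 0.10]   # Happiness, Neutral, Sadness, Anger (invented values)
p_audio = [0.30, 0.30, 0.20, 0.20]
p_text  = [0.55, 0.25, 0.10, 0.10]
print(dynamic_weighted_fusion([p_video, p_audio, p_text]))
```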
Collapse
Affiliation(s)
- Muhammad Asif Razzaq
- Department of Computer Science, Fatima Jinnah Women University, Rawalpindi 46000, Pakistan
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
| | - Jamil Hussain
- Department of Data Science, Sejong University, Seoul 30019, Republic of Korea
| | - Jaehun Bang
- Hanwha Corporation/Momentum, Hanwha Building, 86 Cheonggyecheon-ro, Jung-gu, Seoul 04541, Republic of Korea
| | - Cam-Hao Hua
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
| | - Fahad Ahmed Satti
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
- Department of Computing, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
| | - Ubaid Ur Rehman
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
- Department of Computing, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
| | - Hafiz Syed Muhammad Bilal
- Department of Computing, School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad 44000, Pakistan
| | - Seong Tae Kim
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
| | - Sungyoung Lee
- Ubiquitous Computing Lab, Department of Computer Science and Engineering, Kyung Hee University, Seocheon-dong, Giheung-gu, Yongin-si 17104, Republic of Korea
| |
Collapse
|
37
|
Heffer N, Dennie E, Ashwin C, Petrini K, Karl A. Multisensory processing of emotional cues predicts intrusive memories after virtual reality trauma. VIRTUAL REALITY 2023; 27:2043-2057. [PMID: 37614716 PMCID: PMC10442266 DOI: 10.1007/s10055-023-00784-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/17/2022] [Accepted: 03/03/2023] [Indexed: 08/25/2023]
Abstract
Research has shown that high trait anxiety can alter multisensory processing of threat cues (by amplifying integration of angry faces and voices); however, it remains unknown whether differences in multisensory processing play a role in the psychological response to trauma. This study examined the relationship between multisensory emotion processing and intrusive memories over seven days following exposure to an analogue trauma in a sample of 55 healthy young adults. We used an adapted version of the trauma film paradigm, where scenes showing a car accident trauma were presented using virtual reality, rather than a conventional 2D film. Multisensory processing was assessed prior to the trauma simulation using a forced choice emotion recognition paradigm with happy, sad and angry voice-only, face-only, audiovisual congruent (face and voice expressed matching emotions) and audiovisual incongruent expressions (face and voice expressed different emotions). We found that increased accuracy in recognising anger (but not happiness and sadness) in the audiovisual condition relative to the voice- and face-only conditions was associated with more intrusions following VR trauma. Despite previous results linking trait anxiety and intrusion development, no significant influence of trait anxiety on intrusion frequency was observed. Enhanced integration of threat-related information (i.e. angry faces and voices) could lead to overly threatening appraisals of stressful life events and result in greater intrusion development after trauma. Supplementary Information The online version contains supplementary material available at 10.1007/s10055-023-00784-1.
Collapse
Affiliation(s)
- Naomi Heffer
- Department of Psychology, University of Bath, Claverton Down, Bath, BA2 7AY UK
- School of Sciences, Bath Spa University, Bath, UK
| | - Emma Dennie
- Mood Disorders Centre, University of Exeter, Exeter, UK
| | - Chris Ashwin
- Department of Psychology, University of Bath, Claverton Down, Bath, BA2 7AY UK
- Centre for Applied Autism Research (CAAR), Bath, UK
| | - Karin Petrini
- Department of Psychology, University of Bath, Claverton Down, Bath, BA2 7AY UK
- The Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA), Bath, UK
| | - Anke Karl
- Mood Disorders Centre, University of Exeter, Exeter, UK
| |
Collapse
|
38
|
Tanko D, Demir FB, Dogan S, Sahin SE, Tuncer T. Automated speech emotion polarization for a distance education system based on orbital local binary pattern and an appropriate sub-band selection technique. MULTIMEDIA TOOLS AND APPLICATIONS 2023:1-18. [PMID: 37362680 PMCID: PMC10068203 DOI: 10.1007/s11042-023-14648-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/15/2021] [Revised: 08/02/2022] [Accepted: 02/03/2023] [Indexed: 06/28/2023]
Abstract
Distance education was widely adopted by many institutions of learning during the Covid-19 pandemic. To measure the effectiveness of this mode of teaching, it is essential to evaluate the performance of lecturers, and an automated speech emotion recognition model is one solution. This research aims to develop an accurate speech emotion recognition model that assesses lecturers'/instructors' emotional state during lecture presentations. To achieve this aim, a new speech emotion dataset is collected and an automated speech emotion recognition (SER) model is proposed. The presented SER model contains three main phases: (i) feature extraction using a multi-level discrete wavelet transform (DWT) and a one-dimensional orbital local binary pattern (1D-OLBP), (ii) feature selection using neighborhood component analysis (NCA), and (iii) classification using a support vector machine (SVM) with ten-fold cross-validation. The proposed 1D-OLBP and NCA-based model is tested on the collected dataset, which contains three emotional states across 7101 sound segments, and achieves a classification accuracy of 93.40%. Moreover, the proposed architecture has been tested on three publicly available speech emotion recognition datasets to highlight its general classification ability. We reached classification accuracies of over 70% on all three public datasets, demonstrating the success of this model.
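Two of the ingredients named in the pipeline, multi-level DWT sub-bands and a one-dimensional local binary pattern, can be sketched with PyWavelets and NumPy as below. The orbital variant (1D-OLBP), the sub-band selection rule, and the NCA/SVM stages are not reproduced; the wavelet, decomposition level, and neighbourhood radius are assumptions.

```python
# Sketch of multi-level DWT sub-bands plus a basic 1-D local binary pattern histogram.
# The paper's orbital LBP variant, sub-band selection, and NCA/SVM stages are not reproduced.
import numpy as np
import pywt

def lbp_1d_histogram(x, radius=4):
    """Compare each sample with its 2*radius neighbours and histogram the resulting codes."""
    n_bins = 2 ** (2 * radius)
    codes = []
    for i in range(radius, len(x) - radius):
        neighbours = np.concatenate([x[i - radius:i], x[i + 1:i + 1 + radius]])
        bits = (neighbours >= x[i]).astype(int)
        codes.append(int("".join(map(str, bits)), 2))
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)

sr = 16000
signal = np.sin(2 * np.pi * 200.0 * np.linspace(0, 1, sr))      # stand-in speech segment
sub_bands = pywt.wavedec(signal, "db4", level=4)                 # approximation + detail sub-bands
features = np.concatenate([lbp_1d_histogram(band) for band in sub_bands])
```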
Collapse
Affiliation(s)
- Dahiru Tanko
- Department of Digital Forensics Engineering, College of Technology, Firat University, Elazig, Turkey
| | - Fahrettin Burak Demir
- Deparment of Software Engineering, Faculty of Engineering and Natural Sciences, Bandirma Onyedi Eylul University, Bandirma, Turkey
| | - Sengul Dogan
- Department of Digital Forensics Engineering, College of Technology, Firat University, Elazig, Turkey
| | - Sakir Engin Sahin
- Department of Computer Technologies, Arapgir Vocational School, Malatya Turgut Ozal University, Malatya, Turkey
| | - Turker Tuncer
- Department of Digital Forensics Engineering, College of Technology, Firat University, Elazig, Turkey
| |
Collapse
|
39
|
Gong B, Li N, Li Q, Yan X, Chen J, Li L, Wu X, Wu C. The Mandarin Chinese auditory emotions stimulus database: A validated set of Chinese pseudo-sentences. Behav Res Methods 2023; 55:1441-1459. [PMID: 35641682 DOI: 10.3758/s13428-022-01868-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/29/2022] [Indexed: 11/08/2022]
Abstract
Emotional prosody is fully embedded in language and can be influenced by the linguistic properties of a specific language. Considering the limitations of existing Chinese auditory stimulus database studies, we developed and validated an emotional auditory stimuli database composed of Chinese pseudo-sentences, recorded by six professional actors in Mandarin Chinese. Emotional expressions included happiness, sadness, anger, fear, disgust, pleasant surprise, and neutrality. All emotional categories were vocalized into two types of sentence patterns, declarative and interrogative. In addition, all emotional pseudo-sentences, except for neutral, were vocalized at two levels of emotional intensity: normal and strong. Each recording was validated with 40 native Chinese listeners in terms of the recognition accuracy of the intended emotion portrayal; finally, 4361 pseudo-sentence stimuli were included in the database. Validation of the database using a forced-choice recognition paradigm revealed high rates of emotional recognition accuracy. The detailed acoustic attributes of vocalization were provided and connected to the emotion recognition rates. This corpus could be a valuable resource for researchers and clinicians to explore the behavioral and neural mechanisms underlying emotion processing of the general population and emotional disturbances in neurological, psychiatric, and developmental disorders. The Mandarin Chinese auditory emotion stimulus database is available at the Open Science Framework ( https://osf.io/sfbm6/?view_only=e22a521e2a7d44c6b3343e11b88f39e3 ).
Collapse
Affiliation(s)
- Bingyan Gong
- School of Nursing, Peking University Health Science Center, Room 510, 38 Xueyuan Road, Haidian District, Beijing, 100191, China
| | - Na Li
- Theatre Pedagogy Department, Central Academy of Drama, Beijing, 100710, China
| | - Qiuhong Li
- School of Nursing, Peking University Health Science Center, Room 510, 38 Xueyuan Road, Haidian District, Beijing, 100191, China
| | - Xinyuan Yan
- School of Computing, University of Utah, Salt Lake City, UT, USA
| | - Jing Chen
- Department of Machine Intelligence, Peking University, 5 Yiheyuan Road, Haidian District, Beijing, 100871, China
- Speech and Hearing Research Center, Key Laboratory on Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China
| | - Liang Li
- School of Psychological and Cognitive Sciences, Peking University, Beijing, 100871, China
| | - Xihong Wu
- Department of Machine Intelligence, Peking University, 5 Yiheyuan Road, Haidian District, Beijing, 100871, China.
- Speech and Hearing Research Center, Key Laboratory on Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China.
| | - Chao Wu
- School of Nursing, Peking University Health Science Center, Room 510, 38 Xueyuan Road, Haidian District, Beijing, 100191, China.
| |
Collapse
|
40
|
Reece A, Cooney G, Bull P, Chung C, Dawson B, Fitzpatrick C, Glazer T, Knox D, Liebscher A, Marin S. The CANDOR corpus: Insights from a large multimodal dataset of naturalistic conversation. SCIENCE ADVANCES 2023; 9:eadf3197. [PMID: 37000886 PMCID: PMC10065445 DOI: 10.1126/sciadv.adf3197] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/13/2022] [Accepted: 03/02/2023] [Indexed: 06/19/2023]
Abstract
People spend a substantial portion of their lives engaged in conversation, and yet, our scientific understanding of conversation is still in its infancy. Here, we introduce a large, novel, and multimodal corpus of 1656 conversations recorded in spoken English. This 7+ million word, 850-hour corpus totals more than 1 terabyte of audio, video, and transcripts, with moment-to-moment measures of vocal, facial, and semantic expression, together with an extensive survey of speakers' postconversation reflections. By taking advantage of the considerable scope of the corpus, we explore many examples of how this large-scale public dataset may catalyze future research, particularly across disciplinary boundaries, as scholars from a variety of fields appear increasingly interested in the study of conversation.
Collapse
Affiliation(s)
| | - Gus Cooney
- University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Peter Bull
- DrivenData Inc., Berkeley, CA, 94709, USA
| | | | | | | | | | - Dean Knox
- University of Pennsylvania, Philadelphia, PA 19104, USA
| | | | | |
Collapse
|
41
|
Cronin SL, Lipp OV, Marinovic W. Pupil Dilation During Encoding, But Not Type of Auditory Stimulation, Predicts Recognition Success in Face Memory. Biol Psychol 2023; 178:108547. [PMID: 36972756 DOI: 10.1016/j.biopsycho.2023.108547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2022] [Revised: 03/19/2023] [Accepted: 03/24/2023] [Indexed: 03/29/2023]
Abstract
We encounter and process information from multiple sensory modalities in our daily lives, and research suggests that learning can be more efficient when contexts are multisensory. In this study, we were interested in whether face identity recognition memory might be improved in multisensory learning conditions, and we explored associated changes in pupil dilation during encoding and recognition. In two studies, participants completed old/new face recognition tasks in which visual face stimuli were presented in the context of sounds. Faces were learnt alongside no sound, low-arousal sounds (Experiment 1), high-arousal non-face-relevant sounds, or high-arousal face-relevant sounds (Experiment 2). We predicted that the presence of sounds during encoding would improve later recognition accuracy; however, the results did not support this, with no effect of sound condition on memory. Pupil dilation, however, was found to predict later successful recognition both at encoding and during recognition. While these results do not provide support for the notion that face learning is improved under multisensory conditions relative to unisensory conditions, they do suggest that pupillometry may be a useful tool to further explore face identity learning and recognition.
Collapse
Affiliation(s)
- Sophie L Cronin
- School of Population Health, Discipline of Psychology, Curtin University, Perth, Western Australia
| | - Ottmar V Lipp
- School of Psychology and Counselling, Queensland University of Technology, Brisbane, Queensland, Australia
| | - Welber Marinovic
- School of Population Health, Discipline of Psychology, Curtin University, Perth, Western Australia.
| |
Collapse
|
42
|
Hajek P, Munk M. Speech emotion recognition and text sentiment analysis for financial distress prediction. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08470-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/29/2023]
Abstract
In recent years, there has been an increasing interest in text sentiment analysis and speech emotion recognition in finance due to their potential to capture the intentions and opinions of corporate stakeholders, such as managers and investors. A considerable performance improvement in forecasting company financial performance was achieved by taking textual sentiment into account. However, far too little attention has been paid to managerial emotional states and their potential contribution to financial distress prediction. This study seeks to address this problem by proposing a deep learning architecture that uniquely combines managerial emotional states extracted using speech emotion recognition with FinBERT-based sentiment analysis of earnings conference call transcripts. Thus, the obtained information is fused with traditional financial indicators to achieve a more accurate prediction of financial distress. The proposed model is validated using 1278 earnings conference calls of the 40 largest US companies. The findings of this study provide evidence on the essential role of managerial emotions in predicting financial distress, even when compared with sentiment indicators obtained from text. The experimental results also demonstrate the high accuracy of the proposed model compared with state-of-the-art prediction models.
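A schematic late-fusion sketch of the general idea, assuming three precomputed feature blocks per conference call (speech-emotion probabilities, text-sentiment scores, and traditional financial ratios); the placeholder arrays, their dimensions, and the classifier choice are illustrative assumptions rather than the authors' architecture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_calls = 1278  # number of earnings calls in the study

# Hypothetical precomputed features per call (placeholders for illustration only):
speech_emotions = rng.random((n_calls, 7))     # e.g. probabilities over 7 emotion classes
text_sentiment = rng.random((n_calls, 3))      # e.g. positive/neutral/negative sentiment scores
financial_ratios = rng.random((n_calls, 10))   # e.g. leverage, liquidity, profitability ratios
distressed = rng.integers(0, 2, n_calls)       # binary financial-distress label

# Late fusion: concatenate the modality-specific features and fit a single classifier.
X = np.hstack([speech_emotions, text_sentiment, financial_ratios])
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, distressed, cv=5).mean())  # random placeholders, so roughly chance
```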
Collapse
|
43
|
Olatinwo DD, Abu-Mahfouz A, Hancke G, Myburgh H. IoT-Enabled WBAN and Machine Learning for Speech Emotion Recognition in Patients. SENSORS (BASEL, SWITZERLAND) 2023; 23:2948. [PMID: 36991659 PMCID: PMC10056097 DOI: 10.3390/s23062948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/05/2023] [Revised: 02/27/2023] [Accepted: 03/03/2023] [Indexed: 06/19/2023]
Abstract
The Internet of things (IoT)-enabled wireless body area network (WBAN) is an emerging technology that combines medical devices, wireless devices, and non-medical devices for healthcare management applications. Speech emotion recognition (SER) is an active research field in the healthcare domain and machine learning; it is a technique that can be used to automatically identify speakers' emotions from their speech. However, SER systems, especially in the healthcare domain, face several challenges, such as low prediction accuracy, high computational complexity, delays in real-time prediction, and the difficulty of identifying appropriate features from speech. Motivated by these research gaps, we propose an emotion-aware IoT-enabled WBAN system within the healthcare framework in which data processing and long-range data transmission are performed by an edge AI system, enabling real-time prediction of patients' speech emotions and capturing changes in emotion before and after treatment. Additionally, we investigated the effectiveness of different machine learning and deep learning algorithms in terms of classification performance, feature extraction methods, and normalization methods. We developed a hybrid deep learning model, i.e., a convolutional neural network (CNN) combined with bidirectional long short-term memory (BiLSTM), and a regularized CNN model. We combined the models with different optimization strategies and regularization techniques to improve prediction accuracy, reduce generalization error, and reduce the computational complexity of the neural networks in terms of computational time, power, and space. Different experiments were performed to check the efficiency and effectiveness of the proposed machine learning and deep learning algorithms. The proposed models were compared with a related existing model using standard performance metrics such as prediction accuracy, precision, recall, F1 score, the confusion matrix, and the differences between actual and predicted values. The experimental results showed that one of the proposed models outperformed the existing model with an accuracy of about 98%.
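The hybrid CNN-BiLSTM idea can be sketched in a few lines of Keras; the input shape (MFCC frames by coefficients), the layer sizes, and the seven-class output are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf

# Sketch of a CNN + BiLSTM speech-emotion classifier over MFCC sequences.
# Input shape (200 frames x 40 MFCCs) and layer sizes are assumptions for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200, 40)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),  # local spectral patterns
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),       # temporal context in both directions
    tf.keras.layers.Dropout(0.3),                                  # regularisation to curb over-fitting
    tf.keras.layers.Dense(7, activation="softmax"),                # e.g. six emotions plus neutral
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```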
Collapse
Affiliation(s)
- Damilola D. Olatinwo
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
| | - Adnan Abu-Mahfouz
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
- Council for Scientific and Industrial Research (CSIR), Pretoria 0184, South Africa
| | - Gerhard Hancke
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Hermanus Myburgh
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
| |
Collapse
|
44
|
Aspect-Based Sentiment Analysis of Customer Speech Data Using Deep Convolutional Neural Network and BiLSTM. Cognit Comput 2023. [DOI: 10.1007/s12559-023-10127-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/08/2023]
|
45
|
van Rijn P, Larrouy-Maestri P. Modelling individual and cross-cultural variation in the mapping of emotions to speech prosody. Nat Hum Behav 2023; 7:386-396. [PMID: 36646838 PMCID: PMC10038802 DOI: 10.1038/s41562-022-01505-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2021] [Accepted: 11/28/2022] [Indexed: 01/18/2023]
Abstract
The existence of a mapping between emotions and speech prosody is commonly assumed. We propose a Bayesian modelling framework to analyse this mapping. Our models are fitted to a large collection of intended emotional prosody comprising more than 3,000 minutes of recordings. Our descriptive study reveals that the mapping within corpora is relatively constant, whereas the mapping varies across corpora. To account for this heterogeneity, we fit a series of increasingly complex models. Model comparison reveals that models taking into account mapping differences across countries, languages, sexes and individuals outperform models that only assume a global mapping. Further analysis shows that differences across individuals, cultures and sexes contribute more to the model prediction than a shared global mapping. Our models, which can be explored in an online interactive visualization, offer a description of the mapping between acoustic features and emotions in prosody.
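A drastically simplified, non-Bayesian stand-in for the modelling idea: compare a model with a single global emotion-to-prosody mapping against one that lets the mapping vary by speaker within corpus. The file, column names, and the choice of mean F0 as the prosodic feature are assumptions, and the mixed model here is only a rough frequentist analogue of the paper's hierarchical Bayesian models.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format table: one row per utterance with its mean F0,
# the intended emotion, and speaker / corpus identifiers (column names assumed).
df = pd.read_csv("prosody_features.csv")

# Global mapping only: emotion as a fixed effect shared by everyone.
global_model = smf.ols("mean_f0 ~ C(emotion)", data=df).fit()

# Varying mapping: random emotion effects for speakers nested in corpora.
df["speaker_in_corpus"] = df["corpus"].astype(str) + ":" + df["speaker"].astype(str)
varying_model = smf.mixedlm("mean_f0 ~ C(emotion)", data=df,
                            groups=df["speaker_in_corpus"],
                            re_formula="~C(emotion)").fit()

# Higher log-likelihood means a better in-sample fit; a full comparison
# would also penalise the extra parameters of the varying-mapping model.
print(global_model.llf, varying_model.llf)
```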
Collapse
Affiliation(s)
- Pol van Rijn
- Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany.
| | - Pauline Larrouy-Maestri
- Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany
- Max Planck-NYU Center for Language, Music, and Emotion, New York, NY, USA
| |
Collapse
|
46
|
Xia W, Zhang Y, Yang Y, Xue JH, Zhou B, Yang MH. GAN Inversion: A Survey. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2023; 45:3121-3138. [PMID: 37022469 DOI: 10.1109/tpami.2022.3181070] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model so that the image can be faithfully reconstructed from the inverted code by the generator. As an emerging technique to bridge the real and fake image domains, GAN inversion plays an essential role in enabling pretrained GAN models, such as StyleGAN and BigGAN, for applications of real image editing. Moreover, GAN inversion interprets GAN's latent space and examines how realistic images can be generated. In this paper, we provide a survey of GAN inversion with a focus on its representative algorithms and its applications in image restoration and image manipulation. We further discuss the trends and challenges for future research. A curated list of GAN inversion methods, datasets, and other related information can be found at https://github.com/weihaox/awesome-gan-inversion.
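The core optimization-based inversion loop described above can be sketched in a few lines of PyTorch; the generator, its latent dimensionality, and the plain pixel-wise loss are placeholders here (practical systems typically add perceptual losses and encoder-based initialization), so this is a sketch of the technique rather than any particular method from the survey.

```python
import torch

# Minimal optimization-based GAN-inversion sketch. Assumptions: `generator` is a
# pretrained, frozen PyTorch generator mapping a latent vector to an image, and
# `target` is the real image tensor to invert; both are hypothetical inputs.
def invert(generator, target, latent_dim=512, steps=500, lr=0.05):
    generator.eval()                                           # generator weights stay fixed
    z = torch.randn(1, latent_dim, requires_grad=True)         # initial latent code
    optimizer = torch.optim.Adam([z], lr=lr)                   # only z is updated
    for _ in range(steps):
        optimizer.zero_grad()
        recon = generator(z)                                   # G(z): candidate reconstruction
        loss = torch.nn.functional.mse_loss(recon, target)     # pixel reconstruction loss
        loss.backward()
        optimizer.step()
    return z.detach()                                          # inverted code, reusable for editing
```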
Collapse
|
47
|
Mustaqeem, El Saddik A, Alotaibi FS, Pham NT. AAD-Net: Advanced end-to-end speech signal system for human emotion detection & recognition using attention-based deep echo state network. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/03/2023]
|
48
|
Ahmad M, Sanawar S, Alfandi O, Qadri SF, Saeed IA, Khan S, Hayat B, Ahmad A. Facial expression recognition using lightweight deep learning modeling. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:8208-8225. [PMID: 37161193 DOI: 10.3934/mbe.2023357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Facial expression is a form of communication and is useful in many areas of computer vision, including intelligent visual surveillance, human-robot interaction, and human behavior analysis. A deep learning approach is presented to classify happy, sad, angry, fearful, contemptuous, surprised, and disgusted expressions. Accurate detection and classification of human facial expressions is a challenging task in image processing because of complicating factors such as changes in illumination, occlusion, noise, and the risk of over-fitting. A stacked sparse auto-encoder for facial expression recognition (SSAE-FER) is used for unsupervised pre-training followed by supervised fine-tuning. SSAE-FER automatically extracts features from input images, and a softmax classifier is used to classify the expressions. Our method achieved an accuracy of 92.50% on the JAFFE dataset and 99.30% on the CK+ dataset, performing well compared with other methods in the same domain.
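A compact sketch of the stacked-sparse-autoencoder-plus-softmax recipe in Keras: each autoencoder is pre-trained greedily with an L1 activity penalty (the "sparse" part), then the encoders are stacked under a softmax head and fine-tuned. The image size, layer widths, penalty strength, and training schedule are assumptions, not the paper's settings, and the random arrays stand in for real face data.

```python
import numpy as np
import tensorflow as tf

L1 = tf.keras.regularizers.l1(1e-5)  # sparsity penalty on hidden activations

def pretrain_autoencoder(x, n_hidden):
    """Greedy unsupervised pre-training of one sparse autoencoder layer."""
    inp = tf.keras.Input(shape=(x.shape[1],))
    code = tf.keras.layers.Dense(n_hidden, activation="relu", activity_regularizer=L1)(inp)
    recon = tf.keras.layers.Dense(x.shape[1], activation="sigmoid")(code)
    ae = tf.keras.Model(inp, recon)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(x, x, epochs=5, batch_size=64, verbose=0)   # learn to reconstruct the input
    encoder = tf.keras.Model(inp, code)
    return encoder, encoder.predict(x, verbose=0)

# Placeholder data standing in for flattened 48x48 grey-scale face images and 7 labels.
x_train = np.random.rand(1000, 48 * 48).astype("float32")
y_train = np.random.randint(0, 7, 1000)

enc1, h1 = pretrain_autoencoder(x_train, 256)   # first sparse AE on raw pixels
enc2, h2 = pretrain_autoencoder(h1, 128)        # second sparse AE on first-layer codes

# Stack the pre-trained encoders under a softmax classifier and fine-tune end to end.
clf = tf.keras.Sequential([enc1, enc2, tf.keras.layers.Dense(7, activation="softmax")])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
clf.fit(x_train, y_train, epochs=5, batch_size=64, verbose=0)
```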
Collapse
Affiliation(s)
- Mubashir Ahmad
- Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Tobe Camp, Abbottabad-22060, Pakistan
- Department of Computer Science, the University of Lahore, Sargodha Campus 40100, Pakistan
| | - Saira Sanawar
- Department of Computer Science, the University of Lahore, Sargodha Campus 40100, Pakistan
| | - Omar Alfandi
- College of Technological Innovation, Zayed University, Abu Dhabi, UAE
| | - Syed Furqan Qadri
- Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou 311121, China
| | - Iftikhar Ahmed Saeed
- Department of Computer Science, the University of Lahore, Sargodha Campus 40100, Pakistan
| | - Salabat Khan
- College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Bashir Hayat
- Department of Computer Science, Institute of Management Sciences, Peshawar, Pakistan
| | - Arshad Ahmad
- Department of IT & CS, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Haripur 22620, Pakistan
| |
Collapse
|
49
|
Pucci F, Fedele P, Dimitri GM. Speech emotion recognition with artificial intelligence for contact tracing in the COVID‐19 pandemic. COGNITIVE COMPUTATION AND SYSTEMS 2023. [DOI: 10.1049/ccs2.12076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/10/2023] Open
Affiliation(s)
- Francesco Pucci
- DIISM, Università degli Studi di Siena, Siena, Italy
- Blu Pantheon, Siena, Italy
| | | | | |
Collapse
|
50
|
Leung FYN, Stojanovik V, Micai M, Jiang C, Liu F. Emotion recognition in autism spectrum disorder across age groups: A cross-sectional investigation of various visual and auditory communicative domains. Autism Res 2023; 16:783-801. [PMID: 36727629 DOI: 10.1002/aur.2896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2022] [Accepted: 01/19/2023] [Indexed: 02/03/2023]
Abstract
Previous research on emotion processing in autism spectrum disorder (ASD) has predominantly focused on human faces and speech prosody, with little attention paid to other domains such as nonhuman faces and music. In addition, emotion processing in different domains was often examined in separate studies, making it challenging to evaluate whether emotion recognition difficulties in ASD generalize across domains and age cohorts. The present study investigated: (i) the recognition of basic emotions (angry, scared, happy, and sad) across four domains (human faces, face-like objects, speech prosody, and song) in 38 autistic and 38 neurotypical (NT) children, adolescents, and adults in a forced-choice labeling task, and (ii) the impact of pitch and visual processing profiles on this ability. Results showed similar recognition accuracy between the ASD and NT groups across age groups for all domains and emotion types, although processing speed was slower in the ASD group than in the NT group. Age-related differences were seen in both groups, which varied by emotion, domain, and performance index. Visual processing style was associated with facial emotion recognition speed, and pitch perception ability was associated with auditory emotion recognition, in the NT group but not in the ASD group. These findings suggest that autistic individuals may employ different emotion processing strategies compared to NT individuals, and that emotion recognition difficulties, as manifested by slower response times, may result from a generalized rather than a domain-specific underlying mechanism that governs emotion recognition processes across domains in ASD.
Collapse
Affiliation(s)
- Florence Y N Leung
- School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
- Department of Psychology, University of Bath, Bath, UK
| | - Vesna Stojanovik
- School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
| | - Martina Micai
- School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
| | - Cunmei Jiang
- Music College, Shanghai Normal University, Shanghai, China
| | - Fang Liu
- School of Psychology and Clinical Language Sciences, University of Reading, Reading, UK
| |
Collapse
|