1. Avola D, Cinque L, Di Mambro A, Fagioli A, Marini MR, Pannone D, Fanini B, Foresti GL. Spatio-Temporal Image-Based Encoded Atlases for EEG Emotion Recognition. Int J Neural Syst 2024; 34:2450024. PMID: 38533631. DOI: 10.1142/s0129065724500242.
Abstract
Emotion recognition plays an essential role in human-human interaction since it is key to understanding the emotional states and reactions of human beings when they are subject to events and engagements in everyday life. Moving towards human-computer interaction, the study of emotions becomes fundamental because it is at the basis of the design of advanced systems to support a broad spectrum of application areas, including forensic, rehabilitative, educational, and many others. An effective method for discriminating emotions is based on ElectroEncephaloGraphy (EEG) data analysis, which is used as input for classification systems. Collecting brain signals on several channels and for a wide range of emotions produces cumbersome datasets that are hard to manage, transmit, and use in varied applications. In this context, the paper introduces the Empátheia system, which explores a different EEG representation by encoding EEG signals into images prior to their classification. In particular, the proposed system extracts spatio-temporal image encodings, or atlases, from EEG data through the Processing and transfeR of Interaction States and Mappings through Image-based eNcoding (PRISMIN) framework, thus obtaining a compact representation of the input signals. The atlases are then classified through the Empátheia architecture, which comprises branches based on convolutional, recurrent, and transformer models designed and tuned to capture the spatial and temporal aspects of emotions. Extensive experiments were conducted on the public Shanghai Jiao Tong University (SJTU) Emotion EEG Dataset (SEED), where the proposed system significantly reduced the data size while retaining high performance. The results obtained highlight the effectiveness of the proposed approach and suggest new avenues for data representation in emotion recognition from EEG signals.
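The multi-branch design described in the abstract can be illustrated with a minimal PyTorch sketch: a small CNN extracts spatial features from each atlas image, while an LSTM and a Transformer encoder read the resulting sequence before a fused classification head. All layer sizes, the fusion scheme, and the class name are illustrative assumptions, not the Empátheia/PRISMIN implementation; only the three-class output matches SEED's positive/neutral/negative labels.

```python
# Minimal sketch of a multi-branch classifier over image-encoded EEG "atlases":
# CNN for spatial cues, LSTM and Transformer encoder for temporal cues.
# Shapes and fusion are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class MultiBranchAtlasClassifier(nn.Module):
    def __init__(self, feat_dim=128, n_classes=3):
        super().__init__()
        # Spatial branch: small CNN applied to each atlas image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Temporal branches over the per-frame CNN features.
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                               batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, x):                              # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)   # per-frame features
        lstm_out, _ = self.lstm(f)                     # recurrent temporal view
        trans_out = self.transformer(f)                # attention temporal view
        fused = torch.cat([lstm_out[:, -1], trans_out.mean(dim=1)], dim=-1)
        return self.head(fused)

if __name__ == "__main__":
    model = MultiBranchAtlasClassifier()
    dummy = torch.randn(2, 16, 1, 64, 64)   # batch of 2 atlas sequences
    print(model(dummy).shape)               # -> torch.Size([2, 3])
```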
Affiliation(s)
- Danilo Avola: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Luigi Cinque: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Angelo Di Mambro: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Alessio Fagioli: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Marco Raoul Marini: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Daniele Pannone: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Bruno Fanini: Institute of Heritage Science, National Research Council, Area della Ricerca Roma 1, SP35d, 9, Montelibretti 00010, Italy
- Gian Luca Foresti: Department of Computer Science, Mathematics and Physics, University of Udine, Via delle Scienze 206, Udine 33100, Italy
2. Vicente-Querol MA, Fernández-Caballero A, González P, González-Gualda LM, Fernández-Sotos P, Molina JP, García AS. Effect of Action Units, Viewpoint and Immersion on Emotion Recognition Using Dynamic Virtual Faces. Int J Neural Syst 2023; 33:2350053. PMID: 37746831. DOI: 10.1142/s0129065723500533.
Abstract
Facial affect recognition is a critical skill in human interactions that is often impaired in psychiatric disorders. To address this challenge, tests have been developed to measure and train this skill. Recently, virtual human (VH) and virtual reality (VR) technologies have emerged as novel tools for this purpose. This study investigates the unique contributions of different factors in the communication and perception of emotions conveyed by VHs. Specifically, it examines the effects of the use of action units (AUs) in virtual faces, the positioning of the VH (frontal or mid-profile), and the level of immersion in the VR environment (desktop screen versus immersive VR). Thirty-six healthy subjects participated in each condition. Dynamic virtual faces (DVFs), VHs with facial animations, were used to represent the six basic emotions and the neutral expression. The results highlight the important role of the accurate implementation of AUs in virtual faces for emotion recognition. Furthermore, it is observed that frontal views outperform mid-profile views in both test conditions, while immersive VR shows a slight improvement in emotion recognition. This study provides novel insights into the influence of these factors on emotion perception and advances the understanding and application of these technologies for effective facial emotion recognition training.
Affiliation(s)
- Miguel A Vicente-Querol: Instituto de Investigación en Informática, Universidad de Castilla-La Mancha, Albacete 02071, Spain
- Antonio Fernández-Caballero: Instituto de Investigación en Informática, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Biomedical Research Networking Centre in Mental Health, Instituto de Salud Carlos III, Madrid 28029, Spain
- Pascual González: Instituto de Investigación en Informática, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Biomedical Research Networking Centre in Mental Health, Instituto de Salud Carlos III, Madrid 28029, Spain
- Luz M González-Gualda: Servicio de Salud Mental, Complejo Hospitalario Universitario de Albacete, Albacete 02004, Spain
- Patricia Fernández-Sotos: Biomedical Research Networking Centre in Mental Health, Instituto de Salud Carlos III, Madrid 28029, Spain; Servicio de Salud Mental, Complejo Hospitalario Universitario de Albacete, Albacete 02004, Spain
- José P Molina: Instituto de Investigación en Informática, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete 02071, Spain
- Arturo S García: Instituto de Investigación en Informática, Universidad de Castilla-La Mancha, Albacete 02071, Spain; Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Albacete 02071, Spain
3. Abdullahi SB, Bature ZA, Gabralla LA, Chiroma H. Lie Recognition with Multi-Modal Spatial–Temporal State Transition Patterns Based on Hybrid Convolutional Neural Network–Bidirectional Long Short-Term Memory. Brain Sci 2023; 13:555. PMID: 37190520. DOI: 10.3390/brainsci13040555.
Abstract
Recognition of lying is a more complex cognitive process than truth-telling because of the presence of involuntary cognitive cues that are useful for lie recognition. Researchers have proposed different approaches in the literature to recognize lies from handcrafted and/or automatically extracted features during court trials and police interrogations. Unfortunately, due to the cognitive complexity and the scarcity of involuntary cues related to lying, the performance of these approaches suffers and their generalization ability is limited. To improve performance, this study proposed state transition patterns based on hand and body motions and eye blinking features extracted from real-life court trial videos. Each video frame is represented according to a threshold value computed among neighboring pixels to extract spatial–temporal state transition patterns (STSTP) of the hand and face poses as involuntary cues, using fully connected convolutional neural network layers optimized with ResNet-152 weights. In addition, this study computed an eye aspect ratio model to obtain eye blinking features. These features were fused into a single multi-modal STSTP feature model, which was classified using a bidirectional long short-term memory network with enhanced weights. The proposed approach was evaluated by comparing its performance with current state-of-the-art methods, and it was found to improve lie detection performance.
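As a concrete illustration of the eye-blinking feature mentioned above, the following NumPy sketch computes the standard eye aspect ratio (EAR) from the common 6-point eye landmark layout and thresholds it into a per-frame blink indicator. The 0.2 threshold and the function names are assumptions for illustration; the paper's customized STSTP pipeline is not reproduced here.

```python
# Eye aspect ratio (EAR) blink feature from 6 eye landmarks p1..p6
# (outer corner -> inner corner). Threshold is an illustrative assumption.
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: (6, 2) array of 2D landmarks for one eye."""
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distance p2-p6
    v2 = np.linalg.norm(eye[2] - eye[4])   # vertical distance p3-p5
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance p1-p4
    return (v1 + v2) / (2.0 * h)

def blink_sequence(landmarks: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """landmarks: (T, 6, 2) per-frame eye landmarks -> binary blink indicator."""
    ears = np.array([eye_aspect_ratio(frame) for frame in landmarks])
    return (ears < threshold).astype(np.float32)   # 1 where the eye is closed
```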
4. Avola D, Cascio M, Cinque L, Fagioli A, Foresti GL. Affective Action and Interaction Recognition by Multi-view Representation Learning from Handcrafted Low-level Skeleton Features. Int J Neural Syst 2022; 32:2250040. DOI: 10.1142/s012906572250040x.
5. De Lope J, Graña M. A Hybrid Time-Distributed Deep Neural Architecture for Speech Emotion Recognition. Int J Neural Syst 2022; 32:2250024. DOI: 10.1142/s0129065722500241.
Abstract
In recent years, speech emotion recognition (SER) has emerged as one of the most active human–machine interaction research areas. Innovative electronic devices, services and applications increasingly aim to check the user's emotional state, either to issue alerts under predefined conditions or to adapt the system responses to the user's emotions. Voice expression is a very rich and noninvasive source of information for emotion assessment. This paper presents a novel SER approach that is a hybrid of a time-distributed convolutional neural network (TD-CNN) and a long short-term memory (LSTM) network. Mel-frequency log-power spectrograms (MFLPSs) extracted from audio recordings are parsed by a sliding window that selects the input for the TD-CNN. The TD-CNN transforms the input image data into a sequence of high-level features that are fed to the LSTM, which carries out the overall signal interpretation. In order to reduce overfitting, the MFLPS representation allows innovative image data augmentation techniques that have no immediate equivalent on the original audio signal. Validation of the proposed hybrid architecture achieves an average recognition accuracy of 73.98% on the most widely used and most challenging publicly distributed database for SER benchmarking. A permutation test confirms that this result is significantly different from random classification ([Formula: see text]). The proposed architecture outperforms state-of-the-art deep learning models as well as conventional machine learning techniques evaluated on the same database when identifying the same number of emotions.
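A hedged PyTorch sketch of the TD-CNN + LSTM idea follows: a sliding window moves over a mel log-power spectrogram, a small CNN is applied to each window (time-distributed), and an LSTM interprets the resulting feature sequence. Window length, hop, layer widths and the seven-class output are assumptions, not the parameters of the published architecture.

```python
# Time-distributed CNN over spectrogram windows, followed by an LSTM.
# All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TDCNNLSTM(nn.Module):
    def __init__(self, win=32, hop=16, feat=64, n_classes=7):
        super().__init__()
        self.win, self.hop = win, hop
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat),
        )
        self.lstm = nn.LSTM(feat, feat, batch_first=True)
        self.head = nn.Linear(feat, n_classes)

    def forward(self, spec):                         # spec: (B, n_mels, T)
        # Sliding window along the time axis -> (B, n_mels, W, win).
        windows = spec.unfold(2, self.win, self.hop)
        windows = windows.permute(0, 2, 1, 3)        # (B, W, n_mels, win)
        b, w = windows.shape[:2]
        feats = self.cnn(windows.reshape(b * w, 1, -1, self.win)).view(b, w, -1)
        out, _ = self.lstm(feats)                    # sequence interpretation
        return self.head(out[:, -1])                 # class scores per utterance

if __name__ == "__main__":
    logits = TDCNNLSTM()(torch.randn(2, 64, 128))    # 2 spectrograms, 128 frames
    print(logits.shape)                              # torch.Size([2, 7])
```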
Affiliation(s)
- Javier De Lope: Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), Madrid, Spain
- Manuel Graña: Computational Intelligence Group, University of the Basque Country (UPV), San Sebastian, Spain
6. Vicente-Querol MA, Fernández-Caballero A, Molina JP, González-Gualda LM, Fernández-Sotos P, García AS. Facial Affect Recognition in Immersive Virtual Reality: Where Is the Participant Looking? Int J Neural Syst 2022; 32:2250029. DOI: 10.1142/s0129065722500290.
7. Avola D, Cascio M, Cinque L, Fagioli A, Foresti GL. Human Silhouette and Skeleton Video Synthesis Through Wi-Fi Signals. Int J Neural Syst 2022; 32:2250015. PMID: 35209810. DOI: 10.1142/s0129065722500150.
Abstract
The increasing availability of wireless access points (APs) is leading toward human sensing applications based on Wi-Fi signals as support or alternative tools to the widespread visual sensors, since these signals make it possible to address well-known vision-related problems such as illumination changes or occlusions. Indeed, using image synthesis techniques to translate radio frequencies into the visible spectrum can become essential to obtain otherwise unavailable visual data. This domain-to-domain translation is feasible because both objects and people affect electromagnetic waves, causing variations at radio and optical frequencies. In the literature, models capable of inferring radio-to-visual feature mappings have gained momentum in the last few years, since frequency changes can be observed in the radio domain through the channel state information (CSI) of Wi-Fi APs, enabling signal-based feature extraction, e.g. amplitude. On this account, this paper presents a novel two-branch generative neural network that effectively maps radio data into visual features, following a teacher-student design that exploits a cross-modality supervision strategy. The latter conditions the signal-based features in the visual domain so that they can completely replace visual data. Once trained, the proposed method synthesizes human silhouette and skeleton videos using exclusively Wi-Fi signals. The approach is evaluated on publicly available data, where it obtains remarkable results for both silhouette and skeleton video generation, demonstrating the effectiveness of the proposed cross-modality supervision strategy.
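The cross-modality teacher-student idea can be sketched as follows: a frozen visual "teacher" produces target features from video frames, a radio "student" maps CSI amplitudes to the same feature space, and a decoder renders a silhouette mask from those features. Network shapes, the CSI dimensionality (3 antennas x 30 subcarriers) and the combined loss are illustrative assumptions, not the paper's two-branch generative architecture.

```python
# One illustrative training step of cross-modality (visual -> radio) supervision
# on dummy data; all shapes and losses are assumptions for illustration.
import torch
import torch.nn as nn

feat_dim = 128
teacher = nn.Sequential(                    # visual branch, frozen during training
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, feat_dim),
)
student = nn.Sequential(                    # radio branch: CSI amplitude -> features
    nn.Linear(3 * 30, 256), nn.ReLU(), nn.Linear(256, feat_dim),
)
decoder = nn.Sequential(                    # features -> 64x64 silhouette mask
    nn.Linear(feat_dim, 64 * 64), nn.Sigmoid(),
)

for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(list(student.parameters()) + list(decoder.parameters()), lr=1e-3)

# 8 synchronized dummy samples: CSI amplitude, RGB frames, silhouette masks.
csi = torch.randn(8, 3 * 30)
frames = torch.randn(8, 3, 64, 64)
masks = torch.rand(8, 1, 64, 64)

opt.zero_grad()
target_feat = teacher(frames)                        # cross-modality supervision target
student_feat = student(csi)
pred_mask = decoder(student_feat).view(-1, 1, 64, 64)
loss = nn.functional.mse_loss(student_feat, target_feat) \
     + nn.functional.binary_cross_entropy(pred_mask, masks)
loss.backward()
opt.step()
```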
Affiliation(s)
- Danilo Avola: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Marco Cascio: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Luigi Cinque: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Alessio Fagioli: Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
- Gian Luca Foresti: Department of Computer Science, Mathematics and Physics, University of Udine, Via delle Scienze 206, Udine 33100, Italy
8. Jiménez P, Corchuelo R. An Experimental Study of Neural Approaches to Multi-Hop Inference in Question Answering. Int J Neural Syst 2022; 32:2250011. PMID: 35172705. DOI: 10.1142/s0129065722500113.
Abstract
Question answering aims at computing the answer to a question given a context with facts. Many proposals focus on questions whose answer is explicit in the context; lately, there has been an increasing interest in questions whose answer is not explicit and requires multi-hop inference to be computed. Our analysis of the literature reveals that there is a seminal proposal with increasingly complex follow-ups. Unfortunately, they were presented without an extensive study of their hyper-parameters, the experimental studies focused exclusively on English, and no statistical analysis to sustain the conclusions was ever performed. In this paper, we report on our experience devising a very simple neural approach to address the problem, on our extensive grid search over the space of hyper-parameters, and on the results attained with English, Spanish, Hindi, and Portuguese, and we sustain our conclusions with statistically sound analyses. Our findings prove that it is possible to beat many of the proposals in the literature with a very simple approach that was likely overlooked due to the difficulty of performing an extensive grid search, that the language does not have a statistically significant impact on the results, and that the empirical differences found among some existing proposals are not statistically significant.
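The experimental methodology (an exhaustive grid search plus statistically sound comparisons) can be illustrated with a short Python sketch; the hyper-parameter names, value grids, the evaluate() stub and the Wilcoxon signed-rank test on dummy per-question scores are assumptions chosen only to show the workflow, not the study's actual search space or statistical test.

```python
# Exhaustive grid search over hyper-parameters plus a paired significance test.
# evaluate() is a stand-in for training/evaluating one configuration.
import itertools
import numpy as np
from scipy.stats import wilcoxon

def evaluate(learning_rate, hidden_size, dropout):
    """Dummy score for one configuration; replace with real training/evaluation."""
    rng = np.random.default_rng(hash((learning_rate, hidden_size, dropout)) % 2**32)
    return rng.uniform(0.5, 0.9)

grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "hidden_size": [128, 256],
    "dropout": [0.1, 0.3],
}
best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=lambda cfg: evaluate(**cfg),
)
print("best configuration:", best)

# Paired test between two systems' per-question scores (dummy numbers);
# a large p-value means the observed difference is not significant.
scores_a = np.array([0.71, 0.64, 0.80, 0.55, 0.67, 0.73, 0.62, 0.78])
scores_b = np.array([0.69, 0.66, 0.79, 0.57, 0.65, 0.74, 0.61, 0.77])
stat, p_value = wilcoxon(scores_a, scores_b)
print(f"Wilcoxon p-value: {p_value:.3f}")
```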
Affiliation(s)
- Patricia Jiménez: Universidad de Sevilla, ETSI Informática, Avda. de la Reina Mercedes s/n, Sevilla E-41012, Spain
- Rafael Corchuelo: Universidad de Sevilla, ETSI Informática, Avda. de la Reina Mercedes s/n, Sevilla E-41012, Spain
9. Low-Altitude Aerial Video Surveillance via One-Class SVM Anomaly Detection from Textural Features in UAV Images. Information 2021. DOI: 10.3390/info13010002.
Abstract
In recent years, small-scale Unmanned Aerial Vehicles (UAVs) have been used in many video surveillance applications, such as vehicle tracking, border control, dangerous object detection, and many others. Anomaly detection can represent a prerequisite of many of these applications thanks to its ability to identify areas and/or objects of interest without knowing them a priori. In this paper, a One-Class Support Vector Machine (OC-SVM) anomaly detector based on customized Haralick textural features for aerial video surveillance at low altitude is presented. The use of a One-Class SVM, which is notoriously a lightweight and fast classifier, enables the implementation of real-time systems even when they are embedded in small-scale UAVs with low computational power. At the same time, the use of textural features allows a vision-based system to detect micro and macro structures of an analyzed surface, thus allowing the identification of small and large anomalies, respectively. The latter aspect plays a key role in aerial video surveillance at low altitude, i.e., 6 to 15 m, where the detection of common items, e.g., cars, is as important as the detection of small and undefined objects, e.g., Improvised Explosive Devices (IEDs). Experiments on the UAV Mosaicking and Change Detection (UMCD) dataset show the effectiveness of the proposed system in terms of accuracy, precision, recall, and F1-score: the model achieves 100% precision, i.e., every reported detection is a true anomaly, at the expense of a reasonable trade-off in recall, which still reaches 71.23%. Moreover, when compared to classical Haralick textural features, the model obtains significantly higher performance, i.e., ≈20% higher on all metrics, further demonstrating the effectiveness of the approach.
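A minimal sketch of this kind of pipeline, assuming scikit-image GLCM descriptors as a stand-in for the customized Haralick features and a scikit-learn One-Class SVM: texture features are computed per patch of an anomaly-free frame to fit the detector, and patches of a new frame are then scored. Patch size, the GLCM settings and nu are illustrative choices, not the paper's parameters.

```python
# Haralick-style GLCM texture descriptors per patch + One-Class SVM anomaly
# detection. Settings are illustrative assumptions.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import OneClassSVM

PROPS = ("contrast", "homogeneity", "energy", "correlation")

def texture_features(patch: np.ndarray) -> np.ndarray:
    """patch: 2D uint8 grayscale patch -> small Haralick-style feature vector."""
    glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    return np.hstack([graycoprops(glcm, p).ravel() for p in PROPS])

def patches(img: np.ndarray, size: int = 32):
    for y in range(0, img.shape[0] - size + 1, size):
        for x in range(0, img.shape[1] - size + 1, size):
            yield img[y:y + size, x:x + size]

# Fit on patches of an anomaly-free frame, then score patches of a new frame.
rng = np.random.default_rng(0)
normal_frame = rng.integers(0, 256, (256, 256), dtype=np.uint8)
test_frame = rng.integers(0, 256, (256, 256), dtype=np.uint8)

X_train = np.array([texture_features(p) for p in patches(normal_frame)])
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_test = np.array([texture_features(p) for p in patches(test_frame)])
labels = detector.predict(X_test)          # +1 = normal patch, -1 = anomaly
print("anomalous patches:", int((labels == -1).sum()))
```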