1. Moslemi C, Sækmose S, Larsen R, Brodersen T, Bay JT, Didriksen M, Nielsen KR, Bruun MT, Dowsett J, Dinh KM, Mikkelsen C, Hyvärinen K, Ritari J, Partanen J, Ullum H, Erikstrup C, Ostrowski SR, Olsson ML, Pedersen OB. A deep learning approach to prediction of blood group antigens from genomic data. Transfusion 2024. [PMID: 39268576] [DOI: 10.1111/trf.18013]
Abstract
BACKGROUND Deep learning methods are revolutionizing natural science. In this study, we aim to apply such techniques to develop blood type prediction models based on cheap-to-analyze and easily scalable screening array genotyping platforms. METHODS Combining existing blood types from blood banks and imputed screening array genotypes for ~111,000 Danish and 1168 Finnish blood donors, we used deep learning techniques to train and validate blood type prediction models for 36 antigens in 15 blood group systems. To account for missing genotypes, an initial denoising autoencoder step was used, followed by a convolutional neural network blood type classifier. RESULTS Two-thirds of the trained blood type prediction models demonstrated an F1-accuracy above 99%. Models for antigens with low or high frequencies (e.g., Cw), small training cohorts (e.g., Cob), or very complicated genetic underpinnings (e.g., RhD) proved more challenging for high-accuracy (>99%) DL modeling. However, in the Danish cohort only 4 out of 36 models (Cob, Cw, D-weak, Kpa) failed to achieve a prediction F1-accuracy above 97%. This high predictive performance was replicated in the Finnish cohort. DISCUSSION High accuracy in a variety of blood groups demonstrates the viability of deep learning-based blood type prediction using array chip genotypes, even in blood groups with nontrivial genetic underpinnings. These techniques are suitable for aiding the identification of blood donors with rare blood types by greatly narrowing down the pool of candidate donors before clinical-grade confirmation.
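As a rough, hypothetical illustration of the two-stage design described in the abstract (a denoising step for incomplete genotypes followed by a CNN classifier), the PyTorch sketch below shows the general shape of such a pipeline; the genotype encoding, layer sizes, and class labels are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a denoising-autoencoder + CNN blood type classifier.
# Shapes, encodings, and hyperparameters are illustrative assumptions only.
import torch
import torch.nn as nn

N_SNPS = 1024        # assumed number of genotyped variants per sample
N_CLASSES = 3        # e.g., antigen-negative / heterozygous / antigen-positive

class DenoisingAutoencoder(nn.Module):
    """Reconstructs genotype dosages (0/1/2) with some entries masked out."""
    def __init__(self, n_snps=N_SNPS, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_snps, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_snps)
    def forward(self, x):
        return self.decoder(self.encoder(x))

class BloodTypeCNN(nn.Module):
    """1D CNN over the denoised genotype vector with a softmax head."""
    def __init__(self, n_snps=N_SNPS, n_classes=N_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(4),
        )
        self.head = nn.Linear(32 * (n_snps // 16), n_classes)
    def forward(self, x):
        h = self.features(x.unsqueeze(1))   # (batch, 1, n_snps) -> (batch, 32, n_snps/16)
        return self.head(h.flatten(1))

# Toy usage: mask 10% of genotypes, denoise, then classify.
x = torch.randint(0, 3, (8, N_SNPS)).float()
mask = torch.rand_like(x) < 0.1
dae, clf = DenoisingAutoencoder(), BloodTypeCNN()
denoised = dae(torch.where(mask, torch.zeros_like(x), x))
logits = clf(denoised)
print(logits.shape)  # torch.Size([8, 3])
```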
Affiliation(s)
- Camous Moslemi
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Institute of Science and Environment, Roskilde University, Roskilde, Denmark
- Susanne Sækmose
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Rune Larsen
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Thorsten Brodersen
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Jakob T Bay
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Maria Didriksen
- Department of Clinical Immunology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Kaspar R Nielsen
- Department of Clinical Immunology, Aalborg University Hospital, Aalborg, Denmark
- Mie T Bruun
- Department of Clinical Immunology, Odense University Hospital, Odense, Denmark
- Joseph Dowsett
- Department of Clinical Immunology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Khoa M Dinh
- Department of Clinical Immunology, Aarhus University Hospital, Aarhus, Denmark
- Christina Mikkelsen
- Department of Clinical Immunology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Jarmo Ritari
- Finnish Red Cross Blood Service, Helsinki, Finland
- Christian Erikstrup
- Department of Clinical Immunology, Aarhus University Hospital, Aarhus, Denmark
- Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Sisse R Ostrowski
- Department of Clinical Immunology, Copenhagen University Hospital, Rigshospitalet, Copenhagen, Denmark
- Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Martin L Olsson
- Department of Laboratory Medicine, Lund University, Lund, Sweden
- Department of Clinical Immunology and Transfusion, Office for Medical Services, Region Skåne, Sweden
- Ole B Pedersen
- Department of Clinical Immunology, Zealand University Hospital, Køge, Denmark
- Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
2. Begazo R, Aguilera A, Dongo I, Cardinale Y. A Combined CNN Architecture for Speech Emotion Recognition. Sensors (Basel) 2024; 24:5797. [PMID: 39275707] [PMCID: PMC11398044] [DOI: 10.3390/s24175797]
Abstract
Emotion recognition through speech is a technique employed in various scenarios of Human-Computer Interaction (HCI). Existing approaches have achieved significant results; however, limitations persist, most notably the quantity and diversity of data required when deep learning techniques are used. The lack of a standard for feature selection leads to continuous development and experimentation, and choosing and designing the appropriate network architecture constitutes another challenge. This study addresses the challenge of recognizing emotions in the human voice using deep learning techniques, proposing a comprehensive approach: it develops preprocessing and feature selection stages and constructs a dataset called EmoDSc by combining several available databases. The synergy between spectral features and spectrogram images is investigated. Independently, the weighted accuracy obtained using only spectral features was 89%, while using only spectrogram images it reached 90%. These results, although surpassing previous research, highlight the strengths and limitations of each representation when operating in isolation. Based on this exploration, a neural network architecture composed of a CNN1D, a CNN2D, and an MLP that fuses spectral features and spectrogram images is proposed. The model, supported by the unified dataset EmoDSc, demonstrates a remarkable accuracy of 96%.
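The following is a minimal PyTorch sketch of the kind of dual-branch fusion the abstract describes (a 1D CNN over spectral feature vectors, a 2D CNN over spectrogram images, and an MLP that fuses them); all layer sizes and input shapes are illustrative assumptions rather than the published architecture.

```python
# Minimal sketch of a dual-branch fusion model: a 1D CNN over spectral feature
# vectors and a 2D CNN over spectrogram images, merged by an MLP classifier.
# All shapes and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    def __init__(self, n_features=180, n_emotions=7):
        super().__init__()
        self.branch1d = nn.Sequential(                    # spectral feature vector branch
            nn.Conv1d(1, 16, 5, padding=2), nn.ReLU(), nn.AdaptiveAvgPool1d(32),
            nn.Flatten(),                                 # -> 16*32 = 512
        )
        self.branch2d = nn.Sequential(                    # spectrogram image branch
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),                                 # -> 32*4*4 = 512
        )
        self.mlp = nn.Sequential(                         # fusion MLP head
            nn.Linear(512 + 512, 128), nn.ReLU(), nn.Linear(128, n_emotions),
        )
    def forward(self, feats, spec):
        z = torch.cat([self.branch1d(feats), self.branch2d(spec)], dim=1)
        return self.mlp(z)

model = FusionSER()
feats = torch.randn(4, 1, 180)      # (batch, channel, n_spectral_features)
spec = torch.randn(4, 1, 128, 128)  # (batch, channel, mel_bins, frames)
print(model(feats, spec).shape)     # torch.Size([4, 7])
```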
Affiliation(s)
- Rolinson Begazo
- Electrical and Electronics Engineering Department, Universidad Católica San Pablo, Arequipa 04001, Peru
- Ana Aguilera
- Escuela de Ingeniería Informática, Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso 2340000, Chile
- Interdisciplinary Center for Biomedical Research and Health Engineering "MEDING", Universidad de Valparaíso, Valparaíso 2340000, Chile
- Irvin Dongo
- Electrical and Electronics Engineering Department, Universidad Católica San Pablo, Arequipa 04001, Peru
- ESTIA Institute of Technology, University Bordeaux, 64210 Bidart, France
- Yudith Cardinale
- Grupo de Investigación en Ciencia de Datos, Universidad Internacional de Valencia, 46002 Valencia, Spain
3. Akinpelu S, Viriri S, Adegun A. An enhanced speech emotion recognition using vision transformer. Sci Rep 2024; 14:13126. [PMID: 38849422] [PMCID: PMC11161461] [DOI: 10.1038/s41598-024-63776-4]
Abstract
In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER has placed significant emphasis on acoustic properties extracted from speech signals. The use of visual signals for enhancing SER performance, however, has been made possible by recent developments in deep learning and computer vision. This work utilizes a lightweight Vision Transformer (ViT) model to propose a novel method for improving speech emotion recognition. We leverage the ViT model's capability to capture spatial dependencies and high-level features in images, which are adequate indicators of emotional states, from the mel spectrogram input fed into the model. To determine the efficiency of our proposed approach, we conduct a comprehensive experiment on two benchmark speech emotion datasets, the Toronto Emotional Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results of our extensive experiment demonstrate a considerable improvement in speech emotion recognition accuracy, attesting to the method's generalizability, with 98%, 91%, and 93% accuracy on TESS, EMODB, and the combined TESS-EMODB set, respectively. The outcomes of the comparative experiment show that the non-overlapping patch-based feature extraction method substantially improves the discipline of speech emotion recognition. Our research indicates the potential for integrating vision transformer models into SER systems, opening up fresh opportunities for real-world applications requiring accurate emotion recognition from speech compared with other state-of-the-art techniques.
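A minimal sketch of the general idea (treating a mel spectrogram as a single-channel image and classifying it with a small Vision Transformer) is shown below, assuming the torchaudio and timm libraries; the model name, image size, and emotion count are assumptions, not the authors' configuration.

```python
# Sketch: mel spectrogram -> single-channel "image" -> small ViT classifier.
# Library choices, model name, and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchaudio
import timm

N_EMOTIONS = 7
melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
vit = timm.create_model("vit_tiny_patch16_224", pretrained=False,
                        num_classes=N_EMOTIONS, in_chans=1)

waveform = torch.randn(1, 16000 * 3)              # stand-in for a 3 s utterance
spec = to_db(melspec(waveform)).unsqueeze(0)      # (1, 1, 128, frames)
spec = F.interpolate(spec, size=(224, 224), mode="bilinear", align_corners=False)
logits = vit(spec)
print(logits.shape)                               # torch.Size([1, 7])
```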
Affiliation(s)
- Samson Akinpelu
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
- Serestina Viriri
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
- Adekanmi Adegun
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
4. Castro-Ospina AE, Solarte-Sanchez MA, Vega-Escobar LS, Isaza C, Martínez-Vargas JD. Graph-Based Audio Classification Using Pre-Trained Models and Graph Neural Networks. Sensors (Basel) 2024; 24:2106. [PMID: 38610318] [PMCID: PMC11014159] [DOI: 10.3390/s24072106]
Abstract
Sound classification plays a crucial role in enhancing the interpretation, analysis, and use of acoustic data, leading to a wide range of practical applications, of which environmental sound analysis is one of the most important. In this paper, we explore the representation of audio data as graphs in the context of sound classification. We propose a methodology that leverages pre-trained audio models to extract deep features from audio files, which are then employed as node information to build graphs. Subsequently, we train various graph neural networks (GNNs), specifically graph convolutional networks (GCNs), GraphSAGE, and graph attention networks (GATs), to solve multi-class audio classification problems. Our findings underscore the effectiveness of employing graphs to represent audio data. Moreover, they highlight the competitive performance of GNNs in sound classification endeavors, with the GAT model emerging as the top performer, achieving a mean accuracy of 83% in classifying environmental sounds and 91% in identifying the land cover of a site based on its audio recording. In conclusion, this study provides novel insights into the potential of graph representation learning techniques for analyzing audio data.
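Below is a hypothetical sketch of such a pipeline: pre-trained-model embeddings serve as node features, a k-nearest-neighbor similarity graph connects the clips, and a small GAT (via PyTorch Geometric) classifies each node; the embedding dimension, value of k, and class count are assumptions for illustration.

```python
# Hypothetical sketch: embeddings -> k-NN similarity graph -> GAT node classifier.
# Dimensions, k, and class count are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

def knn_edges(x, k=5):
    """Connect every node to its k most similar nodes (cosine similarity)."""
    sim = F.normalize(x, dim=1) @ F.normalize(x, dim=1).t()
    sim.fill_diagonal_(-1.0)                        # exclude self-loops
    nbrs = sim.topk(k, dim=1).indices               # (N, k)
    src = torch.arange(x.size(0)).repeat_interleave(k)
    return torch.stack([src, nbrs.reshape(-1)])     # (2, N*k) edge index

class AudioGAT(torch.nn.Module):
    def __init__(self, in_dim=768, hidden=64, n_classes=10, heads=4):
        super().__init__()
        self.g1 = GATConv(in_dim, hidden, heads=heads)
        self.g2 = GATConv(hidden * heads, n_classes, heads=1)
    def forward(self, x, edge_index):
        x = F.elu(self.g1(x, edge_index))
        return self.g2(x, edge_index)

x = torch.randn(100, 768)                           # stand-in for pre-trained audio embeddings
edge_index = knn_edges(x, k=5)
logits = AudioGAT()(x, edge_index)
print(logits.shape)                                 # torch.Size([100, 10])
```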
Affiliation(s)
- Andrés Eduardo Castro-Ospina
- Grupo de Investigación Máquinas Inteligentes y Reconocimiento de Patrones, Instituto Tecnológico Metropolitano, Medellín 050013, Colombia
- Miguel Angel Solarte-Sanchez
- Grupo de Investigación Máquinas Inteligentes y Reconocimiento de Patrones, Instituto Tecnológico Metropolitano, Medellín 050013, Colombia
- Laura Stella Vega-Escobar
- Grupo de Investigación Máquinas Inteligentes y Reconocimiento de Patrones, Instituto Tecnológico Metropolitano, Medellín 050013, Colombia
- Claudia Isaza
- SISTEMIC, Electronic Engineering Department, Universidad de Antioquia-UdeA, Medellín 050010, Colombia
5. Diemerling H, Stresemann L, Braun T, von Oertzen T. Implementing machine learning techniques for continuous emotion prediction from uniformly segmented voice recordings. Front Psychol 2024; 15:1300996. [PMID: 38572198] [PMCID: PMC10987695] [DOI: 10.3389/fpsyg.2024.1300996]
Abstract
Introduction Emotional recognition from audio recordings is a rapidly advancing field, with significant implications for artificial intelligence and human-computer interaction. This study introduces a novel method for detecting emotions from short, 1.5 s audio samples, aiming to improve accuracy and efficiency in emotion recognition technologies. Methods We utilized 1,510 unique audio samples from two databases in German and English to train our models. We extracted various features for emotion prediction, employing Deep Neural Networks (DNN) for general feature analysis, Convolutional Neural Networks (CNN) for spectrogram analysis, and a hybrid model combining both approaches (C-DNN). The study addressed challenges associated with dataset heterogeneity, language differences, and the complexities of audio sample trimming. Results Our models demonstrated accuracy significantly surpassing random guessing, aligning closely with human evaluative benchmarks. This indicates the effectiveness of our approach in recognizing emotional states from brief audio clips. Discussion Despite the challenges of integrating diverse datasets and managing short audio samples, our findings suggest considerable potential for this methodology in real-time emotion detection from continuous speech. This could contribute to improving the emotional intelligence of AI and its applications in various areas.
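A small sketch of the uniform segmentation step described above is given below: a recording is cut into fixed 1.5 s windows before feature extraction; the sampling rate and hop length are assumptions for illustration.

```python
# Sketch of uniform 1.5 s segmentation prior to feature extraction.
# Sampling rate and hop are illustrative assumptions.
import numpy as np

def segment_uniform(y, sr=16000, win_s=1.5, hop_s=1.5):
    """Return a (n_segments, win_samples) array of equal-length audio chunks."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(y) < win:
        y = np.pad(y, (0, win - len(y)))             # pad clips shorter than one window
    starts = range(0, len(y) - win + 1, hop)
    return np.stack([y[s:s + win] for s in starts])

y = np.random.randn(16000 * 5)                        # stand-in for a 5 s recording
segments = segment_uniform(y)
print(segments.shape)                                 # (3, 24000)
```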
Affiliation(s)
- Hannes Diemerling
- Center for Lifespan Psychology, Max Planck Institute for Human Development, Berlin, Germany
- Thomas Bayes Institute, Berlin, Germany
- Department of Psychology, Humboldt-Universität zu Berlin, Berlin, Germany
- Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
- Leonie Stresemann
- Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
- Tina Braun
- Department of Psychology, University of the Bundeswehr München, Neubiberg, Germany
- Department of Psychology, Charlotte-Fresenius University, Wiesbaden, Germany
- Timo von Oertzen
- Center for Lifespan Psychology, Max Planck Institute for Human Development, Berlin, Germany
- Thomas Bayes Institute, Berlin, Germany
6. Pentari A, Kafentzis G, Tsiknakis M. Speech emotion recognition via graph-based representations. Sci Rep 2024; 14:4484. [PMID: 38396002] [PMCID: PMC10891082] [DOI: 10.1038/s41598-024-52989-2]
Abstract
Speech emotion recognition (SER) has gained increased interest during the last decades as part of enriched affective computing. As a consequence, a variety of engineering approaches have been developed to address the SER problem, exploiting different features, learning algorithms, and datasets. In this paper, we propose the application of graph theory for classifying emotionally colored speech signals. Graph theory provides tools for extracting statistical as well as structural information from any time series, and we propose to use this information as a novel feature set. Furthermore, we suggest setting a unique feature-based identity for each emotion belonging to each speaker. The emotion classification is performed by a Random Forest classifier in a Leave-One-Speaker-Out Cross Validation (LOSO-CV) scheme. The proposed method is compared with two state-of-the-art approaches involving well-known hand-crafted features as well as deep learning architectures operating on mel-spectrograms. Experimental results on three datasets, EMODB (German, acted), AESDD (Greek, acted), and DEMoS (Italian, in-the-wild), reveal that our proposed method outperforms the comparative methods on these datasets. Specifically, we observe an average UAR increase of almost [Formula: see text], [Formula: see text] and [Formula: see text], respectively.
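As a rough illustration of the graph-based feature idea, the sketch below maps each signal to a horizontal-visibility graph, uses a few graph statistics as features, and evaluates a Random Forest with leave-one-speaker-out cross-validation; the specific statistics and the toy data are assumptions, not the paper's feature set.

```python
# Sketch: time series -> horizontal-visibility graph -> graph statistics ->
# Random Forest with leave-one-speaker-out CV. Features and data are toy assumptions.
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def horizontal_visibility_graph(x):
    """Two samples are linked if every sample between them is lower than both."""
    g = nx.Graph()
    g.add_nodes_from(range(len(x)))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if all(x[k] < min(x[i], x[j]) for k in range(i + 1, j)):
                g.add_edge(i, j)
    return g

def graph_features(x):
    g = horizontal_visibility_graph(x)
    degs = [d for _, d in g.degree()]
    return [np.mean(degs), np.max(degs), nx.density(g), nx.average_clustering(g)]

# Toy data: 40 short signals from 8 speakers, 2 emotion classes.
rng = np.random.default_rng(0)
X = np.array([graph_features(rng.standard_normal(64)) for _ in range(40)])
y = rng.integers(0, 2, 40)
speakers = np.repeat(np.arange(8), 5)
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y,
                         groups=speakers, cv=LeaveOneGroupOut())
print(scores.mean())
```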
Affiliation(s)
- Anastasia Pentari
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, GR-700 13, Greece.
- George Kafentzis
- Computer Science Department, University of Crete, Heraklion, GR-700 13, Greece
- Manolis Tsiknakis
- Institute of Computer Science, Foundation for Research and Technology-Hellas, Heraklion, GR-700 13, Greece
- Department of Electrical and Computer Engineering, Hellenic Mediterranean University, Heraklion, Greece
7. Matt JE, Rizzo DM, Javed A, Eppstein MJ, Manukyan V, Gramling C, Dewoolkar AM, Gramling R. An Acoustical and Lexical Machine-Learning Pipeline to Identify Connectional Silences. J Palliat Med 2023; 26:1627-1633. [PMID: 37440175] [DOI: 10.1089/jpm.2023.0087]
Abstract
Context: Developing scalable methods for conversation analytics is essential for health care communication science and quality improvement. Purpose: To assess the feasibility of automating the identification of a conversational feature, Connectional Silence, which is associated with important patient outcomes. Methods: Using audio recordings from the Palliative Care Communication Research Initiative cohort study, we develop and test an automated measurement pipeline comprising three machine-learning (ML) tools-a random forest algorithm and a custom convolutional neural network that operate in parallel on audio recordings, and subsequently a natural language processing algorithm that uses brief excerpts of automated speech-to-text transcripts. Results: Our ML pipeline identified Connectional Silence with an overall sensitivity of 84% and specificity of 92%. For Emotional and Invitational subtypes, we observed sensitivities of 68% and 67%, and specificities of 95% and 97%, respectively. Conclusion: These findings support the capacity for coordinated and complementary ML methods to fully automate the identification of Connectional Silence in natural hospital-based clinical conversations.
Affiliation(s)
- Jeremy E Matt
- Graduate Program in Complex Systems and Data Science, College of Engineering and Mathematical Sciences, University of Vermont, Burlington, Vermont, USA
- Donna M Rizzo
- Department of Civil and Environmental Engineering, University of Vermont, Burlington, Vermont, USA
- Ali Javed
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Stanford University, Stanford, California, USA
- Margaret J Eppstein
- Department of Computer Science, University of Vermont, Burlington, Vermont, USA
- Cailin Gramling
- Graduate Program in Complex Systems and Data Science, College of Engineering and Mathematical Sciences, University of Vermont, Burlington, Vermont, USA
- Advik Mandar Dewoolkar
- Department of Electrical and Biomedical Engineering, University of Vermont, Burlington, Vermont, USA
- Robert Gramling
- Department of Family Medicine, University of Vermont, Burlington, Vermont, USA
8. Ziyadinov V, Tereshonok M. Low-Pass Image Filtering to Achieve Adversarial Robustness. Sensors (Basel) 2023; 23:9032. [PMID: 38005420] [PMCID: PMC10675189] [DOI: 10.3390/s23229032]
Abstract
In this paper, we continue the research cycle on the properties of convolutional neural network-based image recognition systems and ways to improve their noise immunity and robustness. A currently popular research area related to artificial neural networks is adversarial attacks. Adversarial perturbations of an image are barely perceptible to the human eye, yet they drastically reduce a neural network's accuracy. Image perception by a machine is highly dependent on the propagation of high-frequency distortions throughout the network, whereas a human efficiently ignores high-frequency distortions, perceiving the shape of objects as a whole. We propose a technique to reduce the influence of high-frequency noise on CNNs. We show that low-pass image filtering can improve image recognition accuracy in the presence of high-frequency distortions, in particular those caused by adversarial attacks. This technique is resource efficient and easy to implement. The proposed technique brings the logic of an artificial neural network closer to that of a human, for whom high-frequency distortions are not decisive in object recognition.
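A minimal sketch of the defense described above is shown below: a Gaussian low-pass filter is applied to the input image before it reaches the CNN, attenuating high-frequency perturbations; the filter width and the use of a torchvision classifier are illustrative assumptions.

```python
# Sketch: Gaussian low-pass prefiltering of the input before CNN inference.
# Filter width and classifier choice are illustrative assumptions.
import torch
import torchvision.models as models
from scipy.ndimage import gaussian_filter

def low_pass(image, sigma=1.0):
    """Blur each channel of a (C, H, W) float image with a Gaussian kernel."""
    return torch.from_numpy(
        gaussian_filter(image.numpy(), sigma=(0, sigma, sigma))
    )

model = models.resnet18(weights=None).eval()
x = torch.rand(3, 224, 224)                     # stand-in for a (possibly attacked) image
x_filtered = low_pass(x, sigma=1.5)
with torch.no_grad():
    logits = model(x_filtered.unsqueeze(0))
print(logits.shape)                             # torch.Size([1, 1000])
```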
Affiliation(s)
- Vadim Ziyadinov
- Science and Research Department, Moscow Technical University of Communications and Informatics, 111024 Moscow, Russia
- Maxim Tereshonok
- Science and Research Department, Moscow Technical University of Communications and Informatics, 111024 Moscow, Russia
- Skobeltsyn Institute of Nuclear Physics (SINP MSU), Lomonosov Moscow State University, 119991 Moscow, Russia
9. Ullah R, Asif M, Shah WA, Anjam F, Ullah I, Khurshaid T, Wuttisittikulkij L, Shah S, Ali SM, Alibakhshikenari M. Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer. Sensors (Basel) 2023; 23:6212. [PMID: 37448062] [DOI: 10.3390/s23136212]
Abstract
Speech emotion recognition (SER) is a challenging task in human-computer interaction (HCI) systems. One of the key challenges in speech emotion recognition is to extract the emotional features effectively from a speech utterance. Despite the promising results of recent studies, they generally do not leverage advanced fusion algorithms for the generation of effective representations of emotional features in speech utterances. To address this problem, we describe the fusion of spatial and temporal feature representations of speech emotion by parallelizing convolutional neural networks (CNNs) and a Transformer encoder for SER. We stack two parallel CNNs for spatial feature representation in parallel to a Transformer encoder for temporal feature representation, thereby simultaneously expanding the filter depth and reducing the feature map with an expressive hierarchical feature representation at a lower computational cost. We use the RAVDESS dataset to recognize eight different speech emotions. We augment and intensify the variations in the dataset to minimize model overfitting. Additive White Gaussian Noise (AWGN) is used to augment the RAVDESS dataset. With the spatial and sequential feature representations of CNNs and the Transformer, the SER model achieves 82.31% accuracy for eight emotions on a hold-out dataset. In addition, the SER system is evaluated with the IEMOCAP dataset and achieves 79.42% recognition accuracy for five emotions. Experimental results on the RAVDESS and IEMOCAP datasets show the success of the presented SER system and demonstrate an absolute performance improvement over the state-of-the-art (SOTA) models.
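The following is a small sketch of the AWGN augmentation step mentioned above: white Gaussian noise is mixed into each waveform at a chosen signal-to-noise ratio to enlarge the training set; the SNR values are illustrative assumptions.

```python
# Sketch of Additive White Gaussian Noise (AWGN) augmentation at a target SNR.
import numpy as np

def add_awgn(y, snr_db=20.0, rng=None):
    """Return y corrupted with white Gaussian noise at the given SNR (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # stand-in for a waveform
augmented = add_awgn(clean, snr_db=15.0)
print(augmented.shape)                                       # (16000,)
```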
Affiliation(s)
- Rizwan Ullah
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
- Muhammad Asif
- Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan
- Wahab Ali Shah
- Department of Electrical Engineering, Namal University, Mianwali 42250, Pakistan
- Fakhar Anjam
- Department of Electrical Engineering, Main Campus, University of Science & Technology, Bannu 28100, Pakistan
- Ibrar Ullah
- Department of Electrical Engineering, Kohat Campus, University of Engineering and Technology Peshawar, Kohat 25000, Pakistan
- Tahir Khurshaid
- Department of Electrical Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
- Lunchakorn Wuttisittikulkij
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
- Shashi Shah
- Wireless Communication Ecosystem Research Unit, Department of Electrical Engineering, Chulalongkorn University, Bangkok 10330, Thailand
- Syed Mansoor Ali
- Department of Physics and Astronomy, College of Science, King Saud University, P.O. Box 2455, Riyadh 11451, Saudi Arabia
- Mohammad Alibakhshikenari
- Department of Signal Theory and Communications, Universidad Carlos III de Madrid, Leganés, 28911 Madrid, Spain
10. Mang L, Canadas-Quesada F, Carabias-Orti J, Combarro E, Ranilla J. Cochleogram-based adventitious sounds classification using convolutional neural networks. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104555]
11. Aspect-Based Sentiment Analysis of Customer Speech Data Using Deep Convolutional Neural Network and BiLSTM. Cognit Comput 2023. [DOI: 10.1007/s12559-023-10127-6]
12. Rajput V, Mulay P, Pandya S, Mahajan C, Deshpande R. Blood Pressure Estimation Using Emotion-Based Optimization Clustering Model. Acta Informatica Pragensia 2023. [DOI: 10.18267/j.aip.209]
13. Zhang LM, Li Y, Zhang YT, Ng GW, Leau YB, Yan H. A Deep Learning Method Using Gender-Specific Features for Emotion Recognition. Sensors (Basel) 2023; 23:1355. [PMID: 36772395] [PMCID: PMC9921859] [DOI: 10.3390/s23031355]
Abstract
Speech reflects people's mental state, and using a microphone sensor is a potential method for human-computer interaction; speech recognition using this sensor is conducive to the diagnosis of mental illnesses. Gender differences between speakers affect speech emotion recognition based on specific acoustic features, reducing emotion recognition accuracy. Therefore, we believe that the accuracy of speech emotion recognition can be effectively improved by selecting different speech features for emotion recognition based on the speech representations of different genders. In this paper, we propose a speech emotion recognition method based on gender classification. First, we use an MLP to classify the original speech by gender. Second, based on the different acoustic features of male and female speech, we analyze the influence weights of multiple speech emotion features in male and female speech and establish optimal feature sets for male and female emotion recognition, respectively. Finally, we train and test CNN and BiLSTM models, respectively, using the male and female speech emotion feature sets. The results show that the proposed emotion recognition models have an advantage in terms of average recognition accuracy compared with gender-mixed recognition models.
Affiliation(s)
- Li-Min Zhang
- Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an 610116, China
- Faculty of Computing and Informatics, Universiti Malaysia Sabah, Sabah 88400, Malaysia
- Yang Li
- Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an 610116, China
- Yue-Ting Zhang
- Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an 610116, China
- Giap Weng Ng
- Faculty of Computing and Informatics, Universiti Malaysia Sabah, Sabah 88400, Malaysia
- Yu-Beng Leau
- Faculty of Computing and Informatics, Universiti Malaysia Sabah, Sabah 88400, Malaysia
- Hao Yan
- Key Laboratory for Artificial Intelligence and Cognitive Neuroscience of Language, Xi’an International Studies University, Xi’an 610116, China
14. Wang X, He X, Wei J, Liu J, Li Y, Liu X. Application of artificial intelligence to the public health education. Front Public Health 2023; 10:1087174. [PMID: 36703852] [PMCID: PMC9872201] [DOI: 10.3389/fpubh.2022.1087174]
Abstract
With the global outbreak of coronavirus disease 2019 (COVID-19), public health has received unprecedented attention. Cultivating emergency-response and interdisciplinary professionals through public health education has become a general trend. However, current public health education is limited to traditional teaching models that struggle to balance theory and practice. Fortunately, the development of artificial intelligence (AI) has entered the stage of intelligent cognition. The introduction of AI in education has opened a new era of computer-assisted education, bringing new possibilities for teaching and learning in public health education. AI based on big data not only provides abundant resources for public health research and management but also makes it easier for students to obtain public health data and information, which supports the construction of introductory professional courses. In this review, we elaborate on the current status and limitations of public health education, summarize the application of AI in public health practice, and further propose a framework for integrating AI into the public health education curriculum. With rapid technological advancements, we believe that AI will revolutionize the education paradigm of public health and help respond to public health emergencies.
Affiliation(s)
- Xueyan Wang
- Laboratory of Integrative Medicine, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, Sichuan, China
- Xiujing He
- Laboratory of Integrative Medicine, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, Sichuan, China
- Jiawei Wei
- Research Center for Nano-Biomaterials, Analytical and Testing Center, Sichuan University, Chengdu, Sichuan, China
- Jianping Liu
- The First People's Hospital of Yibin, Yibin, Sichuan, China
- Yuanxi Li
- Laboratory of Integrative Medicine, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, Sichuan, China
- Xiaowei Liu
- Laboratory of Integrative Medicine, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University, Chengdu, Sichuan, China
15. Tejaswini V, Babu KS, Sahoo B. Depression Detection from Social Media Text Analysis using Natural Language Processing Techniques and Hybrid Deep Learning Model. ACM Trans Asian Low-Resour Lang Inf Process 2022. [DOI: 10.1145/3569580]
Abstract
Depression is a kind of emotion that negatively impacts people's daily lives. The number of people suffering from such long-term feelings is increasing every year across the globe. Depressed patients may engage in self-harm behaviors, which occasionally result in suicide. Many psychiatrists struggle to identify the presence of mental illness or negative emotion early enough to provide a better course of treatment before patients reach a critical stage. One of the most challenging problems is detecting depression in people at the earliest possible stage. Researchers are using Natural Language Processing (NLP) techniques to analyze text content uploaded on social media, which helps to design approaches for detecting depression. This work analyzes numerous prior studies that used learning techniques to identify depression. Existing methods suffer from inadequate model representations, limiting their ability to detect depression from text with high accuracy. The present work addresses these problems by creating a new hybrid deep learning neural network design with better text representations, called "Fasttext Convolution Neural Network with Long Short-Term Memory (FCL)." In addition, this work leverages NLP to simplify the text analysis during model development. The FCL model comprises fasttext embeddings for better text representation, accounting for out-of-vocabulary (OOV) words with semantic information, a convolutional neural network (CNN) architecture to extract global information, and a Long Short-Term Memory (LSTM) architecture to extract local features with dependencies. The present work was evaluated on real-world datasets used in the literature. The proposed technique provides better results than the state of the art for detecting depression with high accuracy.
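A hypothetical sketch of the hybrid FCL idea (pretrained fastText-style embeddings feeding a CNN for global patterns and an LSTM for local dependencies) is given below in PyTorch; the vocabulary size, embedding dimension, and sequence length are assumptions, not the paper's settings.

```python
# Sketch of an embedding -> Conv1D -> LSTM -> classifier stack in the spirit of FCL.
# Vocabulary size, dimensions, and sequence length are illustrative assumptions.
import torch
import torch.nn as nn

class FCL(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, n_classes=2,
                 pretrained_embeddings=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if pretrained_embeddings is not None:           # e.g., a fastText matrix
            self.embed.weight.data.copy_(pretrained_embeddings)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(128, 64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)
    def forward(self, token_ids):
        x = self.embed(token_ids)                        # (B, T, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2) # (B, T/2, 128)
        _, (h, _) = self.lstm(x)                         # last hidden state
        return self.fc(h[-1])

model = FCL()
tokens = torch.randint(0, 20000, (4, 100))               # stand-in for tokenized posts
print(model(tokens).shape)                               # torch.Size([4, 2])
```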
Affiliation(s)
- Vankayala Tejaswini
- Computer Science and Engineering, National Institute of Technology Rourkela, Odisha, India
- Korra Sathya Babu
- Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing, Kurnool, Andhra Pradesh, India
- Bibhudatta Sahoo
- Computer Science and Engineering, National Institute of Technology Rourkela, Odisha, India
16. Recognition of Emotion with Intensity from Speech Signal Using 3D Transformed Feature and Deep Learning. Electronics 2022. [DOI: 10.3390/electronics11152362]
Abstract
Speech Emotion Recognition (SER), the extraction of emotional features with the appropriate classification from speech signals, has recently received attention for its emerging social applications. Emotional intensity (e.g., Normal, Strong) for a particular emotional expression (e.g., Sad, Angry) has a crucial influence on social activities. A person with intense sadness or anger may fall into severe disruptive action, eventually triggering a suicidal or devastating act. However, existing Deep Learning (DL)-based SER models only consider the categorization of emotion, ignoring the respective emotional intensity, despite its utmost importance. In this study, a novel scheme for Recognition of Emotion with Intensity from Speech (REIS) is developed using the DL model by integrating three speech signal transformation methods, namely Mel-frequency Cepstral Coefficient (MFCC), Short-time Fourier Transform (STFT), and Chroma STFT. The integrated 3D form of transformed features from three individual methods is fed into the DL model. Moreover, under the proposed REIS, both the single and cascaded frameworks with DL models are investigated. A DL model consists of a 3D Convolutional Neural Network (CNN), Time Distribution Flatten (TDF) layer, and Bidirectional Long Short-term Memory (Bi-LSTM) network. The 3D CNN block extracts convolved features from 3D transformed speech features. The convolved features were flattened through the TDF layer and fed into Bi-LSTM to classify emotion with intensity in a single DL framework. The 3D transformed feature is first classified into emotion categories in the cascaded DL framework using a DL model. Then, using a different DL model, the intensity level of the identified categories is determined. The proposed REIS has been evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) benchmark dataset, and the cascaded DL framework is found to be better than the single DL framework. The proposed REIS method has shown remarkable recognition accuracy, outperforming related existing methods.
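Below is a rough sketch of the 3D feature construction the abstract describes: MFCC, STFT magnitude, and Chroma STFT are computed with librosa, rescaled to a common size, and stacked as three channels for a 3D CNN; the target size and extraction parameters are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: stack MFCC, STFT magnitude, and Chroma STFT into a 3-channel tensor.
# Target size and librosa parameters are illustrative assumptions.
import numpy as np
import librosa
from scipy.ndimage import zoom

def transformed_3d_feature(y, sr, target=(64, 64)):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    stft = np.abs(librosa.stft(y, n_fft=512))
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    channels = []
    for feat in (mfcc, stft, chroma):
        scale = (target[0] / feat.shape[0], target[1] / feat.shape[1])
        channels.append(zoom(feat, scale, order=1))       # bilinear rescale
    return np.stack(channels)                             # (3, 64, 64)

sr = 22050
y = np.random.randn(sr * 3).astype(np.float32)            # stand-in for a 3 s utterance
feature = transformed_3d_feature(y, sr)
print(feature.shape)                                       # (3, 64, 64)
```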
17.
18. Lee MC, Yeh SC, Chang JW, Chen ZY. Research on Chinese Speech Emotion Recognition Based on Deep Neural Network and Acoustic Features. Sensors (Basel) 2022; 22:4744. [PMID: 35808238] [PMCID: PMC9269147] [DOI: 10.3390/s22134744]
Abstract
In recent years, the use of Artificial Intelligence for emotion recognition has attracted much attention. The industrial applicability of emotion recognition is quite comprehensive and has good development potential. This research applies voice emotion recognition technology to Chinese speech emotion recognition. The main purpose is to move increasingly popular smart-home voice assistants and AI service robots from a touch-sensitive interface to voice operation. This research proposes a specifically designed Deep Neural Network (DNN) model to develop a Chinese speech emotion recognition system, using 29 acoustic characteristics from acoustic theory as the training attributes of the proposed model. This research also proposes a variety of audio adjustment methods to enlarge datasets and enhance training accuracy, including waveform adjustment, pitch adjustment, and pre-emphasis. This study achieved an average emotion recognition accuracy of 88.9% on the CASIA Chinese sentiment corpus. The results show that the deep learning model and audio adjustment methods proposed in this study can effectively identify the emotions of Chinese short sentences and can be applied to Chinese voice assistants or integrated with other dialogue applications.
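A small sketch of two of the audio adjustment methods named above (pitch adjustment and pre-emphasis) is given below using librosa; the shift amounts and pre-emphasis coefficient are illustrative assumptions.

```python
# Sketch of pitch-shift and pre-emphasis adjustments for dataset augmentation.
# Shift amounts and the pre-emphasis coefficient are illustrative assumptions.
import numpy as np
import librosa

def pre_emphasize(y, coef=0.97):
    """Boost high frequencies: y[t] - coef * y[t-1]."""
    return np.append(y[0], y[1:] - coef * y[:-1])

def pitch_shift(y, sr, n_steps=2.0):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t).astype(np.float32)   # stand-in for an utterance
augmented = [pre_emphasize(y), pitch_shift(y, sr, 2.0), pitch_shift(y, sr, -2.0)]
print([a.shape for a in augmented])
```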
Affiliation(s)
- Ming-Che Lee
- Department of Computer and Communication Engineering, Ming Chuan University, Taoyuan 333, Taiwan
- Sheng-Cheng Yeh
- Department of Computer and Communication Engineering, Ming Chuan University, Taoyuan 333, Taiwan
- Jia-Wei Chang
- Department of Computer Science and Information Engineering, National Taichung University of Science and Technology, Taichung City 404, Taiwan
- Zhen-Yi Chen
- Department of Computer and Communication Engineering, Ming Chuan University, Taoyuan 333, Taiwan
19. Advanced Fusion-Based Speech Emotion Recognition System Using a Dual-Attention Mechanism with Conv-Caps and Bi-GRU Features. Electronics 2022. [DOI: 10.3390/electronics11091328]
Abstract
Recognizing the speaker’s emotional state from speech signals plays a very crucial role in human–computer interaction (HCI). Nowadays, numerous linguistic resources are available, but most of them contain samples of a discrete length. In this article, we address the leading challenge in Speech Emotion Recognition (SER), which is how to extract the essential emotional features from utterances of a variable length. To obtain better emotional information from the speech signals and increase the diversity of the information, we present an advanced fusion-based dual-channel self-attention mechanism using convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) networks. We extracted six spectral features (Mel-spectrograms, Mel-frequency cepstral coefficients, chromagrams, spectral contrast, the zero-crossing rate, and the root mean square). The Conv-Cap module was used to process the Mel-spectrograms, while the Bi-GRU handled the rest of the spectral features from the input tensor. The self-attention layer was employed in each module to selectively focus on optimal cues and determine the attention weights to yield high-level features. Finally, we utilized a confidence-based fusion method to fuse all high-level features and pass them through fully connected layers to classify the emotional states. The proposed model was evaluated on the Berlin (EMO-DB), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and Odia (SITB-OSED) datasets. During the experiments, the proposed model achieved high weighted accuracy (WA) and unweighted accuracy (UA) values, i.e., 90.31% and 87.61%, 76.84% and 70.34%, and 87.52% and 86.19%, respectively, demonstrating that it outperformed state-of-the-art models on the same datasets.
20. Improved Security of E-Healthcare Images Using Hybridized Robust Zero-Watermarking and Hyper-Chaotic System along with RSA. Mathematics 2022. [DOI: 10.3390/math10071071]
Abstract
With the rapid advancements of the internet of things (IoT), several applications have evolved with completely dissimilar structures and requirements. However, the fifth generation of mobile cellular networks (5G) is unable to successfully support the dissimilar structures and requirements. The sixth generation of mobile cellular networks (6G) is likely to enable new and unidentified applications with varying requirements. Therefore, 6G not only provides 10 to 100 times the speed of 5G, but 6G can also provide dynamic services for advanced IoT applications. However, providing security to 6G networks is still a significant problem. Therefore, in this paper, a hybrid image encryption technique is proposed to secure multimedia data communication over 6G networks. Initially, multimedia data are encrypted by using the proposed model. Thereafter, the encrypted data are then transferred over the 6G networks. Extensive experiments are conducted by using various attacks and security measures. A comparative analysis reveals that the proposed model achieves remarkably good performance as compared to the existing encryption techniques.
21. Fathi Y, Erfanian A. Decoding Bilateral Hindlimb Kinematics From Cat Spinal Signals Using Three-Dimensional Convolutional Neural Network. Front Neurosci 2022; 16:801818. [PMID: 35401098] [PMCID: PMC8990134] [DOI: 10.3389/fnins.2022.801818]
Abstract
To date, decoding limb kinematic information mostly relies on neural signals recorded from the peripheral nerve, dorsal root ganglia (DRG), ventral roots, spinal cord gray matter, and the sensorimotor cortex. In the current study, we demonstrated that the neural signals recorded from the lateral and dorsal columns within the spinal cord have the potential to decode hindlimb kinematics during locomotion. Experiments were conducted using intact cats. The cats were trained to walk on a moving belt in a hindlimb-only condition, while their forelimbs were kept on the front body of the treadmill. The bilateral hindlimb joint angles were decoded using local field potential signals recorded using a microelectrode array implanted in the dorsal and lateral columns of both the left and right sides of the cat spinal cord. The results show that contralateral hindlimb kinematics can be decoded as accurately as ipsilateral kinematics. Interestingly, hindlimb kinematics of both legs can be accurately decoded from the lateral columns within one side of the spinal cord during hindlimb-only locomotion. The results indicated that there was no significant difference between the decoding performances obtained using neural signals recorded from the dorsal and lateral columns. The results of the time-frequency analysis show that event-related synchronization (ERS) and event-related desynchronization (ERD) patterns in all frequency bands could reveal the dynamics of the neural signals during movement. The onset and offset of the movement can be clearly identified by the ERD/ERS patterns. The results of the mutual information (MI) analysis showed that the theta frequency band contained significantly more limb kinematics information than the other frequency bands. Moreover, the theta power increased with a higher locomotion speed.
Affiliation(s)
- Yaser Fathi
- Department of Biomedical Engineering, School of Electrical Engineering, Iran Neural Technology Research Centre, Iran University of Science and Technology, Tehran, Iran
- Abbas Erfanian
- Department of Biomedical Engineering, School of Electrical Engineering, Iran Neural Technology Research Centre, Iran University of Science and Technology, Tehran, Iran
- School of Cognitive Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
22. Speaker Recognition Using Constrained Convolutional Neural Networks in Emotional Speech. Entropy (Basel) 2022; 24:414. [PMID: 35327924] [PMCID: PMC8947568] [DOI: 10.3390/e24030414]
Abstract
Speaker recognition is an important classification task, which can be solved using several approaches. Although building a speaker recognition model on a closed set of speakers under neutral speaking conditions is a well-researched task and there are solutions that provide excellent performance, the classification accuracy of developed models significantly decreases when applying them to emotional speech or in the presence of interference. Furthermore, deep models may require a large number of parameters, so constrained solutions are desirable in order to implement them on edge devices in Internet of Things systems for real-time detection. The aim of this paper is to propose a simple and constrained convolutional neural network for speaker recognition tasks and to examine its robustness for recognition in emotional speech conditions. We examine three quantization methods for developing a constrained network: eight-bit floating-point format, ternary scalar quantization, and binary scalar quantization. The results are demonstrated on the recently recorded SEAC dataset.
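The sketch below illustrates the two scalar quantization schemes named above: binary quantization keeps only the sign of each weight (scaled by the mean magnitude), while ternary quantization also maps small weights to zero; the threshold heuristic is a common choice and an assumption, not necessarily the paper's.

```python
# Sketch of binary and ternary scalar weight quantization.
# The 0.7 threshold factor is a common heuristic, assumed for illustration.
import numpy as np

def binarize(w):
    """Map weights to {-a, +a}, with a = mean absolute value."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)

def ternarize(w, delta_scale=0.7):
    """Map weights to {-a, 0, +a}; values below the threshold become zero."""
    delta = delta_scale * np.abs(w).mean()
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.random.randn(4, 4).astype(np.float32)
print(binarize(w))
print(ternarize(w))
```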
23. Bekmanova G, Yergesh B, Sharipbay A, Mukanova A. Emotional Speech Recognition Method Based on Word Transcription. Sensors (Basel) 2022; 22:1937. [PMID: 35271083] [PMCID: PMC8915129] [DOI: 10.3390/s22051937]
Abstract
The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning due to COVID-19. The purpose of this method is to recognize emotions in spoken speech using a knowledge base of emotionally charged words, which is stored as a code book. The method analyzes human speech for the presence of emotions. To assess the quality of the method, an experiment was conducted on 420 audio recordings. The accuracy of the proposed method is 79.7% for the Kazakh language. The method can be used for different languages and consists of the following tasks: capturing a signal, detecting speech in it, recognizing speech words in a simplified transcription, determining word boundaries, comparing the simplified transcription with the code book, and constructing a hypothesis about the degree of speech emotionality. When emotions are present, full word recognition is performed and the emotions in the speech are identified. The advantage of this method is the possibility of widespread use, since it is not demanding on computational resources. The described method can be applied when there is a need to recognize positive and negative emotions in a crowd, in public transport, schools, universities, etc. The experiment carried out has shown the effectiveness of this method. The results obtained will make it possible in the future to develop devices that begin to record and recognize a speech signal upon detecting negative emotions in spoken speech and, if necessary, transmit a message about potential threats or riots.
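A toy sketch of the code-book matching step is shown below: words from a simplified transcription are compared against a dictionary of emotionally charged words, and the share of matches yields a hypothesis about emotionality; the example code book and threshold are invented for illustration.

```python
# Toy sketch of code-book matching over a simplified transcription.
# The example code book and threshold are invented for illustration.
CODE_BOOK = {
    "wonderful": "positive", "great": "positive", "thanks": "positive",
    "terrible": "negative", "angry": "negative", "unfair": "negative",
}

def emotionality_hypothesis(transcript, threshold=0.15):
    words = transcript.lower().split()
    hits = [CODE_BOOK[w] for w in words if w in CODE_BOOK]
    score = len(hits) / max(len(words), 1)
    if score < threshold:
        return "neutral", score
    # Majority vote over matched emotional words.
    label = max(set(hits), key=hits.count)
    return label, score

print(emotionality_hypothesis("this exam is terrible and so unfair"))
# ('negative', 0.2857...)
```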
Affiliation(s)
- Gulmira Bekmanova
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Banu Yergesh
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Altynbek Sharipbay
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Assel Mukanova
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Higher School of Information Technology and Engineering, Astana International University, Nur-Sultan 010000, Kazakhstan
24. Sreevidya P, Veni S, Ramana Murthy OV. Elder emotion classification through multimodal fusion of intermediate layers and cross-modal transfer learning. Signal Image Video Process 2022; 16:1281-1288. [PMID: 35069919] [PMCID: PMC8763433] [DOI: 10.1007/s11760-021-02079-x]
Abstract
The objective of this work is to develop an automated emotion recognition system specifically targeted at elderly people. A multi-modal system is developed that integrates information from audio and video modalities. The database selected for experiments is ElderReact, which contains 1323 video clips of 3 to 8 s duration of people above the age of 50. All six available emotions, Disgust, Anger, Fear, Happiness, Sadness and Surprise, are considered. To develop an automated emotion recognition system for aged adults, we explored different modeling techniques: features are extracted, and neural network models trained with backpropagation are developed. Further, for the raw video model, transfer learning from pretrained networks is attempted. Convolutional neural network and long short-term memory-based models were used, maintaining temporal continuity between frames while capturing the emotions. For the audio model, cross-modal transfer learning is applied. Both models are combined by fusion of intermediate layers, with the layers selected through a grid-based search algorithm. The accuracy and F1-score show that the proposed approach outperforms the state-of-the-art results. Classification across all emotions shows improvements over the baseline ranging from a minimum relative gain of 6.5% for happiness to a maximum of 46% for sadness.
Affiliation(s)
- P. Sreevidya
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
- S. Veni
- Department of Electronics and Communication Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
- O. V. Ramana Murthy
- Department of Electrical and Electronics Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
25. Sun J, Hu Y, Zou Y, Geng J, Wu Y, Fan R, Kang Z. Identification of pesticide residues on black tea by fluorescence hyperspectral technology combined with machine learning. Food Sci Technol 2022. [DOI: 10.1590/fst.55822]
Affiliation(s)
- Jie Sun
- Sichuan Agricultural University, China
- Yan Hu
- Sichuan Agricultural University, China
- Yulin Zou
- Sichuan Agricultural University, China
- Youli Wu
- Sichuan Agricultural University, China
26. Liu G, Zhang Q, Cao Y, Tian G, Ji Z. Online human action recognition with spatial and temporal skeleton features using a distributed camera network. Int J Intell Syst 2021. [DOI: 10.1002/int.22591]
Affiliation(s)
- Guoliang Liu
- School of Control Science and Engineering, Shandong University, Jinan, China
- Qinghui Zhang
- School of Control Science and Engineering, Shandong University, Jinan, China
- Yichao Cao
- School of Control Science and Engineering, Shandong University, Jinan, China
- Guohui Tian
- School of Control Science and Engineering, Shandong University, Jinan, China
- Ze Ji
- School of Engineering, Cardiff University, Cardiff, UK
27. Cai YW, Dong FF, Shi YH, Lu LY, Chen C, Lin P, Xue YS, Chen JH, Chen SY, Luo XB. Deep learning driven colorectal lesion detection in gastrointestinal endoscopic and pathological imaging. World J Clin Cases 2021; 9:9376-9385. [PMID: 34877273] [PMCID: PMC8610875] [DOI: 10.12998/wjcc.v9.i31.9376]
Abstract
Colorectal cancer has the second highest incidence of malignant tumors and is the fourth leading cause of cancer deaths in China. Early diagnosis and treatment of colorectal cancer will lead to an improvement in the 5-year survival rate, which will reduce medical costs. The current diagnostic methods for early colorectal cancer include excreta, blood, endoscopy, and computer-aided endoscopy. In this paper, research on image analysis and prediction of colorectal cancer lesions based on deep learning is reviewed with the goal of providing a reference for the early diagnosis of colorectal cancer lesions by combining computer technology, 3D modeling, 5G remote technology, endoscopic robot technology, and surgical navigation technology. The findings will supplement the research and provide insights to improve the cure rate and reduce the mortality of colorectal cancer.
Affiliation(s)
- Yu-Wen Cai
- Department of Clinical Medicine, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Fang-Fen Dong
- Department of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Yu-Heng Shi
- Computer Science and Engineering College, University of Alberta, Edmonton T6G 2R3, Canada
- Li-Yuan Lu
- Department of Clinical Medicine, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Chen Chen
- Department of Clinical Medicine, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Ping Lin
- Department of Clinical Medicine, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Yu-Shan Xue
- Department of Clinical Medicine, Fujian Medical University, Fuzhou 350004, Fujian Province, China
- Jian-Hua Chen
- Endoscopy Center, Fujian Cancer Hospital, Fujian Medical University Cancer Hospital, Fuzhou 350014, Fujian Province, China
- Su-Yu Chen
- Endoscopy Center, Fujian Cancer Hospital, Fujian Medical University Cancer Hospital, Fuzhou 350014, Fujian Province, China
- Xiong-Biao Luo
- Department of Computer Science, Xiamen University, Xiamen 361005, Fujian, China
| |
28
Xu X, Zhang L, Trovati M, Palmieri F, Asimakopoulou E, Johnny O, Bessis N. PERMS: An efficient rescue route planning system in disasters. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107667] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
29
Fernandes B, Mannepalli K. Enhanced Deep Hierarchical Long Short-Term Memory and Bidirectional Long Short-Term Memory for Tamil Emotional Speech Recognition using Data Augmentation and Spatial Features. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2021. [DOI: 10.47836/pjst.29.4.39] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
Abstract
Neural networks have become increasingly popular for language modelling, but within these large and deep models, overfitting and gradient problems remain important issues that heavily influence model performance. Since long short-term memory (LSTM) and bidirectional long short-term memory (BILSTM) networks individually handle long-term dependencies in sequential data, combining LSTM and BILSTM layers hierarchically adds reliability and helps minimise gradient, overfitting, and slow-learning issues. Hence, this paper develops four architectures: the Enhanced Deep Hierarchical LSTM & BILSTM (EDHLB), EDHBL, EDHLL, and EDHBB. Experimental evaluation of the deep hierarchical networks with spatial and temporal features gives good results for all four models. The average accuracy is 92.12% for EDHLB, 93.13% for EDHBL, 94.14% for EDHLL, and 93.19% for EDHBB, whereas the basic LSTM and BILSTM models reach 74% and 77%, respectively. Evaluating all the models, EDHBL performs better than the others, with an average efficiency of 94.14% and a good accuracy rate of 95.7%. On the collected Tamil emotional dataset, emotions such as happiness, fear, anger, sadness, and neutral reach 100% accuracy in a cross-fold matrix, disgust shows around 80% efficiency, and boredom 75% accuracy. The training and evaluation time used by EDHBL is also lower than that of the other models. The experimental analysis therefore shows EDHBL to be superior to the other models on the collected Tamil emotional dataset, attaining about 20% higher efficiency than the basic models.
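The hierarchical LSTM/BiLSTM idea behind these models can be illustrated with a minimal sketch: an LSTM layer feeding a bidirectional LSTM layer, with dropout between layers and a dense emotion classifier on top. The sketch below is written in PyTorch under assumed dimensions (40 features per frame, 7 emotion classes); it is not the authors' EDHBL configuration.

```python
import torch
import torch.nn as nn

class HierarchicalLSTMBiLSTM(nn.Module):
    """Minimal sketch: LSTM layer followed by a BiLSTM layer with dropout,
    then a dense classifier over the last time step (assumed dimensions)."""
    def __init__(self, n_features=40, hidden=128, n_classes=7, p_drop=0.3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.drop1 = nn.Dropout(p_drop)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.drop2 = nn.Dropout(p_drop)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        h = self.drop1(h)
        h, _ = self.bilstm(h)
        h = self.drop2(h[:, -1, :])  # use the final time step
        return self.fc(h)

model = HierarchicalLSTMBiLSTM()
dummy = torch.randn(8, 200, 40)      # 8 utterances, 200 frames, 40 features
print(model(dummy).shape)            # torch.Size([8, 7])
```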
30
Yao X, Sheng Z, Gu M, Wang H, Xu N, Liu X. Attention mechanism based LSTM in classification of stressed speech under workload. INTELL DATA ANAL 2021. [DOI: 10.3233/ida-205429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
Abstract
In order to improve the robustness of speech recognition systems, this study attempts to classify stressed speech caused by psychological stress under multitasking workloads. Due to the transient nature and ambiguity of stressed speech, the stress characteristics are not present in every segment of an utterance labeled as stressed. In this paper, we propose a multi-feature fusion model based on the attention mechanism to measure the importance of individual segments for stress classification. Through the attention mechanism, each speech frame is weighted to reflect its correlation with the actual stressed state, and multi-channel fusion of the features characterizing stressed speech is used to classify the speech under stress. The proposed model further adopts SpecAugment on the feature spectrum for data augmentation to address the small sample size of stressed speech. In the experiments, we compared the proposed model with traditional methods on the CASIA Chinese emotion corpus and the Fujitsu stressed speech corpus, and the results show that the proposed model performs better in speaker-independent stress classification. Transfer learning is also performed for speaker-dependent classification of stressed speech, and the performance is improved. The attention mechanism shows an advantage over traditional methods for continuous speech under stress in an authentic context.
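The frame-weighting step described here is essentially attention pooling over the outputs of a recurrent encoder: each frame receives a scalar score, the scores are normalised with a softmax, and the weighted sum forms the utterance-level representation. The following is a generic PyTorch sketch of that mechanism, not the paper's multi-feature fusion network; the feature dimension and the two-class output are assumptions.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Scores each frame, normalizes the scores with softmax, and returns
    the attention-weighted sum of frame features (a generic sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                      # frames: (batch, time, dim)
        weights = torch.softmax(self.score(frames), dim=1)   # (batch, time, 1)
        return (weights * frames).sum(dim=1), weights

encoder = nn.LSTM(input_size=40, hidden_size=64, batch_first=True)
pool = AttentionPooling(64)
classifier = nn.Linear(64, 2)                       # stressed vs. neutral (assumed)

x = torch.randn(4, 300, 40)                         # 4 utterances, 300 frames
h, _ = encoder(x)
utt, w = pool(h)
logits = classifier(utt)
print(logits.shape, w.shape)                        # (4, 2) (4, 300, 1)
```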
Affiliation(s)
- Xiao Yao, The College of IoT Engineering, Hohai University, Jiangsu, China
- Zhengyan Sheng, The College of IoT Engineering, Hohai University, Jiangsu, China
- Min Gu, Department of Stomatology, Affiliated Third Hospital of Soochow University, Suzhou, Jiangsu, China; The First People’s Hospital of Changzhou, Changzhou, Jiangsu, China
- Haibin Wang, The College of IoT Engineering, Hohai University, Jiangsu, China
- Ning Xu, The College of IoT Engineering, Hohai University, Jiangsu, China
- Xiaofeng Liu, The College of IoT Engineering, Hohai University, Jiangsu, China
31
Tursunov A, Mustaqeem, Choeh JY, Kwon S. Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms. SENSORS (BASEL, SWITZERLAND) 2021; 21:5892. [PMID: 34502785 PMCID: PMC8434188 DOI: 10.3390/s21175892] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/21/2021] [Revised: 08/30/2021] [Accepted: 08/30/2021] [Indexed: 11/16/2022]
Abstract
Speech signals are used as a primary input source in human-computer interaction (HCI) to develop applications such as automatic speech recognition (ASR), speech emotion recognition (SER), and gender and age recognition. Classifying speakers according to their age and gender is a challenging task in speech processing owing to the inability of current methods to extract salient high-level speech features and build effective classification models. To address these problems, we introduce a novel end-to-end age and gender recognition convolutional neural network (CNN) with a specially designed multi-attention module (MAM) that operates on speech signals. Our proposed model uses MAM to effectively extract spatial and temporal salient features from the input data. The MAM mechanism uses a rectangular filter as the kernel in its convolution layers and comprises two separate time and frequency attention mechanisms. The time attention branch learns to detect temporal cues, whereas the frequency attention module extracts the features most relevant to the target by focusing on spatial frequency features. The two extracted spatial and temporal feature sets complement one another and provide high performance in age and gender classification. The proposed age and gender classification system was tested using the Common Voice dataset and a locally developed Korean speech recognition dataset. Our model achieved 96%, 73%, and 76% accuracy for gender, age, and age-gender classification, respectively, on the Common Voice dataset. The Korean speech recognition dataset results were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. The prediction performance obtained in the experiments demonstrates the superiority and robustness of our proposed model for age, gender, and age-gender recognition from speech signals.
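The two-branch attention idea can be sketched roughly as follows: one branch attends over time and the other over frequency of a spectrogram feature map, using rectangular kernels, and the two attended maps are combined. This is an illustrative PyTorch sketch with assumed shapes and class count, not the published MAM implementation.

```python
import torch
import torch.nn as nn

class TimeFreqAttention(nn.Module):
    """Two attention branches over a (batch, channels, freq, time) feature map:
    one weights time steps, the other weights frequency bins (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        # Rectangular kernels: wide in time for the time branch,
        # tall in frequency for the frequency branch.
        self.time_att = nn.Conv2d(channels, 1, kernel_size=(1, 7), padding=(0, 3))
        self.freq_att = nn.Conv2d(channels, 1, kernel_size=(7, 1), padding=(3, 0))

    def forward(self, x):
        t = torch.sigmoid(self.time_att(x))   # (batch, 1, freq, time)
        f = torch.sigmoid(self.freq_att(x))
        return x * t + x * f                  # combine the two attended maps

backbone = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
attention = TimeFreqAttention(16)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 3))

spec = torch.randn(2, 1, 128, 300)            # 2 spectrograms: 128 mel bins, 300 frames
out = head(attention(backbone(spec)))
print(out.shape)                              # torch.Size([2, 3]) e.g., 3 age groups
```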
Affiliation(s)
- Anvarjon Tursunov, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
- Mustaqeem, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
- Joon Yeon Choeh, Intelligent Contents Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
- Soonil Kwon, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
32
Alves AAC, Andrietta LT, Lopes RZ, Bussiman FO, Silva FFE, Carvalheiro R, Brito LF, Balieiro JCDC, Albuquerque LG, Ventura RV. Integrating Audio Signal Processing and Deep Learning Algorithms for Gait Pattern Classification in Brazilian Gaited Horses. FRONTIERS IN ANIMAL SCIENCE 2021. [DOI: 10.3389/fanim.2021.681557] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
This study focused on assessing the usefulness of using audio signal processing in the gaited horse industry. A total of 196 short-time audio files (4 s) were collected from video recordings of Brazilian gaited horses. These files were converted into waveform signals (196 samples by 80,000 columns) and divided into training (N = 164) and validation (N = 32) datasets. Twelve single-valued audio features were initially extracted to summarize the training data according to the gait patterns (Marcha Batida—MB and Marcha Picada—MP). After preliminary analyses, high-dimensional arrays of the Mel Frequency Cepstral Coefficients (MFCC), Onset Strength (OS), and Tempogram (TEMP) were extracted and used as input information in the classification algorithms. A principal component analysis (PCA) was performed using the 12 single-valued features set and each audio-feature dataset—AFD (MFCC, OS, and TEMP) for prior data visualization. Machine learning (random forest, RF; support vector machine, SVM) and deep learning (multilayer perceptron neural networks, MLP; convolution neural networks, CNN) algorithms were used to classify the gait types. A five-fold cross-validation scheme with 10 repetitions was employed for assessing the models' predictive performance. The classification performance across models and AFD was also validated with independent observations. The models and AFD were compared based on the classification accuracy (ACC), specificity (SPEC), sensitivity (SEN), and area under the curve (AUC). In the logistic regression analysis, five out of the 12 audio features extracted were significant (p < 0.05) between the gait types. ACC averages ranged from 0.806 to 0.932 for MFCC, from 0.758 to 0.948 for OS and, from 0.936 to 0.968 for TEMP. Overall, the TEMP dataset provided the best classification accuracies for all models. The most suitable method for audio-based horse gait pattern classification was CNN. Both cross and independent validation schemes confirmed that high values of ACC, SPEC, SEN, and AUC are expected for yet-to-be-observed labels, except for MFCC-based models, in which clear overfitting was observed. Using audio-generated data for describing gait phenotypes in Brazilian horses is a promising approach, as the two gait patterns were correctly distinguished. The highest classification performance was achieved by combining CNN and the rhythmic-descriptive AFD.
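A simplified sketch of the audio-feature pipeline described above: extract MFCC, onset-strength, and tempogram arrays from a short clip, summarise them into a fixed-length vector, and fit a random-forest classifier. The file names, labels, and librosa default settings below are placeholders rather than the study's exact configuration.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def clip_features(path, sr=22050, duration=4.0):
    """Extract MFCC, onset-strength, and tempogram summaries from a 4-second clip
    (librosa defaults; a simplification of the study's audio-feature datasets)."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # (13, frames)
    oenv = librosa.onset.onset_strength(y=y, sr=sr)               # (frames,)
    temp = librosa.feature.tempogram(onset_envelope=oenv, sr=sr)  # (384, frames)
    # Summarize each array by its per-row mean and std to get a fixed-length vector.
    parts = [mfcc.mean(1), mfcc.std(1), [oenv.mean()], [oenv.std()],
             temp.mean(1), temp.std(1)]
    return np.concatenate([np.atleast_1d(p) for p in parts])

# Hypothetical file lists and labels (MB = Marcha Batida, MP = Marcha Picada).
train_files, train_labels = ["mb_001.wav", "mp_001.wav"], ["MB", "MP"]
X = np.stack([clip_features(f) for f in train_files])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, train_labels)
print(clf.predict(X))
```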
33
Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition. SENSORS 2021; 21:s21165452. [PMID: 34450894 PMCID: PMC8399720 DOI: 10.3390/s21165452] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/07/2021] [Revised: 07/30/2021] [Accepted: 08/10/2021] [Indexed: 11/17/2022]
Abstract
Emotion recognition is an important research field for human-computer interaction. Audio-video emotion recognition is now commonly addressed with deep neural network modeling tools. In published papers, authors typically show only cases where multi-modality is superior to audio-only or video-only modality. However, cases where a single modality is superior can also be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better performance in the existing late fusion and end-to-end multi-modal network training strategies. To take advantage of and overcome the deficiencies in both solutions, we define a multi-modal residual perceptron network which performs end-to-end learning from multi-modal network branches, yielding a better-generalizing multi-modal feature representation. For the proposed multi-modal residual perceptron network and the novel time augmentation for streaming digital movies, the state-of-the-art average recognition rate was improved to 91.4% for the Ryerson Audio-Visual Database of Emotional Speech and Song dataset and to 83.15% for the Crowd-Sourced Emotional Multi Modal Actors dataset. Moreover, the multi-modal residual perceptron network concept shows its potential for multi-modal applications dealing with signal sources not only of optical and acoustical types.
34
Fernandes B, Mannepalli K. Speech Emotion Recognition Using Deep Learning LSTM for Tamil Language. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2021. [DOI: 10.47836/pjst.29.3.33] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
Abstract
Deep neural networks (DNNs) are more than just neural networks with several hidden units; they give better results with classification algorithms in automated voice recognition tasks. Traditional feedforward neural networks do not capture the sequential structure of speech signals well, so recurrent neural networks (RNNs) were adopted. Long short-term memory (LSTM) networks are a special case of RNNs for speech processing that handle long-term dependencies; here, deep hierarchical LSTM and BiLSTM networks are designed with dropout layers to reduce gradient and long-term learning errors in emotional speech analysis. Four combinations of the deep hierarchical learning architecture are designed with dropout layers to improve the networks: Deep Hierarchical LSTM and LSTM (DHLL), Deep Hierarchical LSTM and BiLSTM (DHLB), Deep Hierarchical BiLSTM and LSTM (DHBL), and Deep Hierarchical dual BiLSTM (DHBB). The performance of all four models is compared in this paper, and good classification efficiency is attained with a minimal Tamil-language dataset. The experimental results show that DHLB reaches the best precision of about 84% in recognizing emotions for the Tamil database, while DHBL gives 83% efficiency. The other designs show comparable but lower performance: DHLL and DHBB reach 81% efficiency with the smaller dataset and minimal execution and training time.
35
Abstract
The field of mechanical fault diagnosis has entered the era of “big data”. However, existing diagnostic algorithms, which rely on manual feature extraction and expert knowledge, have poor feature-extraction ability and lack self-adaptability on mass data. In the fault diagnosis of rotating machinery, equipment faults occur only occasionally, so the proportion of fault samples is small, the samples are imbalanced, and available data are scarce, which leads to a low accuracy rate for intelligent diagnosis models trained to identify the equipment state. To solve these problems, an end-to-end diagnosis model is first proposed: an intelligent fault diagnosis method based on a one-dimensional convolutional neural network (1D-CNN), in which the original vibration signal is directly input into the model for identification. After that, by combining the convolutional neural network with generative adversarial networks, a data expansion method based on a one-dimensional deep convolutional generative adversarial network (1D-DCGAN) is constructed to generate fault samples for the small-sample classes and build a balanced dataset. Meanwhile, in order to solve the problem that the network is difficult to optimize, gradient penalty and Wasserstein distance are introduced. Tests on a bearing database and a hydraulic pump show that the one-dimensional convolution operation has strong feature-extraction ability for vibration signals. The proposed method is very accurate for fault diagnosis of the two kinds of equipment, and high-quality expansion of the original data can be achieved.
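The Wasserstein-distance-with-gradient-penalty idea (WGAN-GP) mentioned above can be sketched for one-dimensional vibration signals as follows: the critic is evaluated on random interpolations between real and generated samples, and the deviation of its gradient norm from 1 is penalised. This is a generic PyTorch sketch, not the paper's 1D-DCGAN code; the critic architecture and window length are assumptions.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(                      # toy 1D critic for vibration windows
    nn.Conv1d(1, 16, kernel_size=16, stride=4), nn.LeakyReLU(0.2),
    nn.Conv1d(16, 32, kernel_size=16, stride=4), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.LazyLinear(1))

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty: lam * ((||grad critic(x_hat)||_2 - 1)^2).mean() on interpolates."""
    eps = torch.rand(real.size(0), 1, 1)             # per-sample mixing coefficient
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return lam * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

real = torch.randn(8, 1, 1024)               # 8 real vibration windows (assumed length)
fake = torch.randn(8, 1, 1024)               # 8 generator outputs
print(gradient_penalty(critic, real, fake))
```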
36
Abstract
Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules also play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed in time in multimedia content. The attention mechanism has recently been incorporated in DNN architectures to also emphasise emotionally salient information. This paper provides a review of recent developments in SER and examines the impact of various attention mechanisms on SER performance. An overall comparison of system accuracies is performed on the widely used IEMOCAP benchmark database.
37
Sea Ice Classification of SAR Imagery Based on Convolution Neural Networks. REMOTE SENSING 2021. [DOI: 10.3390/rs13091734] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
We explore new and existing convolutional neural network (CNN) architectures for sea ice classification using Sentinel-1 (S1) synthetic aperture radar (SAR) data by investigating two key challenges: binary sea ice versus open-water classification, and a multi-class sea ice type classification. The analysis of sea ice in SAR images is challenging because of the thermal noise effects and ambiguities in the radar backscatter for certain conditions that include the reflection of complex information from sea ice surfaces. We use manually annotated SAR images containing various sea ice types to construct a dataset for our Deep Learning (DL) analysis. To avoid contamination between classes we use a combination of near-simultaneous SAR images from S1 and fine resolution cloud-free optical data from Sentinel-2 (S2). For the classification, we use data augmentation to adjust for the imbalance of sea ice type classes in the training data. The SAR images are divided into small patches which are processed one at a time. We demonstrate that the combination of data augmentation and training of a proposed modified Visual Geometric Group 16-layer (VGG-16) network, trained from scratch, significantly improves the classification performance, compared to the original VGG-16 model and an ad hoc CNN model. The experimental results show both qualitatively and quantitatively that our models produce accurate classification results.
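One common way to realise a "modified VGG-16 trained from scratch" on small image patches is to instantiate a randomly initialised VGG-16 and replace its classifier head with one sized for the sea ice classes. The sketch below assumes a recent torchvision (>= 0.13), a patch size of 224 x 224, and five ice classes; the authors' actual modifications may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

n_ice_classes = 5                              # assumed number of sea ice types
model = models.vgg16(weights=None)             # random init: trained from scratch
model.classifier[6] = nn.Linear(4096, n_ice_classes)   # replace the 1000-way head

# SAR patches replicated to 3 channels to match the VGG input convention.
patches = torch.randn(4, 3, 224, 224)          # 4 patches of 224x224 pixels (assumed)
logits = model(patches)
print(logits.shape)                            # torch.Size([4, 5])
```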
38
Ullah W, Ullah A, Hussain T, Khan ZA, Baik SW. An Efficient Anomaly Recognition Framework Using an Attention Residual LSTM in Surveillance Videos. SENSORS 2021; 21:s21082811. [PMID: 33923712 PMCID: PMC8072779 DOI: 10.3390/s21082811] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/25/2021] [Revised: 04/08/2021] [Accepted: 04/12/2021] [Indexed: 11/16/2022]
Abstract
Video anomaly recognition in smart cities is an important computer vision task that plays a vital role in smart surveillance and public safety but is challenging due to the diverse, complex, and infrequent occurrence of anomalies in real-time surveillance environments. Many deep learning models require significant amounts of training data, lack generalization ability, and have huge time complexity. To overcome these problems, in the current work we present an efficient, light-weight convolutional neural network (CNN)-based anomaly recognition framework that is functional in a surveillance environment with reduced time complexity. We extract spatial CNN features from a series of video frames and feed them to the proposed residual attention-based long short-term memory (LSTM) network, which can precisely recognize anomalous activity in surveillance videos. The representative CNN features with the residual-blocks concept in the LSTM for sequence learning prove to be effective for anomaly detection and recognition, validating our model’s effective usage in smart-city video surveillance. Extensive experiments on the real-world benchmark UCF-Crime dataset validate the effectiveness of the proposed model within complex surveillance environments and demonstrate that our proposed model outperforms state-of-the-art models with a 1.77%, 0.76%, and 8.62% increase in accuracy on the UCF-Crime, UMN, and Avenue datasets, respectively.
39
A Machine Learning Method for the Fine-Grained Classification of Green Tea with Geographical Indication Using a MOS-Based Electronic Nose. Foods 2021; 10:foods10040795. [PMID: 33917735 PMCID: PMC8068162 DOI: 10.3390/foods10040795] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2021] [Revised: 03/20/2021] [Accepted: 03/30/2021] [Indexed: 11/16/2022] Open
Abstract
Chinese green tea is known for its health-functional properties. There are many green tea categories, which have sub-categories with geographical indications (GTSGI). Several high-quality GTSGI planted in specific areas are labeled as famous GTSGI (FGTSGI) and are expensive. However, the subtle differences between the categories complicate the fine-grained classification of the GTSGI. This study proposes a novel framework consisting of a convolutional neural network backbone (CNN backbone) and a support vector machine classifier (SVM classifier), namely, CNN-SVM for the classification of Maofeng green tea categories (six sub-categories) and Maojian green tea categories (six sub-categories) using electronic nose data. A multi-channel input matrix was constructed for the CNN backbone to extract deep features from different sensor signals. An SVM classifier was employed to improve the classification performance due to its high discrimination ability for small sample sizes. The effectiveness of this framework was verified by comparing it with four other machine learning models (SVM, CNN-Shi, CNN-SVM-Shi, and CNN). The proposed framework had the best performance for classifying the GTSGI and identifying the FGTSGI. The high accuracy and strong robustness of the CNN-SVM show its potential for the fine-grained classification of multiple highly similar teas.
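The CNN-backbone-plus-SVM pattern can be sketched generically: a small convolutional network maps multi-channel sensor matrices to deep feature vectors, and an SVM is fitted on those features. The channel count, sample sizes, and class labels below are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.svm import SVC

backbone = nn.Sequential(                      # toy CNN backbone for e-nose matrices
    nn.Conv1d(10, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(8), nn.Flatten())     # -> 32 * 8 = 256-d deep features

def deep_features(x):
    with torch.no_grad():
        return backbone(x).numpy()

# Hypothetical data: 20 samples, 10 MOS sensor channels, 120 time points, 12 tea classes.
X = torch.randn(20, 10, 120)
y = np.random.randint(0, 12, size=20)
feats = deep_features(X)
svm = SVC(kernel="rbf", C=1.0).fit(feats, y)   # SVM classifier on CNN features
print(svm.predict(feats[:5]))
```

In practice the backbone would be trained first (for example with a temporary softmax head) before its pooled features are handed to the SVM.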
40
Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast. ELECTRONICS 2021. [DOI: 10.3390/electronics10070827] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Music and speech detection provides valuable information regarding the nature of content in broadcast audio. It helps detect acoustic regions that contain speech, voice over music, only music, or silence. In recent years, there have been developments in machine learning algorithms to accomplish this task. However, broadcast audio is generally well-mixed and copyrighted, which makes it challenging to share across research groups. In this study, we address the challenges encountered in automatically synthesising data that resembles a radio broadcast. Firstly, we compare state-of-the-art neural network architectures such as CNN, GRU, LSTM, TCN, and CRNN. Secondly, we investigate how audio ducking of background music impacts the precision and recall of the machine learning algorithm. Thirdly, we examine how the quantity of synthetic training data impacts the results. Finally, we evaluate the effectiveness of synthesised, real-world, and combined approaches for training models, to understand if the synthetic data presents any additional value. Amongst the network architectures, CRNN was the best performing network. Results also show that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that of human listeners. After testing our model on in-house and public datasets, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative.
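The audio-ducking step can be illustrated with a simple mixing routine: background music is attenuated by a fixed number of decibels wherever speech is overlaid. The numpy sketch below uses toy signals and an assumed ducking level; real broadcast synthesis would also handle fades, segment scheduling, and loudness normalisation.

```python
import numpy as np

def mix_with_ducking(speech, music, duck_db=-15.0):
    """Overlay speech on music, attenuating the music by `duck_db` wherever
    speech is present (a simplified synthesis of 'voice over music' regions)."""
    n = min(len(speech), len(music))
    speech, music = speech[:n], music[:n].copy()
    gain = 10.0 ** (duck_db / 20.0)             # dB -> linear amplitude factor
    active = np.abs(speech) > 1e-4              # crude speech-activity mask
    music[active] *= gain
    mix = speech + music
    return mix / max(1.0, np.max(np.abs(mix)))  # avoid clipping

sr = 16000
t = np.arange(sr * 4) / sr                      # 4 seconds of toy audio
music = 0.5 * np.sin(2 * np.pi * 220 * t)       # placeholder "music"
speech = np.zeros_like(music)
speech[sr:3 * sr] = 0.3 * np.sin(2 * np.pi * 150 * t[sr:3 * sr])  # placeholder "speech"
print(mix_with_ducking(speech, music).shape)
```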
41
Patel N, Patel S, Mankad SH. Impact of autoencoder based compact representation on emotion detection from audio. JOURNAL OF AMBIENT INTELLIGENCE AND HUMANIZED COMPUTING 2021; 13:867-885. [PMID: 33686349 PMCID: PMC7927770 DOI: 10.1007/s12652-021-02979-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/17/2020] [Accepted: 02/15/2021] [Indexed: 06/12/2023]
Abstract
Emotion recognition from speech has its fair share of applications, and consequently extensive research has been done in this interesting field over the past few years. However, many of the existing solutions are not yet ready for real-time applications. In this work, we propose a compact representation of audio using conventional autoencoders for dimensionality reduction and test the approach on two publicly available benchmark datasets. Such compact and simple classification systems, where the computing cost is low and memory is managed efficiently, may be more useful for real-time applications. The system is evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Toronto Emotional Speech Set (TESS). Three classifiers, namely support vector machines (SVM), a decision tree classifier, and convolutional neural networks (CNN), have been implemented to judge the impact of the approach. The results obtained by attempting classification with AlexNet and ResNet50 are also reported. The observations show that introducing autoencoders can indeed improve the classification accuracy of the emotion in the input audio files. It can be concluded that, in emotion recognition from speech, the choice and application of dimensionality reduction of audio features impacts the results achieved; therefore, by working on this aspect of the general speech emotion recognition model, it may be possible to make further improvements in the future.
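A minimal sketch of the compact-representation idea, assuming a 193-dimensional utterance feature vector and a 32-dimensional bottleneck: a small fully connected autoencoder is trained for reconstruction, and a conventional classifier (an SVM here) is then trained on the bottleneck codes.

```python
import torch
import torch.nn as nn
import numpy as np
from sklearn.svm import SVC

class AudioAutoencoder(nn.Module):
    """Compress a 193-d audio feature vector (assumed) to a 32-d code and back."""
    def __init__(self, in_dim=193, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 96), nn.ReLU(),
                                     nn.Linear(96, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 96), nn.ReLU(),
                                     nn.Linear(96, in_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

X = torch.randn(64, 193)                        # hypothetical utterance feature vectors
y = np.random.randint(0, 8, size=64)            # 8 RAVDESS-style emotion labels
ae = AudioAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
for _ in range(200):                            # brief reconstruction training
    recon, _ = ae(X)
    loss = nn.functional.mse_loss(recon, X)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    codes = ae.encoder(X).numpy()
clf = SVC().fit(codes, y)                       # classify emotions from compact codes
print(clf.score(codes, y))
```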
Affiliation(s)
- Nivedita Patel, CSE Department, Institute of Technology, Nirma University, Ahmedabad, India
- Shireen Patel, CSE Department, Institute of Technology, Nirma University, Ahmedabad, India
- Sapan H. Mankad, CSE Department, Institute of Technology, Nirma University, Ahmedabad, India
42
Al-Saegh A, Dawwd SA, Abdul-Jabbar JM. Deep learning for motor imagery EEG-based classification: A review. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2020.102172] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
43
Seo M, Kim M. Fusing Visual Attention CNN and Bag of Visual Words for Cross-Corpus Speech Emotion Recognition. SENSORS (BASEL, SWITZERLAND) 2020; 20:E5559. [PMID: 32998382 PMCID: PMC7583996 DOI: 10.3390/s20195559] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Revised: 09/25/2020] [Accepted: 09/26/2020] [Indexed: 11/16/2022]
Abstract
Speech emotion recognition (SER) classifies emotions using low-level features or a spectrogram of an utterance. When SER methods are trained and tested using different datasets, they have shown performance reduction. Cross-corpus SER research identifies speech emotion using different corpora and languages. Recent cross-corpus SER research has been conducted to improve generalization. To improve the cross-corpus SER performance, we pretrained the log-mel spectrograms of the source dataset using our designed visual attention convolutional neural network (VACNN), which has a 2D CNN base model with channel- and spatial-wise visual attention modules. To train the target dataset, we extracted the feature vector using a bag of visual words (BOVW) to assist the fine-tuned model. Because visual words represent local features in the image, the BOVW helps VACNN to learn global and local features in the log-mel spectrogram by constructing a frequency histogram of visual words. The proposed method shows an overall accuracy of 83.33%, 86.92%, and 75.00% in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EmoDB), and Surrey Audio-Visual Expressed Emotion (SAVEE), respectively. Experimental results on RAVDESS, EmoDB, SAVEE demonstrate improvements of 7.73%, 15.12%, and 2.34% compared to existing state-of-the-art cross-corpus SER approaches.
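The bag-of-visual-words step can be sketched independently of the CNN: local descriptors from the log-mel spectrogram are clustered with k-means, and each spectrogram is then represented by a normalised histogram over the learned visual words. Patch size, vocabulary size, and the raw-patch descriptor below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_patches(spectrogram, patch=8, stride=4):
    """Slice a (mel_bins, frames) log-mel spectrogram into flattened local patches."""
    bins, frames = spectrogram.shape
    out = [spectrogram[i:i + patch, j:j + patch].ravel()
           for i in range(0, bins - patch + 1, stride)
           for j in range(0, frames - patch + 1, stride)]
    return np.array(out)

rng = np.random.default_rng(0)
corpus = [rng.normal(size=(64, 128)) for _ in range(10)]     # 10 toy log-mel spectrograms

# 1) Learn the visual vocabulary from all local descriptors.
all_desc = np.vstack([local_patches(s) for s in corpus])
vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(all_desc)

# 2) Represent one spectrogram as a normalized histogram of visual words.
words = vocab.predict(local_patches(corpus[0]))
hist = np.bincount(words, minlength=50).astype(float)
hist /= hist.sum()
print(hist.shape)                                            # (50,)
```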
Affiliation(s)
- Myungho Kim, Department of Software Convergence, Soongsil University, 369, Sangdo-ro, Dongjak-gu, Seoul 06978, Korea
44
Anvarjon T, Mustaqeem, Kwon S. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features. SENSORS (BASEL, SWITZERLAND) 2020; 20:E5212. [PMID: 32932723 PMCID: PMC7570673 DOI: 10.3390/s20185212] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/20/2020] [Revised: 09/09/2020] [Accepted: 09/10/2020] [Indexed: 01/09/2023]
Abstract
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, speech emotion recognition (SER) systems evaluate the emotional state of the speaker by investigating his/her speech signal. Emotion recognition is a challenging task for a machine, and making it smart enough that emotions are efficiently recognized by AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates are very low depending on the languages, the emotions, and the databases. In this paper, we propose a new lightweight, effective SER model that has low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features by using a plain rectangular filter with a modified pooling strategy, which together have more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmarks, the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB) datasets, and obtained 77.01% and 92.02% recognition accuracy, respectively. The experimental results demonstrate that the proposed CNN-based SER system achieves better recognition performance than state-of-the-art SER systems.
Affiliation(s)
- Tursunov Anvarjon, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
- Mustaqeem, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
- Soonil Kwon, Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
45
Abstract
Applications with large-scale data are processed on distributed systems such as Spark, as they are data- and computation-intensive. Predicting the performance of such applications is difficult because they are influenced by various configuration aspects, from the distributed framework level to the application level. In this paper, we propose a machine learning-based completion time prediction model for the representative deep learning model, the convolutional neural network (CNN), by analyzing the effects of data, task, and resource characteristics on performance when executing the model in a Spark cluster. To reduce the time spent collecting data for training the model, we consider the causal relationship between the model features and the completion time based on Spark CNN’s distributed data-parallel model. The model features include the configurations of the Data Center OS Mesos environment, the configurations of Apache Spark, and the configurations of the CNN model. By applying the proposed model to well-known CNN implementations, we achieved 99.98% prediction accuracy in estimating the job completion time. In addition to the downscaled search area for the model features, we leverage extrapolation, which significantly reduces the model build time, at most to 89%, with even better prediction accuracy in comparison with the actual work.
46
Raheel A, Majid M, Alnowami M, Anwar SM. Physiological Sensors Based Emotion Recognition While Experiencing Tactile Enhanced Multimedia. SENSORS 2020; 20:s20144037. [PMID: 32708056 PMCID: PMC7411620 DOI: 10.3390/s20144037] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 05/12/2020] [Accepted: 05/14/2020] [Indexed: 12/18/2022]
Abstract
Emotion recognition has increased the potential of affective computing by providing instant feedback from users and thereby enabling a better understanding of their behavior. Physiological sensors have been used to recognize human emotions in response to audio and video content that engages single (auditory) and multiple (auditory and vision) human senses, respectively. In this study, human emotions were recognized using physiological signals observed in response to tactile enhanced multimedia content that engages three (tactile, vision, and auditory) human senses. The aim was to give users an enhanced real-world sensation while engaging with multimedia content. To this end, four videos were selected and synchronized with an electric fan and a heater, based on timestamps within the scenes, to generate tactile enhanced content with cold and hot air effects, respectively. Physiological signals, i.e., electroencephalography (EEG), photoplethysmography (PPG), and galvanic skin response (GSR), were recorded using commercially available sensors while participants experienced these tactile enhanced videos. The precision of the acquired physiological signals (EEG, PPG, and GSR) is enhanced using pre-processing with a Savitzky-Golay smoothing filter. Frequency domain features (rational asymmetry, differential asymmetry, and correlation) from EEG, time domain features (variance, entropy, kurtosis, and skewness) from GSR, and heart rate and heart rate variability from PPG data are extracted. The K nearest neighbor classifier is applied to the extracted features to classify four emotions (happy, relaxed, angry, and sad). Our experimental results show that among individual modalities, PPG-based features give the highest accuracy of 78.57% compared with EEG- and GSR-based features. The fusion of EEG, GSR, and PPG features further improved the classification accuracy to 79.76% (for four emotions) when interacting with tactile enhanced multimedia.
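The smoothing and classification steps lend themselves to a compact sketch: smooth each physiological channel with a Savitzky-Golay filter, compute simple time-domain statistics, and classify with k-nearest neighbours. The window length, polynomial order, feature subset, and labels below are illustrative assumptions rather than the study's exact parameters.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.stats import kurtosis, skew
from sklearn.neighbors import KNeighborsClassifier

def gsr_features(signal, window=51, poly=3):
    """Savitzky-Golay smoothing followed by variance, kurtosis, and skewness
    (a subset of the time-domain features mentioned above)."""
    smooth = savgol_filter(signal, window_length=window, polyorder=poly)
    return np.array([smooth.var(), kurtosis(smooth), skew(smooth)])

rng = np.random.default_rng(1)
signals = rng.normal(size=(40, 2000))            # 40 hypothetical GSR recordings
labels = rng.integers(0, 4, size=40)             # happy, relaxed, angry, sad

X = np.stack([gsr_features(s) for s in signals])
knn = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
print(knn.predict(X[:5]))
```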
Affiliation(s)
- Aasim Raheel, Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan
- Muhammad Majid, Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan (corresponding author)
- Majdi Alnowami, Department of Nuclear Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Syed Muhammad Anwar, Department of Software Engineering, University of Engineering and Technology, Taxila 47050, Pakistan
47
A Multiscale Spatio-Temporal Convolutional Deep Belief Network for Sensor Fault Detection of Wind Turbine. SENSORS 2020; 20:s20123580. [PMID: 32599907 PMCID: PMC7349861 DOI: 10.3390/s20123580] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Revised: 06/15/2020] [Accepted: 06/22/2020] [Indexed: 12/26/2022]
Abstract
Sensor fault detection of wind turbines plays an important role in improving the reliability and stable operation of turbines. The supervisory control and data acquisition (SCADA) system of a wind turbine provides promising insights into sensor fault detection due to the accessibility of the data and the abundance of sensor information. However, SCADA data are essentially multivariate time series with inherent spatio-temporal correlation characteristics, which has not been well considered in the existing wind turbine fault detection research. This paper proposes a novel classification-based fault detection method for wind turbine sensors. To better capture the spatio-temporal characteristics hidden in SCADA data, a multiscale spatio-temporal convolutional deep belief network (MSTCDBN) was developed to perform feature learning and classification to fulfill the sensor fault detection. A major superiority of the proposed method is that it can not only learn the spatial correlation information between several different variables but also capture the temporal characteristics of each variable. Furthermore, this method with multiscale learning capability can excavate interactive characteristics between variables at different scales of filters. A generic wind turbine benchmark model was used to evaluate the proposed approach. The comparative results demonstrate that the proposed method can significantly enhance the fault detection performance.
48
Ullah H, Zia O, Kim JH, Han K, Lee JW. Automatic 360° Mono-Stereo Panorama Generation Using a Cost-Effective Multi-Camera System. SENSORS 2020; 20:s20113097. [PMID: 32486231 PMCID: PMC7309002 DOI: 10.3390/s20113097] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/02/2020] [Revised: 05/11/2020] [Accepted: 05/27/2020] [Indexed: 11/16/2022]
Abstract
In recent years, 360° videos have gained the attention of researchers due to their versatility and applications in real-world problems. Also, easy access to different visual sensor kits and easily deployable image acquisition devices have played a vital role in the growth of interest in this area by the research community. Recently, several 360° panorama generation systems have demonstrated reasonable quality generated panoramas. However, these systems are equipped with expensive image sensor networks where multiple cameras are mounted in a circular rig with specific overlapping gaps. In this paper, we propose an economical 360° panorama generation system that generates both mono and stereo panoramas. For mono panorama generation, we present a drone-mounted image acquisition sensor kit that consists of six cameras placed in a circular fashion with optimal overlapping gap. The hardware of our proposed image acquisition system is configured in such way that no user input is required to stitch multiple images. For stereo panorama generation, we propose a lightweight, cost-effective visual sensor kit that uses only three cameras to cover 360° of the surroundings. We also developed stitching software that generates both mono and stereo panoramas using a single image stitching pipeline where the panorama generated by our proposed system is automatically straightened without visible seams. Furthermore, we compared our proposed system with existing mono and stereo contents generation systems in both qualitative and quantitative perspectives, and the comparative measurements obtained verified the effectiveness of our system compared to existing mono and stereo generation systems.
Affiliation(s)
- Hayat Ullah, Mixed Reality and Interaction Lab, Department of Software, Sejong University, Seoul 143-747, Korea
- Osama Zia, Mixed Reality and Interaction Lab, Department of Software, Sejong University, Seoul 143-747, Korea
- Jun Ho Kim, Department of Electrical Information Control, Dong Seoul University, Seongnam 461-140, Korea
- Kyungjin Han, Mixed Reality and Interaction Lab, Department of Software, Sejong University, Seoul 143-747, Korea
- Jong Weon Lee, Mixed Reality and Interaction Lab, Department of Software, Sejong University, Seoul 143-747, Korea (corresponding author)
49
Khan ZA, Hussain T, Ullah A, Rho S, Lee M, Baik SW. Towards Efficient Electricity Forecasting in Residential and Commercial Buildings: A Novel Hybrid CNN with a LSTM-AE based Framework. SENSORS 2020; 20:s20051399. [PMID: 32143371 PMCID: PMC7085604 DOI: 10.3390/s20051399] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/30/2020] [Revised: 02/28/2020] [Accepted: 02/28/2020] [Indexed: 12/19/2022]
Abstract
Due to industrialization and the rising demand for energy, global energy consumption has been increasing rapidly. Recent studies show that the biggest portion of energy is consumed in residential buildings; in European Union countries, up to 40% of the total energy is consumed by households. Most residential buildings and industrial zones are equipped with smart sensors, such as electric metering sensors, that are inadequately utilized for better energy management. In this paper, we develop a hybrid convolutional neural network (CNN) with a long short-term memory autoencoder (LSTM-AE) model for future energy prediction in residential and commercial buildings. The central focus of this research work is to utilize the smart meters’ data for energy forecasting in order to enable appropriate energy management in buildings. We performed extensive research using several deep learning-based forecasting models and propose an optimal hybrid CNN with LSTM-AE model. To the best of our knowledge, we are the first to incorporate the aforementioned models under the umbrella of a unified framework with some utility preprocessing. Initially, the CNN model extracts features from the input data, which are then fed to the LSTM encoder to generate encoded sequences. The encoded sequences are decoded by a following LSTM decoder and passed to the final dense layer for energy prediction. The experimental results using different evaluation metrics show that the proposed hybrid model works well. It also records the smallest mean square error (MSE), mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) when compared with other state-of-the-art forecasting methods on the UCI residential building dataset. Furthermore, we conducted experiments on Korean commercial building data, and the results indicate that our proposed hybrid model is a worthy contribution to energy forecasting.
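The hybrid structure described here can be sketched as a 1D convolution over the input window, an LSTM encoder producing an encoded sequence, an LSTM decoder, and a dense output layer. The window length, channel counts, and forecast horizon below are assumptions for illustration; this is not the authors' released model.

```python
import torch
import torch.nn as nn

class CNNLSTMAE(nn.Module):
    """Sketch of a CNN + LSTM encoder-decoder forecaster for energy time series."""
    def __init__(self, n_vars=1, conv_ch=16, hidden=64, horizon=1):
        super().__init__()
        self.conv = nn.Conv1d(n_vars, conv_ch, kernel_size=3, padding=1)
        self.encoder = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, horizon)

    def forward(self, x):                  # x: (batch, time, n_vars)
        f = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        enc, _ = self.encoder(f)           # encoded sequence
        dec, _ = self.decoder(enc)         # decoded sequence
        return self.out(dec[:, -1, :])     # predict the next value(s)

model = CNNLSTMAE()
window = torch.randn(32, 60, 1)            # 32 windows of 60 past consumption readings
pred = model(window)
print(pred.shape)                          # torch.Size([32, 1])
```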