1. Oguike O, Primus M. A dataset for multimodal music information retrieval of Sotho-Tswana musical videos. Data Brief 2024;55:110672. [PMID: 39071970] [PMCID: PMC11282976] [DOI: 10.1016/j.dib.2024.110672]
Abstract
Traditional machine learning and deep learning models are central to multimodal music information retrieval (MIR) applications such as multimodal music sentiment analysis, genre classification, recommender systems, and emotion recognition. Solving these tasks in a data-driven manner, however, depends on the availability of high-quality benchmark datasets, so datasets tailored to multimodal MIR applications are essential. While a handful of multimodal datasets exist for distinct MIR applications, none are available in low-resourced languages such as the Sotho-Tswana languages. In response to this gap, we introduce a novel multimodal MIR dataset centred on Sotho-Tswana musical videos, encompassing textual, visual, and audio modalities specific to Sotho-Tswana musical content. The musical videos were downloaded from YouTube, and Python programs were written to process them and extract relevant spectral-based acoustic features using different Python libraries. The dataset was annotated manually by native speakers of Sotho-Tswana languages who understand the culture and traditions of the Sotho-Tswana people. To our knowledge, no such dataset has been established before now.
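The abstract above mentions extracting spectral-based acoustic features from the downloaded audio with Python libraries. Below is a minimal sketch of that kind of extraction, assuming librosa as the library; the file name and the particular feature set are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: spectral-feature extraction with librosa (assumed library).
# The file name and feature choices are illustrative, not the authors' pipeline.
import librosa
import numpy as np

def extract_spectral_features(audio_path: str) -> dict:
    """Load a track and compute common spectral descriptors."""
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    return {
        "mfcc_mean": np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1),
        "centroid_mean": float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))),
        "rolloff_mean": float(np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr))),
        "chroma_mean": np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1),
    }

features = extract_spectral_features("musical_video_audio.wav")  # hypothetical file
print({k: np.shape(v) for k, v in features.items()})
```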
Affiliation(s)
- Osondu Oguike
- Institute for Intelligent Systems, University of Johannesburg, JBS Park, 69 Kingsway Avenue, Auckland Park, Johannesburg, South Africa
- Mpho Primus
- Institute for Intelligent Systems, University of Johannesburg, JBS Park, 69 Kingsway Avenue, Auckland Park, Johannesburg, South Africa
2. Wang L. Multimodal robotic music performance art based on GRU-GoogLeNet model fusing audiovisual perception. Front Neurorobot 2024;17:1324831. [PMID: 38351965] [PMCID: PMC10861776] [DOI: 10.3389/fnbot.2023.1324831]
Abstract
The field of multimodal robotic musical performing arts has garnered significant interest due to its innovative potential. Conventional robots face limitations in understanding emotions and artistic expression in musical performances. Therefore, this paper explores the application of multimodal robots that integrate visual and auditory perception to enhance the quality and artistic expression in music performance. Our approach involves integrating GRU (Gated Recurrent Unit) and GoogLeNet models for sentiment analysis. The GRU model processes audio data and captures the temporal dynamics of musical elements, including long-term dependencies, to extract emotional information. The GoogLeNet model excels in image processing, extracting complex visual details and aesthetic features. This synergy deepens the understanding of musical and visual elements, aiming to produce more emotionally resonant and interactive robot performances. Experimental results demonstrate the effectiveness of our approach, showing significant improvements in music performance by multimodal robots. These robots, equipped with our method, deliver high-quality, artistic performances that effectively evoke emotional engagement from the audience. Multimodal robots that merge audio-visual perception in music performance enrich the art form and offer diverse human-machine interactions. This research demonstrates the potential of multimodal robots in music performance, promoting the integration of technology and art. It opens new realms in performing arts and human-robot interactions, offering a unique and innovative experience. Our findings provide valuable insights for the development of multimodal robots in the performing arts sector.
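As a rough illustration of the GRU-plus-GoogLeNet fusion described above, the sketch below joins a GRU over audio features with a GoogLeNet image embedding through a simple late-fusion head; the layer sizes, feature dimensions, and fusion strategy are assumptions for illustration, not the paper's configuration.

```python
# Minimal late-fusion sketch in the spirit of the GRU + GoogLeNet approach above.
# Layer sizes, feature dimensions, and the fusion head are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class AudioVisualSentimentNet(nn.Module):
    def __init__(self, n_audio_features=40, n_classes=4):
        super().__init__()
        self.gru = nn.GRU(input_size=n_audio_features, hidden_size=128, batch_first=True)
        backbone = models.googlenet(weights=None, aux_logits=False, init_weights=True)
        backbone.fc = nn.Identity()          # expose the 1024-d visual embedding
        self.visual = backbone
        self.head = nn.Linear(128 + 1024, n_classes)

    def forward(self, audio_seq, frames):
        _, h = self.gru(audio_seq)           # h: (num_layers, batch, 128)
        v = self.visual(frames)              # v: (batch, 1024)
        return self.head(torch.cat([h[-1], v], dim=1))

model = AudioVisualSentimentNet()
logits = model(torch.randn(2, 50, 40), torch.randn(2, 3, 224, 224))
print(logits.shape)                          # torch.Size([2, 4])
```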
Affiliation(s)
- Lu Wang
- School of Preschool and Art Education, Xinyang Vocational and Technical College, Xinyang, China
3. Shen Q. The influence of music teaching appreciation on the mental health of college students based on multimedia data analysis. PeerJ Comput Sci 2023;9:e1589. [PMID: 37810333] [PMCID: PMC10557508] [DOI: 10.7717/peerj-cs.1589]
Abstract
The mental health of college students has gradually become a focus of public attention. University music appreciation courses are an effective channel for psychological counseling, and the role of music appreciation in psychological adjustment urgently needs to be explored. We therefore propose an emotion classification model based on particle swarm optimization (PSO) to study the effect of interactive music appreciation teaching on the mental health of college students. We first extract musical features as input; the extracted features are then used to generate subtitles describing the music. Finally, we weight these features, feed them into the network, tune the network with particle swarm optimization, and output the emotional class of the music. Experimental results show that the model achieves a high classification accuracy of 82.6% and can identify the emotional categories present in interactive music appreciation, which helps guide the mental health of college students in music appreciation teaching.
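A minimal particle-swarm-optimization loop of the kind the abstract refers to for tuning the classifier is sketched below; the quadratic objective is a stand-in (an assumption for illustration), not the paper's emotion-classification loss.

```python
# Minimal PSO sketch; the quadratic objective is a stand-in, not the paper's loss.
import numpy as np

def pso(objective, dim, n_particles=30, n_iters=100, w=0.7, c1=1.5, c2=1.5):
    rng = np.random.default_rng(0)
    pos = rng.uniform(-1.0, 1.0, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)]
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)]
    return gbest, pbest_val.min()

best, best_val = pso(lambda p: float(np.sum(p ** 2)), dim=5)   # stand-in objective
print(best_val)
```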
Affiliation(s)
- Qiangwei Shen
- School of Foreign Languages, Xinyang University, Xinyang, Henan, China
4. Koh EY, Cheuk KW, Heung KY, Agres KR, Herremans D. MERP: A Music Dataset with Emotion Ratings and Raters' Profile Information. Sensors (Basel) 2022;23:382. [PMID: 36616980] [PMCID: PMC9824842] [DOI: 10.3390/s23010382]
Abstract
Music is capable of conveying many emotions. The level and type of emotion perceived by a listener, however, is highly subjective. In this study, we present the Music Emotion Recognition with Profile information dataset (MERP). This database was collected through Amazon Mechanical Turk (MTurk) and features dynamic valence and arousal ratings of 54 selected full-length songs. The dataset contains music features as well as user profile information of the annotators. The songs were selected from the Free Music Archive using an innovative method (a Triple Neural Network with the OpenSmile toolkit) to identify 50 songs with the most distinctive emotions. Specifically, the songs were chosen to fully cover the four quadrants of the valence-arousal space. Four additional songs were selected from the DEAM dataset to act as a benchmark and to filter out low-quality ratings. A total of 452 participants took part in annotating the dataset, with 277 remaining after the dataset was thoroughly cleaned. Their demographic information, listening preferences, and musical background were recorded. We offer an extensive analysis of the resulting dataset, together with baseline emotion prediction models, a fully connected network and an LSTM, for the newly proposed MERP dataset.
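A minimal baseline of the kind the abstract mentions (an LSTM regressing dynamic valence and arousal per time step) might look like the sketch below; the feature dimension and hidden size are assumptions for illustration.

```python
# Minimal sketch of an LSTM baseline for dynamic valence-arousal regression.
# Feature dimension and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class VARegressor(nn.Module):
    def __init__(self, n_features=260, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2)        # per-frame (valence, arousal)

    def forward(self, x):                      # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        return self.out(h)                     # (batch, time, 2)

model = VARegressor()
pred = model(torch.randn(8, 120, 260))         # 8 songs, 120 time steps
print(pred.shape)                              # torch.Size([8, 120, 2])
```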
Affiliation(s)
- En Yan Koh
- Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
- Kin Wai Cheuk
- Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
- Kwan Yee Heung
- Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
- Kat R. Agres
- Yong Siew Toh Conservatory of Music, National University Singapore, Singapore 117376, Singapore
- Centre for Music and Health, National University Singapore, Singapore 117376, Singapore
- Dorien Herremans
- Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Singapore 487372, Singapore
5. Emotion Classification from Speech and Text in Videos Using a Multimodal Approach. Multimodal Technologies and Interaction 2022. [DOI: 10.3390/mti6040028]
Abstract
Emotion classification is a research area that has produced an intensive body of literature spanning natural language processing, multimedia data, semantic knowledge discovery, social network mining, and text and multimedia data mining. This paper addresses emotion classification and proposes a method for classifying the emotions expressed in multimodal data extracted from videos. The method models multimodal data as a sequence of features extracted from facial expressions, speech, gestures, and text, using a linguistic approach. Each sequence of multimodal data is assigned an emotion by modelling each emotion with a hidden Markov model. The trained models are evaluated on samples of multimodal sentences associated with seven basic emotions, and the experimental results demonstrate a good classification rate.
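The per-emotion hidden-Markov-model scheme described above can be sketched roughly as follows, assuming hmmlearn for the HMMs; the emotion subset, feature dimension, and synthetic training data are illustrative assumptions.

```python
# Minimal sketch: one Gaussian HMM per emotion, classification by max log-likelihood.
# Emotion subset, feature dimension, and synthetic data are illustrative assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_emotion_hmms(sequences_by_emotion, n_states=4):
    """Fit one Gaussian HMM per emotion on its training sequences."""
    models = {}
    for emotion, seqs in sequences_by_emotion.items():
        X = np.concatenate(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[emotion] = m
    return models

def classify(models, sequence):
    """Assign the emotion whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda e: models[e].score(sequence))

rng = np.random.default_rng(0)
emotions = ["anger", "joy", "sadness"]                    # subset, for illustration
train = {e: [rng.normal(i, 1.0, (30, 12)) for _ in range(5)]
         for i, e in enumerate(emotions)}
models = train_emotion_hmms(train)
print(classify(models, rng.normal(1.0, 1.0, (30, 12))))   # expected: "joy"
```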
6. Zeng X, Zhong Z. Multimodal Sentiment Analysis of Online Product Information Based on Text Mining Under the Influence of Social Media. J Organ End User Comput 2022. [DOI: 10.4018/joeuc.314786]
Abstract
With the dramatic increase in social media users and the growing variety of online product information, manual processing of this information is time-consuming and labour-intensive. Based on text mining of online information, this paper therefore analyzes text representation methods for online information, discusses the long short-term memory network, and constructs an interactive attention graph convolutional network (IAGCN) model, built on a graph convolutional neural network (GCNN) and an attention mechanism, for multimodal sentiment analysis (MSA) of online product information. The results show that the IAGCN model improves accuracy by 4.78% and the F1 value by 29.25% compared with a pure interactive attention network. The model performs best when the GCNN has two layers and uses syntactic position attention. This research has practical significance for MSA of online product information in social media.
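As a rough illustration of the graph-convolution component in the IAGCN described above, the sketch below implements the standard GCN propagation rule over a token graph; the dimensions and the random adjacency matrix are assumptions for illustration, and the interactive attention part is omitted.

```python
# Minimal sketch of a graph-convolution layer (standard GCN propagation rule).
# Dimensions and the random adjacency are illustrative; attention is omitted.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        a = adj + torch.eye(adj.size(0))                   # add self-loops
        d = a.sum(dim=1)
        a_norm = a / torch.sqrt(d.unsqueeze(0) * d.unsqueeze(1))  # D^-1/2 A D^-1/2
        return torch.relu(a_norm @ self.lin(x))

layer = GCNLayer(300, 128)                  # e.g., word embeddings over a syntax graph
x = torch.randn(10, 300)                    # 10 tokens
adj = (torch.rand(10, 10) > 0.7).float()    # illustrative adjacency
print(layer(x, adj).shape)                  # torch.Size([10, 128])
```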
Affiliation(s)
- Xiao Zeng
- Huazhong University of Science and Technology, China
- Ziqi Zhong
- The London School of Economics and Political Science, UK
7. Research on Music Style Classification Based on Deep Learning. Comput Math Methods Med 2022;2022:3699885. [PMID: 35087600] [PMCID: PMC8789415] [DOI: 10.1155/2022/3699885]
Abstract
Music style is an important label for music classification. Current music style classification methods extract features such as rhythm and timbre and use classifiers to perform the classification, but accuracy is not only affected by the classifier but also limited by the quality of feature extraction, leading to poor accuracy and stability. To address these shortcomings, a deep-learning-based music style classification method is studied. The music signal is framed using filters and Hamming windows, and MFCC features are extracted via the discrete Fourier transform. A convolutional recurrent neural network combining a CNN and an RNN is designed and trained to determine the parameters for music style classification. Analysis of the simulation data shows that the classification accuracy of the studied method is at least 93.3%, the classification time overhead is significantly reduced, and the classification results are stable and reliable.
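The MFCC front end the abstract describes (framing, Hamming windows, DFT) can be sketched as follows, using librosa as a stand-in implementation; the frame length, hop size, and file name are illustrative assumptions.

```python
# Minimal sketch of an MFCC front end (framing, Hamming window, DFT, mel filtering, DCT),
# with librosa as a stand-in implementation; frame sizes and file name are assumptions.
import librosa

def mfcc_frames(path, n_mfcc=13, frame_len=2048, hop=512):
    y, sr = librosa.load(path, sr=22050, mono=True)
    return librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=frame_len, hop_length=hop, window="hamming",
    )

mfcc = mfcc_frames("track.wav")              # hypothetical file
print(mfcc.shape)                            # (13, n_frames)
```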
8. Fang Z, Qian Y, Su C, Miao Y, Li Y. The Multimodal Sentiment Analysis of Online Product Marketing Information Using Text Mining and Big Data. J Organ End User Comput 2022. [DOI: 10.4018/joeuc.316124]
Abstract
The internet is increasingly popular, and more people share their feelings about all kinds of things online, so online product marketing information keeps growing. How to mine the required information from this mass of data with the support of big data technology has become a significant problem. Based on text mining of online product marketing information, this work therefore discusses text preprocessing methods and the temporal convolutional network (TCN), which builds on the convolutional neural network (CNN). On this basis, a multimodal attention mechanism (AM) and a cross-modal transformer structure are added to build an attention-based TCN (AM-TCN) model for analyzing the multimodal sentiment of online product marketing information. The results show that the accuracy of the AM-TCN model is 2.88% higher than that of the TCN model alone, and its F1 value is 3.47% higher. Moreover, the accuracy of the AM-TCN is 1.22% higher, and its F1 value 0.95% higher, than those of the next-best model, a recurrent multistage fusion network.
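As a rough illustration of the temporal-convolution backbone the AM-TCN builds on, the sketch below stacks causal dilated 1D convolutions with residual connections; the kernel size, dilation schedule, and channel width are assumptions, and the attention and cross-modal transformer parts are omitted.

```python
# Minimal sketch of a causal dilated temporal-convolution block with residuals.
# Kernel size, dilations, and channels are illustrative; attention is omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return torch.relu(out) + x                         # residual connection

tcn = nn.Sequential(*[CausalConvBlock(64, dilation=2 ** i) for i in range(4)])
print(tcn(torch.randn(2, 64, 100)).shape)                  # torch.Size([2, 64, 100])
```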
Affiliation(s)
- Zhuo Fang
- Changchun University of Finance and Economics, China
9.
Abstract
The paper presents an application for automatically classifying emotions in film music. A model of emotions is proposed in which emotions are also associated with colors: the model has nine emotional states, to which colors are assigned according to color theory in film. Subjective tests are carried out to check the correctness of the assumptions behind the adopted emotion model, and a statistical analysis of the subjective test results is performed. The application employs a deep convolutional neural network (CNN) that classifies emotions from 30 s excerpts of musical works presented to the network as mel-spectrograms. Examples of classification results from the selected neural networks used to create the system are shown.
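Preparing a 30 s excerpt as a mel-spectrogram for a CNN input, as described above, could look like the sketch below; the sample rate, mel-band count, and file name are illustrative assumptions.

```python
# Minimal sketch: turn a 30 s excerpt into a log-scaled mel-spectrogram for a CNN.
# Sample rate, mel-band count, and file name are illustrative assumptions.
import librosa
import numpy as np

def excerpt_to_melspectrogram(path, offset=0.0, duration=30.0, n_mels=128):
    y, sr = librosa.load(path, sr=22050, offset=offset, duration=duration)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)            # log scale for the CNN input

mel_db = excerpt_to_melspectrogram("film_cue.wav")          # hypothetical file
print(mel_db.shape)                                         # (128, n_frames)
```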
10. Pandeya YR, Bhattarai B, Lee J. Music video emotion classification using slow-fast audio-video network and unsupervised feature representation. Sci Rep 2021;11:19834. [PMID: 34615904] [PMCID: PMC8494760] [DOI: 10.1038/s41598-021-98856-2]
Abstract
Affective computing has been held back by the difficulty of precise annotation, because emotions are highly subjective and vague. Music video emotion is especially complex owing to the diverse textual, acoustic, and visual information, which can take the form of lyrics, the singer's voice, sounds from different instruments, and visual representations. This may be one reason why studies in this domain have been limited and no standard dataset had been produced before now. In this study, we propose an unsupervised method for music video emotion analysis using music video content from the Internet. We also produced a labelled dataset and compared supervised and unsupervised methods for emotion classification. The music and video information is processed through a multimodal architecture with audio-video information exchange and a boosting method. General 2D and 3D convolution networks are compared with a slow-fast network using filter- and channel-separable convolutions within the multimodal architecture. Several supervised and unsupervised networks were trained end to end, and the results were evaluated using various metrics. The proposed method uses a large dataset for unsupervised emotion classification and interprets the results quantitatively and qualitatively for music videos, which had not been done before. The results show a large increase in classification score when unsupervised features and information-sharing techniques are applied across the audio and video networks. Our best classifier attained 77% accuracy, an F1-score of 0.77, and an area under the curve of 0.94 with minimal computational cost.
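The slow-fast idea referenced above (two video pathways sampled at different temporal rates and fused) can be caricatured as in the sketch below; the channel widths, sampling ratio, and fusion head are assumptions for illustration, and the audio branch and information-exchange mechanism are omitted.

```python
# Minimal caricature of a slow-fast video model: two pathways at different frame
# rates, fused before classification. Widths and sampling ratio are assumptions.
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.slow = nn.Conv3d(3, 32, kernel_size=3, padding=1)   # few frames, more channels
        self.fast = nn.Conv3d(3, 8, kernel_size=3, padding=1)    # all frames, fewer channels
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.head = nn.Linear(32 + 8, n_classes)

    def forward(self, clip):                   # clip: (batch, 3, time, height, width)
        slow = clip[:, :, ::4]                 # temporally subsample for the slow path
        s = self.pool(torch.relu(self.slow(slow))).flatten(1)
        f = self.pool(torch.relu(self.fast(clip))).flatten(1)
        return self.head(torch.cat([s, f], dim=1))

model = TinySlowFast()
print(model(torch.randn(2, 3, 16, 64, 64)).shape)            # torch.Size([2, 6])
```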
Affiliation(s)
- Yagya Raj Pandeya
- Department of Computer Science and Engineering, Jeonbuk National University, Jeonju, South Korea.
- Bhuwan Bhattarai
- Department of Computer Science and Engineering, Jeonbuk National University, Jeonju, South Korea
- Joonwhoan Lee
- Department of Computer Science and Engineering, Jeonbuk National University, Jeonju, South Korea.
11. Elgharabawy A, Prasad M, Lin CT. Subgroup Preference Neural Network. Sensors (Basel) 2021;21:6104. [PMID: 34577312] [PMCID: PMC8471160] [DOI: 10.3390/s21186104]
Abstract
Subgroup label ranking, a new problem in preference learning, aims to rank groups of labels using a single ranking model. This paper introduces the Subgroup Preference Neural Network (SGPNN), which combines multiple networks with different activation functions, learning rates, and output layers into one artificial neural network (ANN) to discover the hidden relations between the subgroups' multi-labels. The SGPNN is a feedforward (FF), partially connected network with a single middle layer that uses a stairstep (SS) multi-valued activation function to enhance prediction probability and accelerate ranking convergence. The novel structure of the SGPNN includes a multi-activation function neuron (MAFN) in the middle layer to rank each subgroup independently. The SGPNN uses gradient ascent to maximize the Spearman ranking correlation between the groups of labels, and each label is represented by an output neuron with a single SS function. Trained on a conjoint dataset, the SGPNN outperforms other label ranking methods that use each dataset individually, achieving an average accuracy of 91.4%, compared with 60%, 84.8%, 69.2%, and 73% for supervised clustering, decision trees, multilayer perceptron label ranking, and label ranking forests, respectively, on the individual datasets.
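The stairstep (multi-valued step) activation and the Spearman objective mentioned above can be illustrated roughly as follows; the number of steps, value range, and example vectors are assumptions for illustration.

```python
# Minimal sketch: a stairstep (multi-valued step) activation plus the Spearman
# correlation the SGPNN maximizes. Step count and range are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def stairstep(x, n_steps=5, lo=-1.0, hi=1.0):
    """Quantize x into n_steps evenly spaced levels between lo and hi."""
    levels = np.linspace(lo, hi, n_steps)
    edges = (levels[:-1] + levels[1:]) / 2     # midpoints between adjacent levels
    return levels[np.digitize(x, edges)]

x = np.linspace(-1.5, 1.5, 7)
print(stairstep(x))                            # inputs snapped to five discrete levels

rho, _ = spearmanr([1, 2, 3, 4], [1, 3, 2, 4]) # rank agreement of two label orderings
print(rho)
```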
Affiliation(s)
- Mukesh Prasad
- Australian Artificial Intelligence Institute, School of Computer Science, University of Technology Sydney, Ultimo, Sydney 2007, Australia; (A.E.); (C.-T.L.)