1
Baowaly MK, Sarkar BC, Walid MAA, Ahamad MM, Singh BC, Alvarado ES, Ashraf I, Samad MA. Deep transfer learning-based bird species classification using mel spectrogram images. PLoS One 2024; 19:e0305708. [PMID: 39133732] [PMCID: PMC11318847] [DOI: 10.1371/journal.pone.0305708]
Abstract
Bird species classification is of significant importance in ornithology, as it supports the assessment and monitoring of environmental dynamics, including habitat modifications, migratory behaviors, pollution levels, and disease occurrences. Traditional methods of bird classification, such as visual identification, are time-intensive and require a high level of expertise, whereas audio-based classification is a promising approach for automating bird species identification. This study establishes an audio-based classification system for 264 Eastern African bird species using modified deep transfer learning. In particular, the pre-trained EfficientNet architecture is fine-tuned to learn the patterns in mel spectrogram images that are pertinent to this classification task. The fine-tuned EfficientNet model is combined with recurrent neural networks (RNNs), namely the gated recurrent unit (GRU) and long short-term memory (LSTM), which capture the temporal dependencies in audio signals and thereby enhance classification accuracy. The dataset used in this work contains nearly 17,000 bird sound recordings across a diverse range of species. Experiments were conducted with several combinations of EfficientNet and RNNs; EfficientNet-B7 with GRU surpassed the other experimental models, reaching an accuracy of 84.03% and a macro-average precision of 0.8342.
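To make the pipeline concrete, here is a minimal sketch, in librosa and PyTorch, of how a recording could be converted to a dB-scaled mel spectrogram and fed to an EfficientNet-B7 backbone whose frame-wise features drive a GRU head. The layer sizes, the treatment of spectrogram frames as a sequence, and all hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b7, EfficientNet_B7_Weights

def audio_to_mel_image(path, sr=22050, n_mels=128):
    """Load a recording and convert it to a dB-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, frames)

class CnnGruClassifier(nn.Module):
    """EfficientNet-B7 backbone; its frame-wise features feed a GRU head."""
    def __init__(self, n_classes=264, hidden=256):
        super().__init__()
        backbone = efficientnet_b7(weights=EfficientNet_B7_Weights.DEFAULT)
        self.features = backbone.features                  # conv extractor
        self.gru = nn.GRU(2560, hidden, batch_first=True)  # 2560 = B7 channels
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):
        # x: (B, 3, H, W); a mel image is replicated to 3 channels beforehand
        f = self.features(x)                   # (B, 2560, h, w)
        f = f.mean(dim=2).permute(0, 2, 1)     # pool frequency -> (B, w, 2560)
        _, h = self.gru(f)                     # final hidden state
        return self.fc(h[-1])                  # (B, n_classes)
```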
Affiliation(s)
- Mrinal Kanti Baowaly
- Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Bisnu Chandra Sarkar
- Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Md. Abul Ala Walid
- Department of Computer Science and Engineering, Khulna University of Engineering and Technology, Khulna, Bangladesh
- Department of Data Science, Bangabandhu Sheikh Mujibur Rahman Digital University, Gazipur, Bangladesh
- Md. Martuza Ahamad
- Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Bikash Chandra Singh
- School of Cybersecurity, Old Dominion University, Norfolk, VA, United States of America
- Eduardo Silva Alvarado
- Universidad Europea del Atlántico, Santander, Spain
- Universidad Internacional Iberoamericana, Campeche, México
- Universidad de La Romana, La Romana, República Dominicana
- Imran Ashraf
- Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si, Gyeongsangbuk-do, South Korea
- Md. Abdus Samad
- Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si, Gyeongsangbuk-do, South Korea
2
Nieto-Mora DA, Ferreira de Oliveira MC, Sanchez-Giraldo C, Duque-Muñoz L, Isaza-Narváez C, Martínez-Vargas JD. Soundscape Characterization Using Autoencoders and Unsupervised Learning. Sensors (Basel) 2024; 24:2597. [PMID: 38676214] [PMCID: PMC11054175] [DOI: 10.3390/s24082597]
Abstract
Passive acoustic monitoring (PAM) through acoustic recorder units (ARUs) shows promise in detecting early landscape changes linked to functional and structural patterns, including species richness, acoustic diversity, community interactions, and human-induced threats. However, current approaches rely primarily on supervised methods, which require prior knowledge of the collected datasets; this reliance poses challenges given the large volumes of ARU data. In this work, we propose an unsupervised framework that uses autoencoders to extract soundscape features. We applied this framework to a dataset of Colombian landscapes captured by 31 AudioMoth recorders. Our method clusters the autoencoder features and represents each cluster with a prototype spectrogram obtained by passing the centroid features through the decoder part of the network. The analysis provides valuable insights into the distribution and temporal patterns of the various sound compositions within the study area. Using autoencoders, we identify significant soundscape patterns characterized by recurring and intense sound types across multiple frequency ranges. This comprehensive view of the study area's soundscape allows us to pinpoint crucial sound sources and gain deeper insight into its acoustic environment. Our results encourage further exploration of unsupervised algorithms in soundscape analysis as a promising path for understanding and monitoring environmental changes.
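The core idea, clustering autoencoder features and decoding cluster centroids back into prototype spectrograms, can be sketched as follows. The dense architecture, latent size, and the choice of k-means are assumptions made for illustration; the paper's exact network and clustering setup may differ.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class SpectrogramAE(nn.Module):
    """Small dense autoencoder over flattened spectrogram patches."""
    def __init__(self, n_in=128 * 64, n_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 512), nn.ReLU(),
                                     nn.Linear(512, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 512), nn.ReLU(),
                                     nn.Linear(512, n_in))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def cluster_prototypes(model, specs, k=10):
    """Cluster latent codes, then decode centroids into prototype spectrograms."""
    with torch.no_grad():
        _, z = model(specs)                            # latent features
    km = KMeans(n_clusters=k, n_init=10).fit(z.numpy())
    centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)
    with torch.no_grad():
        protos = model.decoder(centroids)              # one prototype per cluster
    return km.labels_, protos.reshape(k, 128, 64)
```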
Affiliation(s)
- Daniel Alexis Nieto-Mora
- Máquinas Inteligentes y Reconocimiento de Patrones (MIRP), Instituto Tecnológico Metropolitano ITM, Medellín 050034, Colombia
- Camilo Sanchez-Giraldo
- Grupo Herpetológico de Antioquia, Institute of Biology, Universidad de Antioquia-UdeA, Medellín 050010, Colombia
- Leonardo Duque-Muñoz
- Máquinas Inteligentes y Reconocimiento de Patrones (MIRP), Instituto Tecnológico Metropolitano ITM, Medellín 050034, Colombia
- Claudia Isaza-Narváez
- SISTEMIC, Facultad de Ingeniería, Universidad de Antioquia-UdeA, Medellín 050010, Colombia
3
Abbas S, Ojo S, Al Hejaili A, Sampedro GA, Almadhor A, Zaidi MM, Kryvinska N. Artificial intelligence framework for heart disease classification from audio signals. Sci Rep 2024; 14:3123. [PMID: 38326488] [PMCID: PMC10850078] [DOI: 10.1038/s41598-024-53778-7]
Abstract
As cardiovascular disorders are prevalent, there is a growing demand for reliable and precise diagnostic methods in this domain. Audio-signal-based heart disease detection is a promising area of research that leverages the sounds generated by the heart to identify and diagnose cardiovascular disorders. Machine learning (ML) and deep learning (DL) techniques are pivotal in classifying and identifying heart disease from audio signals. This study investigates ML and DL techniques for detecting heart disease from noisy sound signals, using two dataset subsets from the PASCAL challenge that contain real heart audio recordings. The signals are visually depicted using spectrograms and Mel-frequency cepstral coefficients (MFCCs). We employ data augmentation, introducing synthetic noise into the heart sound signals, to improve model performance, and we develop a feature ensembler that integrates various audio feature extraction techniques. Several machine learning and deep learning classifiers are evaluated for heart disease detection. Among the models studied and previous study findings, the multilayer perceptron performed best, with an accuracy of 95.65%. These results demonstrate the potential of the methodology to accurately detect heart disease from sound signals and present promising opportunities for enhancing medical diagnosis and patient care.
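As a rough illustration of the feature extraction and noise-based augmentation described above, the sketch below computes MFCC summary features with librosa and injects Gaussian noise at a target SNR. The summary statistics, the SNR value, and the MLP hidden-layer sizes are assumptions, not the paper's settings.

```python
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def heart_sound_features(path, sr=4000, n_mfcc=13):
    """MFCC mean/std summary features for one heart-sound recording."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])

def add_synthetic_noise(y, snr_db=20.0):
    """Augment a signal with Gaussian noise at a target signal-to-noise ratio."""
    rms = np.sqrt(np.mean(y ** 2))
    noise_rms = rms / (10 ** (snr_db / 20))
    return y + np.random.normal(0.0, noise_rms, size=y.shape)

# With feature matrix X and labels y_labels built from (augmented) recordings:
# clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500).fit(X, y_labels)
```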
Affiliation(s)
- Sidra Abbas
- Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Stephen Ojo
- Department of Electrical and Computer Engineering, College of Engineering, Anderson University, Anderson, SC 29621, USA
- Abdullah Al Hejaili
- Computer Science Department, Faculty of Computers and Information Technology, University of Tabuk, Tabuk 71491, Saudi Arabia
- Gabriel Avelino Sampedro
- Faculty of Information and Communication Studies, University of the Philippines Open University, Los Baños 4031, Philippines
- Center for Computational Imaging and Visual Innovations, De La Salle University, 2401 Taft Ave., Malate, 1004 Manila, Philippines
- Ahmad Almadhor
- Department of Computer Engineering and Networks, College of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia
- Monji Mohamed Zaidi
- Department of Electrical Engineering, College of Engineering, King Khalid University, Abha, Saudi Arabia
- Natalia Kryvinska
- Information Systems Department, Faculty of Management, Comenius University in Bratislava, Odbojárov 10, 82005 Bratislava 25, Slovakia
4
Ma B, Gao R, Zhang J, Zhu X. A YOLOX-Based Automatic Monitoring Approach of Broken Wires in Prestressed Concrete Cylinder Pipe Using Fiber-Optic Distributed Acoustic Sensors. Sensors (Basel) 2023; 23:2090. [PMID: 36850690] [PMCID: PMC9963517] [DOI: 10.3390/s23042090]
Abstract
Wire breakage is a major factor in the failure of prestressed concrete cylinder pipes (PCCP). The presented work investigates an approach for automatically monitoring broken wires in PCCP using fiber-optic distributed acoustic sensors (DAS). The study designs a full-scale (1:1) prototype wire-break monitoring experiment using a DN4000 mm PCCP buried underground in a simulated test environment. The collected wire-break signals are combined with noise signals previously recorded in an operating pipe and transformed into spectrograms to form the wire-break signal dataset. A deep learning-based target detection algorithm is developed to detect wire-break events by extracting image features from the spectrograms in the dataset. The results show that the recall, precision, F1 score, and false detection rate of the pruned model reach 100%, 100%, 1, and 0%, respectively; the video detection frame rate reaches 35 fps, and the model size is only 732 KB. The method thus greatly simplifies the model without loss of precision, providing an effective means of identifying PCCP wire-break signals, while the lightweight model is well suited to embedded deployment of a PCCP wire-break monitoring system.
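The signal-to-image step can be sketched minimally: one DAS channel's time series is rendered as a spectrogram image of the kind a YOLOX-style detector would consume. The STFT window parameters and figure size are assumptions, and the detector training itself is omitted.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

def das_to_spectrogram_image(trace, fs, out_png):
    """Render one DAS channel's time series as a spectrogram image file."""
    f, t, Sxx = signal.spectrogram(trace, fs=fs, nperseg=1024, noverlap=512)
    plt.figure(figsize=(4, 4))
    plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")
    plt.axis("off")                                   # image only, no axes
    plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
    plt.close()
```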
Affiliation(s)
- Baolong Ma
- School of Water Conservancy and Hydroelectric Power, Hebei University of Engineering, Handan 056038, China
- School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China
- Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research (IWHR), Beijing 100038, China
- Ruizhen Gao
- School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China
- Jingjun Zhang
- School of Mechanical and Equipment Engineering, Hebei University of Engineering, Handan 056038, China
- Xinmin Zhu
- Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research (IWHR), Beijing 100038, China
5
Bandara M, Jayasundara R, Ariyarathne I, Meedeniya D, Perera C. Forest Sound Classification Dataset: FSC22. Sensors (Basel) 2023; 23:2032. [PMID: 36850626] [PMCID: PMC9966992] [DOI: 10.3390/s23042032]
Abstract
The study of environmental sound classification (ESC) has become popular over the years due to the intricate nature of environmental sounds and the evolution of deep learning (DL) techniques. Forest ESC is one use case of ESC that has recently been widely experimented with to identify illegal activities inside a forest. At present, however, there is a lack of public datasets covering the full range of sounds expected in a forest environment, and most existing experiments have used generic environmental sound datasets such as ESC-50, U8K, and FSD50K. Importantly, in DL-based sound classification, a lack of quality data can mislead training, leaving the resulting predictions questionable. Hence, a well-defined benchmark forest environment sound dataset is required. This paper proposes FSC22, which fills this gap. It includes 2025 sound clips under 27 acoustic classes covering the sounds expected in a forest environment. We describe the dataset preparation procedure, validate the dataset using several baseline sound classification models, and analyze it in comparison with other available datasets. The dataset can therefore be used by researchers and developers working on forest observatory tasks.
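As a hedged sketch of how a dataset like this might be loaded and checked against a simple baseline, the code below assumes clips sit in per-class folders (a hypothetical layout, not necessarily FSC22's actual structure) and scores a random forest on MFCC means; the paper's own baselines are stronger deep models.

```python
import glob
import os
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def load_clips(root):
    """Assumes a hypothetical layout: <root>/<class_name>/<clip>.wav."""
    X, y = [], []
    for path in glob.glob(os.path.join(root, "*", "*.wav")):
        sig, sr = librosa.load(path, sr=22050)
        X.append(librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=20).mean(axis=1))
        y.append(os.path.basename(os.path.dirname(path)))
    return np.array(X), np.array(y)

X, y = load_clips("FSC22/")
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y)
print(RandomForestClassifier(300).fit(Xtr, ytr).score(Xte, yte))
```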
Affiliation(s)
- Meelan Bandara
- Department of Computer Science & Engineering, University of Moratuwa, Moratuwa 10400, Sri Lanka
- Roshinie Jayasundara
- Department of Computer Science & Engineering, University of Moratuwa, Moratuwa 10400, Sri Lanka
- Isuru Ariyarathne
- Department of Computer Science & Engineering, University of Moratuwa, Moratuwa 10400, Sri Lanka
- Dulani Meedeniya
- Department of Computer Science & Engineering, University of Moratuwa, Moratuwa 10400, Sri Lanka
- Charith Perera
- School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
6
Nogueira AFR, Oliveira HS, Machado JJM, Tavares JMRS. Transformers for Urban Sound Classification-A Comprehensive Performance Evaluation. Sensors (Basel) 2022; 22:8874. [PMID: 36433471] [PMCID: PMC9699161] [DOI: 10.3390/s22228874]
Abstract
Many relevant sound events occur in urban scenarios, and robust classification models are required to identify abnormal and relevant events correctly. Such models need to identify these events promptly and effectively, and it is also essential to determine for how long an event prevails. This article presents an extensive analysis developed to identify the best-performing model for classifying a broad set of sound events occurring in urban scenarios. Transformer models were analyzed and trained on available public datasets with different sets of sound classes, and their performance was compared with that of a baseline model and of end-to-end convolutional models. Furthermore, the benefits of pre-training from the image and sound domains and of data augmentation techniques were identified, and complementary methods for improving model performance and good practices for obtaining robust sound classification models were investigated. After an extensive evaluation, the most promising results were obtained by a Transformer model trained with an Adam optimizer with weight decay and transfer learning from the audio domain, reusing the weights from AudioSet; this led to accuracy scores of 89.8% on the UrbanSound8K dataset, 95.8% on ESC-50, and 99% on ESC-10.
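The winning recipe, weight-decayed Adam plus audio-domain transfer, reduces to a short fine-tuning loop. In the sketch below, `model` is a stand-in for a Transformer audio classifier already initialized from AudioSet weights; its loading, and the learning-rate and decay values, are assumptions.

```python
import torch
import torch.nn as nn

def fine_tune(model, loader, n_epochs=10, lr=1e-5, wd=0.01):
    """Fine-tune a pretrained audio classifier with weight-decayed Adam."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        for spec, target in loader:        # spec: batched spectrogram tensors
            opt.zero_grad()
            loss = loss_fn(model(spec), target)
            loss.backward()
            opt.step()
    return model
```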
Collapse
Affiliation(s)
| | - Hugo S. Oliveira
- Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
| | - José J. M. Machado
- Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
| | - João Manuel R. S. Tavares
- Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
| |
Collapse
|
7
Nogueira AFR, Oliveira HS, Machado JJM, Tavares JMRS. Sound Classification and Processing of Urban Environments: A Systematic Literature Review. Sensors (Basel) 2022; 22:8608. [PMID: 36433204] [PMCID: PMC9698075] [DOI: 10.3390/s22228608]
Abstract
Audio recognition can be used in smart cities for security, surveillance, manufacturing, autonomous vehicles, and noise mitigation, to name a few applications. However, urban sounds are everyday audio events with unstructured characteristics, containing different genres of noise and sounds unrelated to the sound event under study, which makes their classification a challenging problem. The main objective of this literature review is therefore to summarize the most recent works on the subject, understand the current approaches, and identify their limitations. Based on the reviewed articles, deep learning (DL) architectures, attention mechanisms, data augmentation techniques, and pretraining are the most crucial factors to consider when creating an efficient sound classification model. The best results found were obtained by Mushtaq and Su, in 2020, using a DenseNet-161 with pretrained weights from ImageNet and NA-1 and NA-2 as augmentation techniques, reaching 97.98%, 98.52%, and 99.22% on the UrbanSound8K, ESC-50, and ESC-10 datasets, respectively. Nonetheless, the use of these models in real-world scenarios has not been properly addressed, so their effectiveness in such situations remains questionable.
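For reference, the backbone of that best-found setup, DenseNet-161 with pretrained ImageNet weights and its head swapped for the target classes, takes only a few lines in torchvision; the NA-1/NA-2 augmentation techniques are not reproduced here.

```python
import torch.nn as nn
from torchvision.models import densenet161, DenseNet161_Weights

# ImageNet-pretrained DenseNet-161 with a new classification head
model = densenet161(weights=DenseNet161_Weights.IMAGENET1K_V1)
model.classifier = nn.Linear(model.classifier.in_features, 50)  # e.g. ESC-50
```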
Affiliation(s)
- Hugo S. Oliveira
- Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
- José J. M. Machado
- Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
- João Manuel R. S. Tavares
- Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
8
Xu D, Zhang Z, Shi J. A New Multi-Sensor Stream Data Augmentation Method for Imbalanced Learning in Complex Manufacturing Process. Sensors (Basel) 2022; 22:4042. [PMID: 35684662] [PMCID: PMC9185280] [DOI: 10.3390/s22114042]
Abstract
Multiple sensors are often mounted in a complex manufacturing process to detect failures. Because modern manufacturing processes are highly reliable, failures happen only occasionally, so the data collected in practice are extremely imbalanced, which often biases supervised learning models. Data collected by multiple sensors can be regarded as multivariate time series, or multi-sensor stream data, whose high dimensionality makes model building even more challenging. In this study, a new and easy-to-apply data augmentation approach, imbalanced multi-sensor stream data augmentation (IMSDA), is proposed for imbalanced learning. IMSDA can generate high-quality failure data across all dimensions, and the generated data preserve the temporal properties of the original multivariate time series. Both the raw data and the generated data are used to train the failure detection models, while the models are tested on the same real dataset. The proposed method is applied to a real-world industry case. Results show that IMSDA not only yields good-quality failure data that reduce the imbalance level but also significantly improves the performance of supervised failure detection models.
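IMSDA itself is not specified in the abstract, but the general pattern, synthesizing extra failure-class windows that stay close to the originals' temporal structure, can be sketched with simple jitter-based oversampling. This is an illustrative stand-in rather than the paper's algorithm; the noise scale and oversampling factor are assumptions.

```python
import numpy as np

def oversample_failures(X, y, minority=1, factor=5, sigma=0.01):
    """Jitter-based oversampling of minority-class (failure) windows.
    X: (n_samples, n_sensors, n_timesteps); y: integer class labels."""
    Xf = X[y == minority]
    synth = [Xf + np.random.normal(0.0, sigma * Xf.std(), Xf.shape)
             for _ in range(factor)]                    # noisy copies
    X_new = np.concatenate([X] + synth)
    y_new = np.concatenate([y, np.full(len(Xf) * factor, minority)])
    return X_new, y_new
```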
Affiliation(s)
- Dongting Xu
- School of Mechanical Engineering, Southeast University, Nanjing 211189, China
- School of Mechanical Engineering, Nanjing Institute of Technology, Nanjing 211167, China
- Zhisheng Zhang
- School of Mechanical Engineering, Southeast University, Nanjing 211189, China
- Jinfei Shi
- School of Mechanical Engineering, Southeast University, Nanjing 211189, China
- School of Mechanical Engineering, Nanjing Institute of Technology, Nanjing 211167, China
9
Trapanotto M, Nanni L, Brahnam S, Guo X. Convolutional Neural Networks for the Identification of African Lions from Individual Vocalizations. J Imaging 2022; 8:96. [PMID: 35448223] [PMCID: PMC9029749] [DOI: 10.3390/jimaging8040096]
Abstract
The classification of vocal individuality for passive acoustic monitoring (PAM) and census of animals is becoming an increasingly popular area of research. Nearly all studies in this field have relied on classic audio representations and classifiers, such as support vector machines (SVMs) trained on spectrograms or Mel-frequency cepstral coefficients (MFCCs). In contrast, most current bioacoustic species classification exploits the power of deep learners and more cutting-edge audio representations. A significant reason for avoiding deep learning in vocal identity classification is the tiny sample size of the available collections of labeled individual vocalizations; as is well known, deep learners require large datasets to avoid overfitting. One way to handle small datasets with deep learning methods is transfer learning. In this work, we evaluate the performance of three pretrained CNNs (VGG16, ResNet50, and AlexNet) on a small, publicly available lion roar dataset containing approximately 150 samples taken from five male lions. Each network is retrained on eight representations of the samples: MFCCs, the spectrogram, and the Mel spectrogram, along with several newer ones, such as the VGGish and Stockwell representations and those based on the recently proposed LM spectrogram. The performance of these networks, both individually and in ensembles, is analyzed and corroborated using the equal error rate and shown to surpass previous classification attempts on this dataset; the best single network achieved over 95% accuracy and the best ensembles over 98% accuracy. This study's contributions to individual vocal classification include demonstrating that it is valuable and possible, with caution, to use transfer learning with single pretrained CNNs on the small datasets available in this problem domain. We also contribute to bioacoustics generally by comparing the performance of many state-of-the-art audio representations, including for the first time the LM spectrogram and Stockwell representations. All source code for this study is available on GitHub.
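A hedged sketch of the transfer-learning setup for one of the three networks follows: VGG16 with ImageNet weights, its convolutional base frozen and the final layer replaced for the five individuals. Whether the authors froze the base, and which layers they retrained, is an assumption here.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
for p in model.features.parameters():      # freeze the convolutional base
    p.requires_grad = False
model.classifier[6] = nn.Linear(4096, 5)   # new head: five male lions
```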
Affiliation(s)
- Martino Trapanotto
- Department of Information Engineering, University of Padua, Via Gradenigo 6, 35131 Padova, Italy
- Loris Nanni
- Department of Information Engineering, University of Padua, Via Gradenigo 6, 35131 Padova, Italy
- Sheryl Brahnam
- Information Technology and Cybersecurity, Missouri State University, 901 S. National, Springfield, MO 65897, USA
- Correspondence: Tel.: +1-417-873-9979
- Xiang Guo
- Information Technology and Cybersecurity, Missouri State University, 901 S. National, Springfield, MO 65897, USA
10
Dayal A, Yeduri SR, Koduru BH, Jaiswal RK, Soumya J, Srinivas MB, Pandey OJ, Cenkeramaddi LR. Lightweight deep convolutional neural network for background sound classification in speech signals. J Acoust Soc Am 2022; 151:2773. [PMID: 35461490] [DOI: 10.1121/10.0010257]
Abstract
Recognizing background information in human speech signals is extremely useful in a wide range of practical applications, and many articles on background sound classification have been published. The problem has not, however, been addressed for background sounds embedded in real-world human speech signals. This work therefore proposes a lightweight deep convolutional neural network (CNN), used in conjunction with spectrograms, for efficient background sound classification in practical human speech signals. The proposed model classifies 11 background sounds embedded in human speech: airplane, airport, babble, car, drone, exhibition, helicopter, restaurant, station, street, and train. The deep CNN consists of four convolution layers, four max-pooling layers, and one fully connected layer, and it is tested on human speech signals with varying signal-to-noise ratios (SNRs). The proposed model achieves an overall background sound classification accuracy of 95.2% on human speech signals spanning a wide range of SNRs, and it outperforms the benchmark models in both accuracy and inference time when evaluated on edge computing devices.
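Since the abstract spells out the topology (four convolution layers, four max-pooling layers, one fully connected layer), a shape-faithful sketch is straightforward; the filter counts, kernel sizes, and input spectrogram size below are assumptions.

```python
import torch
import torch.nn as nn

class LightweightCNN(nn.Module):
    """Four conv + four max-pool layers and one fully connected layer."""
    def __init__(self, n_classes=11):
        super().__init__()
        chans = [1, 16, 32, 64, 128]        # assumed filter counts
        blocks = []
        for cin, cout in zip(chans, chans[1:]):
            blocks += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                       nn.MaxPool2d(2)]
        self.conv = nn.Sequential(*blocks)
        self.fc = nn.LazyLinear(n_classes)  # infers the flattened size

    def forward(self, x):                   # x: (B, 1, 128, 128) spectrogram
        return self.fc(torch.flatten(self.conv(x), 1))
```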
Affiliation(s)
- Aveen Dayal
- Department of ICT, University of Agder, Grimstad 4879, Norway
- J Soumya
- Birla Institute of Technology and Science-Pilani, Hyderabad, India
- M B Srinivas
- Birla Institute of Technology and Science-Pilani, Hyderabad, India
- Om Jee Pandey
- Department of Electronics Engineering, IIT BHU Varanasi, Varanasi 221005, India