1
Argade D, Khairnar V, Vora D, Patil S, Kotecha K, Alfarhood S. Multimodal Abstractive Summarization using bidirectional encoder representations from transformers with attention mechanism. Heliyon 2024;10:e26162. [PMID: 38420442] [PMCID: PMC10900395] [DOI: 10.1016/j.heliyon.2024.e26162] [Received: 08/08/2023] [Revised: 01/28/2024] [Accepted: 02/08/2024]
Abstract
In recent decades, abstractive text summarization from multimodal input has attracted many researchers because it can gather information from various sources to create a concise summary. However, existing multimodal summarization methods perform well only on short videos and produce poor results for lengthy ones. To address these issues, this research presents Multimodal Abstractive Summarization using Bidirectional Encoder Representations from Transformers (MAS-BERT) with an attention mechanism. The purpose of video summarization is to speed up searching through a large collection of videos, so that users can quickly decide whether a video is relevant by reading its summary. Initially, data is obtained from the publicly available How2 dataset and encoded using a Bidirectional Gated Recurrent Unit (Bi-GRU) encoder and a Long Short-Term Memory (LSTM) encoder: the textual data, embedded in the embedding layer, is encoded with the Bi-GRU encoder, while the audio and video features are encoded with the LSTM encoder. A BERT-based attention mechanism then combines the modalities, and finally a Bi-GRU-based decoder generates the multimodal summary. Experimental results show that the proposed MAS-BERT achieves a ROUGE-1 score of 60.2, whereas the existing Decoder-only Multimodal Transformer (D-MmT) and the Factorized Multimodal Transformer based Decoder-Only Language model (FLORAL) achieve 49.58 and 56.89, respectively. Our work benefits users by providing better contextual information and user experience, and could help video-sharing platforms retain customers by allowing users to find relevant videos from their summaries.
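The comparison above is reported in ROUGE-1, which scores a generated summary by its unigram overlap with a reference summary. A minimal sketch of the ROUGE-1 F1 computation (a hypothetical helper for illustration, not the paper's evaluation code, which may tokenize and aggregate differently):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Tokenization here is plain whitespace splitting; real evaluation
    toolkits typically apply stemming and more careful tokenization.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared unigram with clipped frequency.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy sentences (not from the How2 dataset): identical texts score 1.0.
print(rouge1_f1("the model summarizes the video", "the model summarizes the video"))  # → 1.0
```

Scores such as 60.2 in the abstract correspond to this quantity expressed as a percentage.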
Affiliation(s)
- Dakshata Argade
- Terna Engineering College, Nerul, Navi Mumbai, 400706, India
- Deepali Vora
- Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, 412115, India
- Shruti Patil
- Symbiosis Institute of Technology, Pune Campus, Symbiosis International (Deemed University), Pune, 412115, India
- Symbiosis Centre for Applied Artificial Intelligence (SCAAI), Symbiosis Institute of Technology Pune Campus, Symbiosis International (Deemed University) (SIU), Lavale, Pune, 412115, India
- Ketan Kotecha
- Symbiosis Centre for Applied Artificial Intelligence (SCAAI), Symbiosis Institute of Technology Pune Campus, Symbiosis International (Deemed University) (SIU), Lavale, Pune, 412115, India
- Sultan Alfarhood
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh, 11543, Saudi Arabia
2
An event-based opinion summarization model for long Chinese text with sentiment awareness and parameter fusion mechanism. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03231-x]