Laptev A, Andrusenko A, Podluzhny I, Mitrofanov A, Medennikov I, Matveev Y. Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition. Sensors (Basel) 2021; 21:s21093063. [PMID: 33924798 PMCID: PMC8124527 DOI: 10.3390/s21093063] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/06/2021] [Revised: 04/23/2021] [Accepted: 04/25/2021] [Indexed: 11/16/2022]
Abstract
With the rapid development of speech assistants, adapting server-intended automatic speech recognition (ASR) solutions to run directly on the device has become crucial. For on-device speech recognition tasks, researchers and industry prefer end-to-end ASR systems, as they can be made resource-efficient while maintaining higher quality than hybrid systems. However, building end-to-end models requires a significant amount of speech data. Personalization, which mainly means handling out-of-vocabulary (OOV) words, is another challenging task associated with speech assistants. In this work, we consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate, embodied in the Babel Turkish and Babel Georgian tasks. We propose a method of dynamic acoustic unit augmentation based on the Byte Pair Encoding with dropout (BPE-dropout) technique. The method non-deterministically tokenizes utterances to extend the tokens' contexts and to regularize their distribution, which helps the model recognize unseen words. It also reduces the need to search for an optimal subword vocabulary size. The technique provides a steady improvement in regular and personalized (OOV-oriented) speech recognition tasks (at least 6% relative in word error rate (WER) and 25% relative in F-score) at no additional computational cost. Owing to the use of BPE-dropout, our monolingual Turkish Conformer has achieved a competitive result of 22.2% character error rate (CER) and 38.9% WER, which is close to the best published multilingual system.
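The non-deterministic tokenization described in the abstract can be illustrated with a minimal sketch of BPE with dropout. This is not the authors' implementation: the function name `bpe_dropout_encode`, the toy merge table `TOY_MERGES`, and the dropout value are assumptions chosen only for illustration. At every merge step, each applicable merge is skipped with probability `dropout`, so the same word can be split into different subword sequences on different passes.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = learned first).
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_dropout_encode(word, merges, dropout=0.1, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability `dropout`."""
    rng = rng or random
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or every merge was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    rng = random.Random(0)
    # dropout=0.0 reproduces ordinary deterministic BPE.
    print(bpe_dropout_encode("lower", TOY_MERGES, dropout=0.0))
    # With dropout > 0, repeated calls may yield different segmentations.
    for _ in range(3):
        print(bpe_dropout_encode("lower", TOY_MERGES, dropout=0.3, rng=rng))
```

With `dropout=0.0` the sketch reduces to standard BPE, and with `dropout=1.0` it falls back to single characters. In a training loop of the kind the abstract describes, one would presumably re-tokenize each transcript on every epoch so the model sees varied segmentations of the same word.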
Affiliation(s)
- Aleksandr Laptev (corresponding author)
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- Andrei Andrusenko
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- Ivan Podluzhny
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- Anton Mitrofanov
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- STC-Innovations Ltd., 194044 Saint-Petersburg, Russia
- Ivan Medennikov
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- STC-Innovations Ltd., 194044 Saint-Petersburg, Russia
- Yuri Matveev
- Corporate Laboratory of Human-Machine Interaction Technologies, Information Technologies and Programming Faculty, School of Translational Information Technologies, ITMO University, 196084 Saint-Petersburg, Russia
- STC-Innovations Ltd., 194044 Saint-Petersburg, Russia