Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Cited by: 0

Authors:

Lin, Yist Y. [1]

Han, Tao [1]

Xu, Haihua [1]

Van Tung Pham [1]

Khassanov, Yerbolat [1]

Chong, Tze Yuang [1]

He, Yi [1]

Lu, Lu [1]

Ma, Zejun [1]

Affiliations:

[1] ByteDance, Beijing, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Keywords:

random utterance concatenation; data augmentation; short video; end-to-end; speech recognition;

DOI:

10.21437/Interspeech.2023-1272

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances of short-video spontaneous speech tend to be much shorter (about 3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (about 10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without degrading performance on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
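The idea of on-the-fly random utterance concatenation can be illustrated with a minimal sketch. The abstract does not specify how many utterances are joined, how they are sampled, or how transcripts are merged, so the function name `random_utterance_concat`, the `max_concat` parameter, the uniform group-size sampling, and the space-joined transcripts below are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional, Sequence, Tuple

# An utterance is modelled here as (audio samples, transcript).
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    utterances: Sequence[Utterance],
    max_concat: int = 3,  # hypothetical cap on utterances per group
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Shuffle short utterances and join random-sized groups into longer ones."""
    if rng is None:
        rng = random.Random()
    pool = list(utterances)
    rng.shuffle(pool)  # random pairing: neighbours in the pool are unrelated

    augmented: List[Utterance] = []
    i = 0
    while i < len(pool):
        # Assumption: group size drawn uniformly from 1..max_concat.
        k = rng.randint(1, max_concat)
        group = pool[i:i + k]
        # Concatenate audio end to end and join transcripts in the same order.
        samples = [s for audio, _ in group for s in audio]
        text = " ".join(t for _, t in group)
        augmented.append((samples, text))
        i += k
    return augmented


if __name__ == "__main__":
    batch = [([0.1] * 3, "hello"), ([0.2] * 2, "good morning"),
             ([0.3] * 4, "thanks"), ([0.4] * 1, "bye")]
    for audio, text in random_utterance_concat(batch, rng=random.Random(0)):
        print(len(audio), repr(text))
```

Calling the function anew for each training batch or epoch yields different groupings each time, which is one way to read "on-the-fly". Joining transcripts with a space is only suitable for languages that separate words with whitespace.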


Pages: 904-908

Page count: 5
