Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Cited by: 0
Authors
Lin, Yist Y. [1]
Han, Tao [1]
Xu, Haihua [1]
Van Tung Pham [1]
Khassanov, Yerbolat [1]
Chong, Tze Yuang [1]
He, Yi [1]
Lu, Lu [1]
Ma, Zejun [1]
Affiliation
[1] ByteDance, Beijing, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Keywords
random utterance concatenation; data augmentation; short video; end-to-end; speech recognition;
DOI
10.21437/Interspeech.2023-1272
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when training and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch in the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances for short-video spontaneous speech tend to be much shorter (about 3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (about 10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without a performance drop on short utterances. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to various utterance lengths.
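The abstract does not give implementation details, so the following is only a minimal sketch of what on-the-fly random utterance concatenation could look like inside a data loader. The function name `random_utterance_concat`, the parameters `max_concat` and `concat_prob`, sampling partners from the same batch, and joining transcripts with a space are all assumptions made for illustration, not the authors' published procedure.

```python
import random
from typing import List, Optional, Sequence, Tuple

# An utterance is modeled as (waveform samples, transcript). Hypothetical type.
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    batch: Sequence[Utterance],
    max_concat: int = 3,       # assumed upper bound on utterances per sample
    concat_prob: float = 0.5,  # assumed probability of augmenting a sample
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Lengthen short training utterances by appending random batch mates."""
    rng = rng or random.Random()
    augmented: List[Utterance] = []
    for samples, text in batch:
        # Leave the utterance unchanged when concatenation is not possible
        # or when the random draw says to skip augmentation.
        skip = len(batch) < 2 or max_concat < 2 or rng.random() >= concat_prob
        if skip:
            augmented.append((list(samples), text))
            continue
        # Pick how many utterances the new sample is built from (2..max_concat).
        total = rng.randint(2, max_concat)
        new_samples = list(samples)
        new_text = [text]
        for _ in range(total - 1):
            extra_samples, extra_text = rng.choice(batch)
            new_samples.extend(extra_samples)  # concatenate audio in time
            new_text.append(extra_text)        # concatenate the labels
        # Joining with a space is an assumption; scripts without word
        # spacing would need a different separator.
        augmented.append((new_samples, " ".join(new_text)))
    return augmented
```

Because the concatenation is applied per batch at load time, each epoch can see different long utterances without any extra data being stored, which is the sense in which such a scheme would be "on-the-fly".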
Pages: 904-908
Page count: 5