Adaptive data augmentation for Mandarin automatic speech recognition

Cited by: 2
Authors
Ding, Kai [1 ]
Li, Ruixuan [2 ]
Xu, Yuelin [1 ]
Du, Xingyue [3 ]
Deng, Bin [1 ]
Affiliations
[1] Sci & Technol Near Surface Detect Lab, Wuxi, Peoples R China
[2] Shanghai Acad Spaceflight Technol, Shanghai Aerosp Control Technol Inst, Shanghai, Peoples R China
[3] CSSC Ocean Explorat Technol Res Inst Co Ltd, Wuxi, Peoples R China
Keywords
Adaptive data augmentation; Data efficiency; Deep clustering; Speech recognition
DOI
10.1007/s10489-024-05381-6
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Audio data augmentation is widely adopted in automatic speech recognition (ASR) to alleviate overfitting. However, noise-based data augmentation converts the over-fitting problem into an under-fitting problem, which severely increases training time. With noise-based augmentation, informative features are not preserved during the generation process, and the generated audio clips can become noisy data for the acoustic model. To address this challenge, we propose an Adaptive audio Data Augmentation method, called ADA, based on deep clustering. ADA automatically selects the most informative augmented samples at each generation step. Moreover, two sample-selection strategies, RM and RS, are proposed. RM removes samples whose embeddings are far from the cluster center, while RS maintains the diversity of the augmented samples by sampling within each cluster. Experiments on Aishell-1 demonstrate that the proposed ADA method improves the data efficiency of end-to-end ASR models in both CNN-based and Transformer-based networks. ADA obtains 11.28% and 5.95% relative improvements on SS-CNN and LS-CNN, respectively, and a 4.35% improvement on S-Transformer compared with the state-of-the-art audio data augmentation method. Meanwhile, ADA reduces the number of augmented samples required by a factor of 2.7 for SS-CNN, LS-CNN, and S-Transformer. Qualitative and quantitative analyses confirm the effectiveness and efficiency of the proposed ADA method.
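As a rough illustration of the two selection strategies described in the abstract, the sketch below clusters augmented-sample embeddings with k-means and applies an RM-style filter (drop samples far from their cluster center) and an RS-style per-cluster sampler (keep a fixed budget from every cluster). The function names and parameters (rm_select, rs_select, n_clusters, keep_ratio, per_cluster) are illustrative assumptions, not the authors' implementation; the embeddings could, for instance, be time-averaged encoder outputs of the acoustic model for each augmented clip.

import numpy as np
from sklearn.cluster import KMeans

def rm_select(embeddings, n_clusters=8, keep_ratio=0.8):
    # RM-style filter (sketch): drop augmented samples whose embeddings lie far from their cluster center.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    centers = km.cluster_centers_[km.labels_]             # center of each sample's own cluster
    dists = np.linalg.norm(embeddings - centers, axis=1)  # distance of each sample to that center
    cutoff = np.quantile(dists, keep_ratio)               # keep the closest keep_ratio fraction
    return np.where(dists <= cutoff)[0]                   # indices of retained samples

def rs_select(embeddings, n_clusters=8, per_cluster=32, seed=0):
    # RS-style sampler (sketch): draw a fixed budget from every cluster to preserve diversity.
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]                # samples assigned to cluster c
        take = min(per_cluster, len(idx))                 # never request more than the cluster holds
        picked.extend(rng.choice(idx, size=take, replace=False))
    return np.array(sorted(picked))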
Pages: 5674-5687
Number of pages: 14