Adaptive data augmentation for Mandarin automatic speech recognition

Cited by: 2
Authors
Ding, Kai [1 ]
Li, Ruixuan [2 ]
Xu, Yuelin [1 ]
Du, Xingyue [3 ]
Deng, Bin [1 ]
Affiliations
[1] Sci & Technol Near Surface Detect Lab, Wuxi, Peoples R China
[2] Shanghai Acad Spaceflight Technol, Shanghai Aerosp Control Technol Inst, Shanghai, Peoples R China
[3] CSSC Ocean Explorat Technol Res Inst Co Ltd, Wuxi, Peoples R China
Keywords
Adaptive data augmentation; Data efficiency; Deep clustering; Speech recognition
DOI
10.1007/s10489-024-05381-6
CLC number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Audio data augmentation is widely adopted in automatic speech recognition (ASR) to alleviate overfitting. However, noise-based data augmentation converts the over-fitting problem into an under-fitting problem, which severely increases training time. With noise-based augmentation, informative features are not preserved during the generation process, and the generated audio clips can become noisy data for the acoustic model. To address this challenge, we propose an Adaptive audio Data Augmentation method, called ADA, based on deep clustering. ADA automatically selects the most informative augmented samples at each generation step. Moreover, two sample-selection strategies, RM and RS, are proposed. RM removes samples whose embeddings are far from the cluster center, while RS maintains the diversity of the augmented samples by sampling within each cluster. Experiments on Aishell-1 demonstrate that the proposed ADA method improves the data efficiency of end-to-end ASR models in both CNN-based and Transformer-based networks. ADA obtains 11.28% and 5.95% relative improvements on SS-CNN and LS-CNN, respectively, and a 4.35% improvement on S-Transformer compared with the state-of-the-art audio data augmentation method. Meanwhile, ADA reduces the number of augmented samples required by a factor of 2.7 for SS-CNN, LS-CNN, and S-Transformer. Qualitative and quantitative analyses confirm the effectiveness and efficiency of the proposed ADA method.
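As a rough illustration of the two selection strategies described in the abstract, the sketch below clusters augmented-sample embeddings with k-means and applies an RM-style filter (drop samples far from their cluster center) and an RS-style per-cluster sampler (keep a fixed budget from every cluster). The function names and parameters (rm_select, rs_select, n_clusters, keep_ratio, per_cluster) are illustrative assumptions, not the authors' implementation; the embeddings could, for instance, be time-averaged encoder outputs of the acoustic model for each augmented clip.

import numpy as np
from sklearn.cluster import KMeans

def rm_select(embeddings, n_clusters=8, keep_ratio=0.8):
    # RM-style filter (sketch): drop augmented samples whose embeddings lie far from their cluster center.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    centers = km.cluster_centers_[km.labels_]             # center of each sample's own cluster
    dists = np.linalg.norm(embeddings - centers, axis=1)  # distance of each sample to that center
    cutoff = np.quantile(dists, keep_ratio)               # keep the closest keep_ratio fraction
    return np.where(dists <= cutoff)[0]                   # indices of retained samples

def rs_select(embeddings, n_clusters=8, per_cluster=32, seed=0):
    # RS-style sampler (sketch): draw a fixed budget from every cluster to preserve diversity.
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]                # samples assigned to cluster c
        take = min(per_cluster, len(idx))                 # never request more than the cluster holds
        picked.extend(rng.choice(idx, size=take, replace=False))
    return np.array(sorted(picked))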
Pages: 5674-5687
Number of pages: 14