Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Cited by: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text to speech; contrastive learning; data augmentation; embeddings; synthetic data filtering; text-to-speech; REPRESENTATIONS;
DOI
10.1109/ACCESS.2024.3482970
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method, robust and adaptable, enhances ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
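The core filtering idea described in the abstract — aligning synthetic audio and transcript representations with a contrastive model, then discarding semantically mismatched pairs — can be sketched as a cosine-similarity threshold over precomputed embeddings. This is an illustrative sketch only: the function name, the quantile-based threshold, and the assumption that embeddings are already computed are mine, not the authors' exact procedure.

```python
import numpy as np

def filter_synthetic_samples(audio_emb, text_emb, keep_fraction=0.8):
    """Keep the synthetic samples whose audio embedding best matches its
    transcript embedding.

    audio_emb, text_emb: (N, D) arrays of paired embeddings from a
    contrastive audio-text model (assumed to be precomputed).
    Returns a boolean keep-mask and the per-pair cosine similarities.
    """
    # L2-normalize both embedding sets so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(a * t, axis=1)  # cosine similarity of each (audio, text) pair

    # Drop the worst-aligned tail; keep_fraction controls filtering aggressiveness
    threshold = np.quantile(sims, 1.0 - keep_fraction)
    keep = sims >= threshold
    return keep, sims
```

Lowering `keep_fraction` corresponds to the more aggressive filtering that, per the abstract, benefits larger models such as Whisper Large V3.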
Pages: 155136-155150
Page count: 15