Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Model Performance

Cited: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text to speech; contrastive learning; data augmentation; embeddings; synthetic data filtering; text-to-speech; REPRESENTATIONS;
DOI
10.1109/ACCESS.2024.3482970
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method, robust and adaptable, enhances ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
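The filtering step the abstract describes can be illustrated with a minimal sketch: given paired audio and transcript embeddings from a contrastive model, samples whose embeddings do not align semantically are discarded before ASR training. The similarity measure, the 0.7 threshold, and the `filter_synthetic_samples` helper below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # standard cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_synthetic_samples(samples, threshold=0.7):
    """Keep only synthetic samples whose audio embedding aligns with its
    transcript embedding above the similarity threshold."""
    kept = []
    for audio_emb, text_emb, sample_id in samples:
        if cosine_similarity(audio_emb, text_emb) >= threshold:
            kept.append(sample_id)
    return kept

# Toy example: one well-aligned and one misaligned audio/text pair.
aligned = (np.array([1.0, 0.0]), np.array([0.9, 0.1]), "good.wav")
misaligned = (np.array([1.0, 0.0]), np.array([0.0, 1.0]), "bad.wav")
print(filter_synthetic_samples([aligned, misaligned]))  # -> ['good.wav']
```

In practice the embeddings would come from the paper's trained contrastive encoders rather than toy vectors, and the threshold would be tuned per model and dataset, since the results indicate larger models such as Whisper Large V3 benefit from more aggressive filtering.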
Pages: 155136-155150
Page Count: 15