Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Model Performance

Cited by: 0
Authors
Perezhohin, Yuriy [1 ,2 ]
Santos, Tiago [1 ,2 ]
Costa, Victor [1 ,2 ]
Peres, Fernando [1 ]
Castelli, Mauro [2 ]
Affiliations
[1] MyNorth AI Res, P-2780125 Oeiras, Portugal
[2] Univ NOVA Lisboa, NOVA Informat Management Sch NOVA IMS, Campus Campolide, P-1070312 Lisbon, Portugal
Source
IEEE ACCESS | 2024年 / 12卷
Keywords
Hidden Markov models; Feature extraction; Filtering; Data models; Synthetic data; Training; Contrastive learning; Accuracy; Adaptation models; Transformers; Automatic speech recognition; Text-to-speech; data augmentation; embeddings; synthetic data filtering; REPRESENTATIONS
DOI
10.1109/ACCESS.2024.3482970
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
This paper presents a novel methodology for enhancing Automatic Speech Recognition (ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address the challenge of incorporating synthetic data into ASR training, especially in scenarios with limited real-world data or unique linguistic characteristics. The method utilizes a contrastive learning model to align representations of synthetic audio and its corresponding text transcripts, enabling the identification and removal of low-quality samples that do not align well semantically. We evaluate the methodology on a medium-resource language across two distinct datasets: a general-domain dataset and a regionally specific dataset characterized by unique pronunciation patterns. Experimental results reveal that the optimal filtering strategy depends on both model capacity and dataset characteristics. Larger models, like Whisper Large V3, particularly benefit from aggressive filtering, while smaller models may not require such stringent filtering, especially on non-normalized text. This work highlights the importance of adjusting synthetic data augmentation and filtering to specific model architectures and target domains. The proposed method, robust and adaptable, enhances ASR performance across diverse language settings. We have open-sourced the entire work, which includes 140 hours of synthetically generated Portuguese speech, as well as the pipeline and parameter settings used to create these samples. Additionally, we provide the fine-tuned Whisper models and the code required to reproduce this research. Our code will be available at https://github.com/my-north-ai/semantic_audio_filtering.
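The filtering idea in the abstract — scoring each synthetic audio clip against its transcript in a shared embedding space and discarding semantically misaligned pairs — can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the embeddings here are toy NumPy arrays standing in for the outputs of a trained contrastive audio/text encoder, and the `threshold` value of 0.6 is a hypothetical choice (the paper's point is precisely that the optimal filtering aggressiveness depends on model size and dataset).

```python
import numpy as np

def filter_synthetic_samples(audio_emb, text_emb, threshold=0.6):
    """Keep synthetic samples whose audio embedding is semantically close
    (cosine similarity >= threshold) to its transcript embedding.

    audio_emb, text_emb: (n_samples, dim) arrays of paired embeddings.
    Returns a boolean keep-mask and the per-pair similarities."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = np.sum(a * t, axis=1)  # row-wise cosine similarity
    return sims >= threshold, sims

# Toy stand-ins for contrastive-encoder outputs: three well-aligned
# audio/text pairs (text embedding = audio embedding + small noise)
# and two mismatched pairs (independent random vectors).
rng = np.random.default_rng(0)
aligned = rng.normal(size=(3, 8))
audio = np.vstack([aligned, rng.normal(size=(2, 8))])
text = np.vstack([aligned + 0.05 * rng.normal(size=(3, 8)),
                  rng.normal(size=(2, 8))])

keep, sims = filter_synthetic_samples(audio, text, threshold=0.6)
filtered_audio = audio[keep]  # only semantically aligned samples survive
```

In a real pipeline the two embedding matrices would come from the paired encoders of a contrastive model, and the retained samples would be mixed into the ASR fine-tuning set; a stricter threshold mimics the "aggressive filtering" the abstract reports as beneficial for larger models such as Whisper Large V3.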
Pages: 155136-155150
Page count: 15
Related Papers
50 in total
  • [21] Enhancing the Feature Extraction Process for Automatic Speech Recognition with Fractal Dimensions
    Ezeiza, Aitzol
    López de Ipiña, Karmele
    Hernández, Carmen
    Barroso, Nora
    COGNITIVE COMPUTATION, 2013, 5 (04) : 545 - 550
  • [22] Compressing Audio Visual Speech Recognition Models With Parameterized Hypercomplex Layers
    Panagos, Iason Ioannis
    Sfikas, Giorgos
    Nikou, Christophoros
    PROCEEDINGS OF THE 12TH HELLENIC CONFERENCE ON ARTIFICIAL INTELLIGENCE, SETN 2022, 2022,
  • [24] Data mining for generating predictive models of automatic speech recognition
    Al-Zobaydi, AT
    Al-Akaidi, MM
    John, RI
    MESM 2005: 7th Middle East Simulation Multiconference, 2005, : 147 - 150
  • [25] Neural Error Corrective Language Models for Automatic Speech Recognition
    Tanaka, Tomohiro
    Masumura, Ryo
    Masataki, Hirokazu
    Aono, Yushi
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 401 - 405
  • [26] Cochlear Mechanical Models used in Automatic Speech Recognition Tasks
    Oropeza Rodriguez, Jose Luis
    Suarez Guerra, Sergio
    COMPUTACION Y SISTEMAS, 2019, 23 (03): : 1099 - 1114
  • [27] EVALUATION OF SEMANTIC ROLE LABELING AND DEPENDENCY PARSING OF AUTOMATIC SPEECH RECOGNITION OUTPUT
    Favre, Benoit
    Bohnet, Bernd
    Hakkani-Tuer, Dilek
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 5342 - 5345
  • [28] Harmonicity based dereverberation for improving automatic speech recognition performance and speech intelligibility
    Kinoshita, K
    Nakatani, T
    Miyoshi, M
    IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS COMMUNICATIONS AND COMPUTER SCIENCES, 2005, E88A (07) : 1724 - 1731
  • [29] THE INFLUENCE OF AUTOMATIC SPEECH RECOGNITION ACCURACY ON THE PERFORMANCE OF AN AUTOMATED SPEECH ASSESSMENT SYSTEM
    Tao, Jidong
    Evanini, Keelan
    Wang, Xinhao
    2014 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY SLT 2014, 2014, : 294 - 299
  • [30] Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer
    Jin, Weifei
    Cao, Yuxin
    Su, Junjie
    Shen, Qi
    Ye, Kai
    Wang, Derui
    Hao, Jie
    Liu, Ziyao
    PROCEEDINGS OF THE 2ND ACM WORKSHOP ON SECURE AND TRUSTWORTHY DEEP LEARNING SYSTEMS, SECTL 2024, 2024, : 47 - 55