Exploring the Impact of Data Augmentation Techniques on Automatic Speech Recognition System Development: A Comparative Study

Times cited: 0
Authors
Galic, Jovan [1]
Grozdic, Dorde [2, 3]
Affiliations
[1] Univ Banja Luka, Fac Elect Engn, Patre 5, Banja Luka 78000, Bosnia & Herceg
[2] Grid Dynam, Bulevar Kralja Aleksandra 18, Belgrade 11000, Serbia
[3] Univ Belgrade, Sch Elect Engn, Bulevar Kralja Aleksandra 73, Belgrade 11000, Serbia
Keywords
Artificial Neural Networks; audio databases; Automatic Speech Recognition; Hidden Markov models; Support Vector Machines
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic Speech Recognition (ASR) systems are notorious for performing poorly in adverse conditions; they are highly sensitive to such conditions and show low robustness. Because creating extensive speech databases is costly and time-consuming, improving robustness has become a prominent research area, with a focus on synthetically generating speech data from pre-existing natural speech. This paper examines the impact of standard data augmentation techniques (pitch shift, time stretch, volume control, and their combination) on the accuracy of isolated-word ASR systems. The performance of three machine learning models, namely Hidden Markov Models (HMM), Support Vector Machines (SVM), and Convolutional Neural Networks (CNN), is analyzed on two Serbian corpora of isolated words. The Whi-Spe speech database in neutral phonation is used for augmentation and training, and a purpose-built Python-based software tool performs the augmentation. The experiments demonstrate a statistically significant reduction in the Word Error Rate (WER) for the CNN-based recognizer on both test datasets, achieved with a single augmentation technique based on pitch shifting.
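For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how WER can be computed in the isolated-word setting. It assumes the recognizer outputs exactly one word per utterance, so WER reduces to the percentage of misrecognized utterances. The function name `isolated_word_wer` and the sample word lists are hypothetical illustrations, not taken from the paper, and the abstract does not state which scoring tool the authors used.

```python
def isolated_word_wer(references, hypotheses):
    """Return WER in percent for isolated-word recognition.

    Assumption: each utterance contains exactly one reference word and the
    recognizer returns exactly one hypothesis word, so every error is a
    substitution and WER = 100 * errors / number of utterances.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    if not references:
        raise ValueError("at least one utterance is required")
    # Count utterances where the recognized word differs from the reference.
    errors = sum(1 for ref, hyp in zip(references, hypotheses) if ref != hyp)
    return 100.0 * errors / len(references)


# Hypothetical example: four test utterances, one misrecognized.
refs = ["jedan", "dva", "tri", "cetiri"]
hyps = ["jedan", "dva", "pet", "cetiri"]
wer = isolated_word_wer(refs, hyps)
```

Under these assumptions, comparing this value for a recognizer trained with and without augmented data gives the kind of WER reduction the abstract reports; judging statistical significance requires a separate test that is not sketched here.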
Pages: 3-12
Number of pages: 10