Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need

Times cited: 0
Authors
Kuzdeuov, Askat [1]
Nurgaliyev, Shakhizat [1]
Turmakhan, Diana [1]
Laiyk, Nurkhan [1]
Varol, Huseyin Atakan [1]
Affiliations
[1] Nazarbayev Univ, Inst Smart Syst & AI, Astana, Kazakhstan
Source
2023 3RD INTERNATIONAL CONFERENCE ON ROBOTICS, AUTOMATION AND ARTIFICIAL INTELLIGENCE, RAAI 2023 | 2023
Keywords
Speech commands recognition; text-to-speech; Kazakh Speech Corpus; voice commands; data-centric AI
DOI
10.1109/RAAI59955.2023.10601292
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Speech Command Recognition (SCR) is rapidly gaining prominence due to its diverse applications, such as virtual assistants, smart homes, hands-free navigation, and voice-controlled industrial machinery. In this paper, we present a data-centric approach to creating SCR systems for low-resource languages, focusing on the Kazakh language. By leveraging synthetic data generated by Text-to-Speech (TTS) and data extracted from a large-scale speech corpus, we created the Kazakh-language equivalent of the Google Speech Commands dataset. We also compiled the Kazakh Speech Commands dataset from data collected from 119 participants. This dataset was used to benchmark the performance of the Keyword-MLP model trained on our synthetic dataset. The model achieves 89.79% accuracy on the real-world data, demonstrating the efficacy of our approach. Our work can serve as a recipe for creating customized speech command datasets, including for low-resource languages, obviating the need for laborious and costly human data collection.
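
The abstract does not describe how command words are extracted from the speech corpus. As a minimal sketch of one possible approach, the Python below assumes that word-level time alignments are already available for each corpus utterance and selects one-second windows centred on each occurrence of a command word. The `AlignedWord` type, the `keyword_windows` and `to_sample_range` helpers, the 16 kHz sampling rate, and the example words are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

SAMPLE_RATE = 16_000   # assumed sampling rate in Hz, not stated in the abstract
CLIP_SECONDS = 1.0     # assumed clip length, mirroring one-second command clips


@dataclass(frozen=True)
class AlignedWord:
    """One word of a corpus utterance with its (assumed) time alignment."""
    utterance_id: str
    word: str
    start: float  # seconds from the start of the utterance
    end: float    # seconds from the start of the utterance


def keyword_windows(words, keywords, clip_seconds=CLIP_SECONDS):
    """Return (utterance_id, keyword, start, end) windows centred on keyword hits."""
    wanted = {k.lower() for k in keywords}
    windows = []
    for w in words:
        key = w.word.lower()
        if key not in wanted:
            continue
        if w.end - w.start > clip_seconds:
            continue  # the spoken word does not fit into a single clip
        centre = (w.start + w.end) / 2
        # Clamp at zero so the window never starts before the utterance does.
        start = max(0.0, centre - clip_seconds / 2)
        windows.append((w.utterance_id, key, start, start + clip_seconds))
    return windows


def to_sample_range(start, end, sample_rate=SAMPLE_RATE):
    """Convert a window in seconds to integer sample indices."""
    return round(start * sample_rate), round(end * sample_rate)


if __name__ == "__main__":
    # Hypothetical alignments; real ones would come from a forced aligner.
    corpus = [
        AlignedWord("u1", "алға", 0.2, 0.6),
        AlignedWord("u1", "бар", 0.6, 0.9),
        AlignedWord("u2", "Алға", 2.0, 2.4),
    ]
    for utt, key, start, end in keyword_windows(corpus, ["алға"]):
        print(utt, key, to_sample_range(start, end))
```

The resulting sample ranges could then be used to slice the corresponding waveforms into fixed-length clips. A window that runs past the end of a short utterance would still need padding, which this sketch leaves out.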
Pages: 286-291
Page count: 6