Effect of background Indian music on performance of speech recognition models for Hindi databases

Cited by: 0
Authors
Arvind Kumar
S. S. Solanki
Mahesh Chandra
Affiliation
[1] Birla Institute of Technology, Department of Electronics and Communication Engineering
Source
International Journal of Speech Technology | 2023 / Vol. 26
Keywords
Multimedia signal processing; Robust ASR; Background music; Feature enhancement
DOI
Not available
Abstract
Multimedia content analysis has attracted great interest over the past few decades. One task that has received considerable attention from researchers is automatic speech recognition (ASR) of speech data from broadcast radio and TV programs. However, the presence of background music in such data heavily degrades the performance of ASR models. In this paper, we first study the temporal and spectral properties of music samples recorded from five different Indian instruments. Then, to assess the effect of background Indian music on the recognition performance of ASR models for Hindi databases, we train the models on both isolated and continuous speech databases, each in a clean and a noisy version. This gives four scenarios: 1. Clean Isolated Database, 2. Noisy Isolated Database, 3. Clean Continuous Database, 4. Noisy Continuous Database. The variation in ASR performance was observed for different SNR levels of background music (0–30 dB). The background music was mixed with clean speech signals in two ways: independently, using the sound of a single instrument, and in combination, using sounds from several instruments mixed together. Overall, maximum degradation in ASR performance is observed for background noise generated from audio samples of Been, with an average WER of 13.37 and 72.21 for the isolated and continuous text models, whereas minimum degradation is observed for background noise generated from audio samples of Harmonium and Flute, with a WER of 15.25 and 66.09 for the isolated and continuous text models, respectively. We further correlated the observed ASR performance with the temporal and spectral properties of the music signals and found that higher values of zero crossing rate, roll-off rate, spectral centroid and spectral flux indicated greater degradation in ASR performance. Hence, these features give more important cues for understanding the background noise than other features such as spectral entropy and short-term energy. The work presented in this paper should support a better understanding of music compensation algorithms focused on the Indian market.
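The abstract names three quantities that are straightforward to illustrate: mixing music into speech at a target SNR, the zero crossing rate of a frame, and the word error rate (WER). The following is a minimal sketch using only the Python standard library, not the authors' implementation. The function names are hypothetical, SNR is assumed to be the ratio of average speech power to average music power over the whole signal, and WER is assumed to be reported in percent.

```python
import math


def mix_at_snr(speech, music, snr_db):
    """Add music to speech after scaling it to the requested SNR (in dB).

    Assumption: SNR is the ratio of average speech power to average music
    power, measured over the whole signal.
    """
    if len(speech) != len(music):
        raise ValueError("speech and music must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_music = sum(m * m for m in music) / len(music)
    if p_music == 0:
        raise ValueError("music signal is silent")
    # Solve snr_db = 10 * log10(p_speech / (gain**2 * p_music)) for gain.
    gain = math.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
    return [s + gain * m for s, m in zip(speech, music)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    sr = 8000  # illustrative sampling rate, not taken from the paper
    speech = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
    music = [0.3 * math.sin(2 * math.pi * 1000 * n / sr) for n in range(sr)]
    for snr in (0, 10, 20, 30):
        mixed = mix_at_snr(speech, music, snr)
        print(snr, round(zero_crossing_rate(mixed), 4))
```

Sweeping `snr_db` over the 0–30 dB range mentioned in the abstract would give one noisy copy of each utterance per level. A real pipeline would also need to loop or trim the music to match the speech length, which this sketch leaves out.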
Pages: 1153-1164
Page count: 11