CNN-based speech segments endpoints detection framework using short-time signal energy features

Cited by: 2
Authors
Ahmed G. [1]
Lawaye A.A. [1]
Affiliation
[1] Baba Ghulam Shah Badshah University, J&K, Rajouri
Keywords
CNN-BiLSTM; Endpoint detection; Segmentation framework; Speech segmentation
DOI
10.1007/s41870-023-01466-6
Abstract
The quality of speech recognition systems has improved, with the focus shifting from short-utterance scenarios such as voice assistants and voice search to extended-utterance scenarios such as voice input and meeting transcription. In short-utterance settings, speech endpointing plays a crucial role in perceived latency and user experience. In extended-utterance settings, the prime goal is to generate well-formatted, highly readable transcriptions that can substitute for keyboard typing in important, permanent tasks such as writing e-mails or text documents; punctuation and capitalization become as important as recognition errors. For long utterances, valuable processing time, bandwidth, and other resources can be conserved by disregarding unnecessary portions of the audio signal, which ultimately enhances system throughput. In this study, we develop a framework called Speech Segments Endpoint Detection, which utilizes short-time signal energy features, a simple Mel-spectrogram, and a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model for classification. We conducted experiments with the CNN-BiLSTM classification model on a 35-hour audio dataset comprising 16 hours of speech and 19 hours of audio containing music and noise, split into training and validation sets in an 80:20 ratio. The model attained an accuracy of 98.67% on the training set and 93.62% on the validation set. © 2023, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
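The pipeline outlined in the abstract, short-time energy and a Mel-spectrogram front end feeding a CNN-BiLSTM classifier, can be assembled roughly as follows. This is a minimal sketch, assuming librosa for feature extraction and TensorFlow/Keras for the model; the frame sizes, layer widths, and two-class speech/non-speech output are illustrative choices, not the authors' exact configuration.

# Minimal sketch of the described pipeline. Library choices (librosa, Keras),
# frame sizes, layer widths and the two-class output are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
import librosa
from tensorflow.keras import layers, models

def short_time_energy(y, frame_length=400, hop_length=160):
    # Frame-wise short-time energy: sum of squared samples per frame.
    frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
    return np.sum(frames ** 2, axis=0)

def log_mel_spectrogram(y, sr=16000, n_mels=64, n_fft=400, hop_length=160):
    # Log-scaled Mel spectrogram, shape (n_mels, n_frames).
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                          n_fft=n_fft, hop_length=hop_length)
    return librosa.power_to_db(spec, ref=np.max)

def build_cnn_bilstm(n_mels=64, n_frames=100, n_classes=2):
    # CNN front end over the spectrogram, then a BiLSTM over the time axis.
    inputs = layers.Input(shape=(n_mels, n_frames, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Move time to the sequence axis and flatten (mel, channel) per time step.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

Frame-level speech/non-speech decisions from such a model, combined with the short-time energy contour, would then be thresholded and smoothed into segment endpoints; the exact decision logic used in the paper may differ.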
Pages: 4179-4191
Number of pages: 12