CNN-based speech segments endpoints detection framework using short-time signal energy features

Cited by: 2
Authors
Ahmed G. [1 ]
Lawaye A.A. [1 ]
Affiliations
[1] Baba Ghulam Shah Badshah University, J&K, Rajouri
关键词
CNN-BiLSTM; Endpoint detection; Segmentation framework; Speech segmentation;
DOI
10.1007/s41870-023-01466-6
Abstract
The quality of speech recognition systems has improved, with focus shifting from short-utterance scenarios such as voice assistants and voice search to extended-utterance settings such as voice input and meeting transcription. In short-utterance setups, speech endpointing plays a crucial role in perceived latency and user experience. For long-form settings, the primary goal is to generate well-formatted, highly readable transcriptions that can replace keyboard typing for important permanent tasks such as writing e-mails or text documents; here, punctuation and capitalization become as critical as recognition errors. For long utterances, valuable processing time, bandwidth, and other resources can be conserved by disregarding unnecessary portions of the audio signal, ultimately enhancing system throughput. In this study, we develop a framework called Speech Segments Endpoint Detection, which utilizes short-time signal energy features, a simple Mel-spectrogram, and a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model for classification. We conducted experiments with our CNN-BiLSTM classification model on a 35-h audio dataset comprising 16 h of speech data and 19 h of audio containing music and noise. The dataset was split into training and validation sets in an 80:20 ratio. Our model attained an accuracy of 98.67% on the training set and 93.62% on the validation set. © 2023, The Author(s), under exclusive licence to Bharati Vidyapeeth's Institute of Computer Applications and Management.
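The abstract names the ingredients of the pipeline (short-time signal energy, a Mel-spectrogram, a CNN-BiLSTM classifier) but not the exact architecture or framing parameters. The sketch below is a minimal illustration of such a pipeline, assuming a 16 kHz signal, 25 ms frames with a 10 ms hop, 64 Mel bands, and small illustrative layer sizes; none of these choices are taken from the paper.

```python
# Minimal sketch of a short-time-energy + Mel-spectrogram + CNN-BiLSTM pipeline.
# Frame lengths, layer sizes, and class count are illustrative assumptions only,
# not the authors' published configuration.
import numpy as np
import torch
import torch.nn as nn

def short_time_energy(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Per-frame energy E_t = sum_n x[n]^2 over each frame (25 ms / 10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop][:n_frames]
    return (frames ** 2).sum(axis=1)

class CNNBiLSTM(nn.Module):
    """CNN front-end over the Mel-spectrogram, BiLSTM over time, frame-wise speech/non-speech logits."""
    def __init__(self, n_mels: int = 64, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency only, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(64 * (n_mels // 4), hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, time)
        z = self.cnn(mel)                                  # (batch, 64, n_mels//4, time)
        b, c, f, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b, t, c * f)     # (batch, time, features)
        out, _ = self.lstm(z)
        return self.head(out)                              # (batch, time, n_classes)

# Usage with a random 1-second 16 kHz signal and a placeholder Mel-spectrogram:
x = np.random.randn(16000).astype(np.float32)
energy = short_time_energy(x)          # crude gate: frames with negligible energy can be skipped
mel = torch.randn(1, 1, 64, 100)       # stand-in for a real Mel-spectrogram (e.g. from librosa)
logits = CNNBiLSTM()(mel)
print(energy.shape, logits.shape)      # (98,)  torch.Size([1, 100, 2])
```

In this reading, frames whose short-time energy falls below a threshold are gated out before classification, which is the resource saving the abstract describes for long utterances; the CNN-BiLSTM then labels the remaining frames as speech or non-speech, and segment endpoints are read off the label boundaries.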
Pages: 4179–4191
Page count: 12