Convolutional recurrent neural network with attention for Vietnamese speech to text problem in the operating room

Cited by: 3
Authors
Dat T.T. [1]
Dang L.T.A. [2]
Sang V.N.T. [1]
Thuy L.N.L. [1]
Bao P.T. [1]
Affiliations
[1] Information Science Faculty, Sai Gon University, HCM City
[2] Faculty of Electrical and Electronics Engineering, University of Technology, HCM City
Keywords
Attention; Bidirectional long short-term memory; BLSTM; CNN; Convolutional neural network; Operating room; Vietnamese speech recognition
DOI
10.1504/IJIIDS.2021.116476
Abstract
We introduce an automatic Vietnamese speech recognition (ASR) system for converting Vietnamese speech to text under real operating room ambient noise recorded during liver surgery. First, we propose combining a convolutional neural network (CNN) with bidirectional long short-term memory (BLSTM) for local speech feature learning, sequence modelling, and transcription. We also extend the CNN-LSTM framework with an attention mechanism that decodes the frames into a sequence of words; the CNN, LSTM, and attention models are combined into a unified architecture. In addition, we combine the connectionist temporal classification (CTC) and attention loss functions in the training phase. The output label sequence length from CTC is applied to the attention-based decoder predictions to produce the final label sequence. This reduces irregular alignments and speeds up label sequence estimation during training and inference, instead of relying solely on the data-driven attention-based encoder-decoder to estimate the label sequence in long sentences. The proposed system is evaluated on a real operating room database. The results show that our method significantly enhances the performance of the ASR system, achieving a 13.05% WER and outperforming standard methods. Copyright © 2021 Inderscience Enterprises Ltd.
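The abstract does not include an implementation; the sketch below is a minimal, hypothetical PyTorch rendering (an assumption, not the authors' code) of the kind of hybrid CTC/attention architecture described: a CNN front end for local feature learning, a BLSTM encoder, an attention-based decoder trained with teacher forcing, and a joint loss that interpolates the CTC and attention (cross-entropy) objectives. All module names, layer sizes, token conventions, and the interpolation weight ctc_weight are illustrative.

```python
# Minimal, hypothetical sketch of a hybrid CTC/attention ASR model with a CNN
# front end and BLSTM encoder. Names, sizes, token ids, and the interpolation
# weight are illustrative assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNBLSTMAttentionASR(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab_size=100,
                 blank_id=0, sos_id=1, pad_id=2):
        super().__init__()
        self.blank_id, self.sos_id, self.pad_id = blank_id, sos_id, pad_id
        # CNN front end: local time-frequency feature learning (time downsampled by 2).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=(1, 2), padding=1), nn.ReLU(),
        )
        # BLSTM encoder for sequence modelling over the CNN features.
        self.encoder = nn.LSTM(32 * (n_mels // 4), hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # CTC branch: frame-wise projection onto the vocabulary (incl. blank).
        self.ctc_proj = nn.Linear(2 * hidden, vocab_size)
        # Attention-based decoder branch (teacher forcing during training).
        self.embed = nn.Embedding(vocab_size, hidden)
        self.dec_rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4,
                                          kdim=2 * hidden, vdim=2 * hidden,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, feats, feat_lens, targets, target_lens, ctc_weight=0.3):
        # feats: (B, T, n_mels) log-mel features; feat_lens: (B,) frame counts;
        # targets: (B, L) label ids, padded with pad_id and without <sos>.
        x = self.cnn(feats.unsqueeze(1))                      # (B, 32, ~T/2, n_mels/4)
        b, c, t, f = x.shape
        enc, _ = self.encoder(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

        # CTC branch: alignment-free frame-level loss.
        log_probs = self.ctc_proj(enc).log_softmax(-1)        # (B, ~T/2, V)
        enc_lens = (feat_lens // 2).clamp(max=t)
        ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), targets,
                              enc_lens, target_lens,
                              blank=self.blank_id, zero_infinity=True)

        # Attention decoder branch: predict each label from the shifted label
        # history plus an attention context over the encoder frames.
        sos = targets.new_full((b, 1), self.sos_id)
        dec_h, _ = self.dec_rnn(self.embed(torch.cat([sos, targets[:, :-1]], dim=1)))
        context, _ = self.attn(dec_h, enc, enc)               # (B, L, hidden)
        logits = self.out(torch.cat([dec_h, context], dim=-1))
        att_loss = F.cross_entropy(logits.transpose(1, 2), targets,
                                   ignore_index=self.pad_id)

        # Joint objective: interpolate the CTC and attention losses.
        return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss
```

The CTC output-length constraint that the abstract describes applying to the attention decoder's predictions is not shown here; this sketch covers only the joint training objective.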
Pages: 294-314
Page count: 20