Connectionism based audio-visual speech recognition method

Cited by: 0
Authors
Che, Na [1 ,2 ,3 ]
Zhu, Yi-Ming [1 ]
Zhao, Jian [1 ,2 ,3 ]
Sun, Lei [1 ]
Shi, Li-Juan [2 ,3 ,4 ]
Zeng, Xian-Wei [1 ]
Affiliations
[1] School of Computer Science and Technology, Changchun University, Changchun
[2] Jilin Provincial Key Laboratory of Human Health State Identification and Function Enhancement, Changchun University, Changchun
[3] Key Laboratory of Intelligent Rehabilitation and Barrier-Free Access for the Disabled, Ministry of Education, Changchun University, Changchun
[4] School of Electronic and Information Engineering, Changchun University, Changchun
Source
Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition) | 2024 / Vol. 54 / No. 10
Keywords
audio-visual speech recognition; computer application technology; connectionism; deep learning
DOI
10.13229/j.cnki.jdxbgxb.20240209
Abstract
Aiming at the problems of large data demand, audio-video data alignment, and noise robustness in audio-visual speech recognition technology, this paper analyzes in depth the features and advantages of four core model types: connectionist temporal classification (CTC), long short-term memory (LSTM), Transformer, and Conformer. It summarizes the scenarios to which each model is suited and proposes ideas and methods for optimizing model performance. Model performance is then quantitatively analyzed on mainstream datasets using commonly used evaluation criteria. The results show that CTC exhibits large performance fluctuations under noisy conditions, LSTM can effectively capture long temporal dependencies, and Transformer and Conformer can significantly reduce the recognition error rate in cross-modal tasks. Finally, future research directions are envisioned at the levels of self-supervised training and noise robustness. © 2024 Editorial Board of Jilin University. All rights reserved.
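The connectionist temporal classification objective named in the abstract aligns unsegmented label sequences to frame-level network outputs without requiring pre-aligned audio-visual data. A minimal illustrative sketch (not code from the paper; all shapes and sizes below are arbitrary assumptions) using PyTorch's built-in `nn.CTCLoss`:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only:
# T = number of input frames, N = batch size, C = output classes
# (here 27 labels plus the mandatory CTC blank symbol at index 0).
T, N, C = 50, 4, 28

# Stand-in for a recognizer's frame-level outputs: CTCLoss expects
# log-probabilities of shape (T, N, C).
log_probs = nn.functional.log_softmax(torch.randn(T, N, C), dim=2)

# Unaligned target label sequences (indices 1..C-1; 0 is reserved for blank).
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# CTC marginalizes over all frame-to-label alignments, which is why no
# explicit audio-video-to-transcript alignment is needed at training time.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```

The resulting scalar `loss` is differentiable, so it can drive end-to-end training of the acoustic or audio-visual front end directly from transcripts.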
Pages: 2984-2993
Page count: 9
References
44 in total
[1]  
Ibrahim T W S, A review of audio-visual speech recognition, Journal of Telecommunication, Electronic and Computer Engineering, 10, 1-4, pp. 35-40, (2018)
[2]  
Su Rong-feng, Research on speech recognition system under multiple influencing factors, (2020)
[3]  
Tamura S, Ninomiya H, Kitaoka N, Et al., Audio-visual speech recognition using deep bottleneck features and high-performance lipreading[C], Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 575-582, (2015)
[4]  
Zeng Z, Tu J, Pianfetti B, Et al., Audio-visual affect recognition through multi-stream fused HMM for HCI [C], IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp. 967-972, (2005)
[5]  
Wei Bin, Analysis of the integration path of symbolism and connectionism of artificial intelligence, Study of Dialectics of Nature, 38, 2, pp. 23-29, (2022)
[6]  
Zhang B, Zhu J, Su H., Toward the third generation artificial intelligence, Science China Information Sciences, 66, 2, pp. 1-19, (2023)
[7]  
Jiao Li-cheng, Yang Shu-yuan, Liu Fang, Et al., Seventy years of neural networks: retrospect and prospect, Chinese Journal of Computers, 39, 8, pp. 1697-1716, (2016)
[8]  
Ivanko D, Ryumin D, Karpov A., A review of recent advances on deep learning methods for audio-visual speech recognition, Mathematics, 11, 12, (2023)
[9]  
Wang D, Wang X D, Lyu S H., An overview of end-to-end automatic speech recognition, Symmetry, 11, 8, (2019)
[10]  
Yu W, Zeiler S, Kolossa D., Fusing information streams in end-to-end audio-visual speech recognition [C], IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3430-3434, (2021)