Audio-Visual Speech Recognition Using A Two-Step Feature Fusion Strategy

Cited by: 9
Authors
Liu, Hong [1]
Xu, Wanlu [1]
Yang, Bing [1]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Shenzhen, Guangdong, Peoples R China
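Source
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021
Funding
National Natural Science Foundation of China
Keywords
speech recognition; feature fusion; non-local
DOI
10.1109/ICPR48806.2021.9412454
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract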
Lip-reading methods and fusion strategies are crucial for audio-visual speech recognition. In recent years, most approaches have used two separate audio and visual streams with an early or late fusion strategy. Such single-stage fusion may fail to guarantee both the integrity and the representativeness of the fused information. This paper extends a traditional single-stage fusion network to a two-step feature fusion network by adding an audio-visual early feature fusion (AV-EFF) stream to the baseline model. The method learns fusion information at different stages, preserving the original features as much as possible while keeping the different features independent. In addition, to capture long-range dependencies in the video, a non-local block is added to the feature extraction part of the visual stream (NL-Visual) to obtain long-term spatio-temporal features. Experimental results on the two largest public datasets in English (LRW) and Mandarin (LRW-1000) demonstrate that our method outperforms other state-of-the-art methods.
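The abstract does not spell out the network details, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes a simplified non-local operation (dot-product similarity with identity embeddings and a residual connection) and assumes that both fusion steps are plain frame-wise concatenation. The names `non_local`, `two_step_fusion`, and `encode` are hypothetical, and `encode` stands in for whatever per-stream encoder a real model would use.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def non_local(features):
    """Simplified non-local operation (assumption: identity embeddings).

    features: list of N position vectors, where N would be the flattened
    spatio-temporal positions of the visual feature map.
    Returns z_i = x_i + sum_j softmax_j(x_i . x_j) * x_j, so every
    position aggregates information from all other positions.
    """
    out = []
    for x_i in features:
        weights = softmax([dot(x_i, x_j) for x_j in features])
        y_i = [
            sum(w * x_j[c] for w, x_j in zip(weights, features))
            for c in range(len(x_i))
        ]
        out.append([a + b for a, b in zip(x_i, y_i)])  # residual connection
    return out


def two_step_fusion(audio, visual, encode=lambda seq: seq):
    """Two-step fusion sketch (assumption: concatenation at both steps).

    audio, visual: time-aligned lists of per-frame feature vectors.
    """
    # Step 1: early fusion feeds a third, audio-visual stream.
    early = [a + v for a, v in zip(audio, visual)]
    # Placeholder encoders; a real model would use a separate network per stream.
    a_out, v_out, av_out = encode(audio), encode(visual), encode(early)
    # Step 2: late fusion combines all three streams frame by frame.
    return [a + v + av for a, v, av in zip(a_out, v_out, av_out)]


visual = non_local([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
audio = [[0.5], [0.2], [0.1]]
fused = two_step_fusion(audio, visual)
print(len(fused), len(fused[0]))
```

With the identity placeholder for `encode`, each fused frame simply stacks the audio features, the visual features, and their early-fused copy, which makes it easy to see that the original single-modality features survive alongside the jointly fused ones.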
Pages: 1896-1903
Page count: 8