Mutual Alignment between Audiovisual Features for End-to-End Audiovisual Speech Recognition

Cited by: 5
Authors
Liu, Hong [1 ]
Wang, Yawei [1 ]
Yang, Bing [1 ]
Affiliations
[1] Peking Univ, Key Lab Machine Percept, Shenzhen Grad Sch, Shenzhen, Peoples R China
Source
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021
Funding
National Natural Science Foundation of China;
Keywords
multimodal alignment; audio visual speech recognition; mutual iterative attention;
DOI
10.1109/ICPR48806.2021.9412349
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The asynchrony between modalities is one of the major problems in audio-visual speech recognition (AVSR) research. However, most AVSR systems merely rely on upsampling the video or downsampling the audio to align the audio and visual features, assuming that the feature sequences are aligned frame by frame. These pre-processing steps oversimplify the asynchrony relation between the acoustic signal and lip motion, lacking flexibility and impairing the performance of the system. Although there are systems that model the asynchrony between the modalities, they sometimes fail to align speech and video precisely under some, or even all, noisy conditions. In this paper, we propose a mutual feature alignment method for AVSR that makes full use of cross-modal information to address the asynchrony issue by introducing a Mutual Iterative Attention (MIA) mechanism. Our method automatically learns an alignment in a mutual way by performing mutual attention iteratively between the audio and visual features, relying on a modified Transformer encoder structure. Experimental results show that our proposed method obtains absolute improvements of up to 20.42% over the audio modality alone, depending on the signal-to-noise-ratio (SNR) level. Better recognition performance is also achieved compared with the traditional feature concatenation method under both clean and noisy conditions. We expect that our proposed mutual feature alignment method can be easily generalized to other multimodal tasks with semantically correlated information.
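The record does not include the paper's implementation, so the following is a minimal PyTorch sketch of what iterative mutual cross-attention between audio and visual feature sequences can look like, in the spirit of the MIA mechanism described in the abstract. The class names, the iteration count num_iters, and all dimensions are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """One Transformer-style block whose attention queries come from one
    modality and whose keys/values come from the other (an assumed design)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Attend from `query` (one modality) over `context` (the other), so
        # each query frame gathers the context frames it best aligns with.
        attended, _ = self.attn(query, context, context)
        x = self.norm1(query + attended)
        return self.norm2(x + self.ffn(x))


class MutualIterativeAttention(nn.Module):
    """Alternate audio-to-visual and visual-to-audio cross-attention for a
    fixed number of iterations so the two streams align each other."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, num_iters: int = 3):
        super().__init__()
        self.a2v = nn.ModuleList(
            CrossAttentionBlock(d_model, n_heads) for _ in range(num_iters)
        )
        self.v2a = nn.ModuleList(
            CrossAttentionBlock(d_model, n_heads) for _ in range(num_iters)
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (batch, T_a, d_model); video: (batch, T_v, d_model).
        # T_a and T_v may differ: no frame-level up-/down-sampling is
        # required before the alignment is learned.
        for a2v, v2a in zip(self.a2v, self.v2a):
            audio = a2v(audio, video)  # audio queries attend over video
            video = v2a(video, audio)  # video queries attend over audio
        return audio, video


if __name__ == "__main__":
    mia = MutualIterativeAttention()
    a = torch.randn(2, 100, 256)  # e.g. 100 audio frames
    v = torch.randn(2, 25, 256)   # e.g. 25 video frames at a lower rate
    a_aligned, v_aligned = mia(a, v)
    print(a_aligned.shape, v_aligned.shape)

Because each cross-attention step queries one stream against the other, the two sequences may keep different lengths throughout, which is precisely the flexibility the abstract contrasts with fixed frame-by-frame resampling.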
Pages: 5348-5353
Number of pages: 6