Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

被引:0
作者
Tamura, Satoshi [1 ]
Ninomiya, Hiroshi [2 ]
Kitaoka, Norihide [3 ]
Osuga, Shin [4 ]
Iribe, Yurie [5 ]
Takeda, Kazuya [2 ]
Hayamizu, Satoru [1 ]
机构
[1] Gifu Univ, Gifu, Japan
[2] Nagoya Univ, Nagoya, Aichi 4648601, Japan
[3] Tokushima Univ, Tokushima, Japan
[4] Aisin Seiki Co Ltd, Kariya, Aichi, Japan
[5] Aichi Prefectural Univ, Nagakute, Aichi, Japan
来源
2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA) | 2015年
关键词
D O I
暂无
中图分类号
TM [电工技术]; TN [电子技术、通信技术];
学科分类号
0808 ; 0809 ;
摘要
This paper develops an Audio-Visual Speech Recognition (AVSR) method, by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating effectiveness of voice activity detection in a visual modality. In our approach, many kinds of visual features are incorporated, subsequently converted into bottleneck features by deep learning technology. By using proposed features, we successfully achieved 73.66% lipreading accuracy in speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting 77.80% lipreading accuracy. It is found VAD is useful in both audio and visual modalities, for better lipreading and AVSR.
引用
收藏
页码:575 / 582
页数:8
相关论文
共 50 条
  • [41] An asynchronous DBN for audio-visual speech recognition
    Saenko, Kate
    Livescu, Karen
    2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
  • [42] Audio-Visual Speech Enhancement using Deep Neural Networks
    Hou, Jen-Cheng
    Wang, Syu-Siang
    Lai, Ying-Hui
    Lin, Jen-Chun
    Tsao, Yu
    Chang, Hsiu-Wen
    Wang, Hsin-Min
    2016 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2016,
  • [43] Audio-visual modeling for bimodal speech recognition
    Kaynak, MN
    Zhi, Q
    Cheok, AD
    Sengupta, K
    Chung, KC
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186
  • [44] Bimodal fusion in audio-visual speech recognition
    Zhang, XZ
    Mersereau, RM
    Clements, M
    2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967
  • [45] Performance Improvement of Audio-Visual Speech Recognition with Optimal Reliability Fusion
    Tariquzzaman, Md
    Gyu, Song Min
    Young, Kim Jin
    You, Na Seung
    Rashid, M. A.
    2010 THE 3RD INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INDUSTRIAL APPLICATION (PACIIA2010), VOL III, 2010, : 216 - 219
  • [46] Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition
    Aleksic, PS
    Katsaggelos, AK
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: DESIGN AND IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS INDUSTRY TECHNOLOGY TRACKS MACHINE LEARNING FOR SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING SIGNAL PROCESSING FOR EDUCATION, 2004, : 917 - 920
  • [47] Audio-Visual Deep Clustering for Speech Separation
    Lu, Rui
    Duan, Zhiyao
    Zhang, Changshui
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (11) : 1697 - 1712
  • [48] AUDIO-VISUAL SPEECH INPAINTING WITH DEEP LEARNING
    Morrone, Giovanni
    Michelsanti, Daniel
    Tan, Zheng-Hua
    Jensen, Jesper
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6653 - 6657
  • [49] Multistream sparse representation features for noise robust audio-visual speech recognition
    Shen, Peng
    Tamura, Satoshi
    Hayamizu, Satoru
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2014, 35 (01) : 17 - 27
  • [50] Multimodal Emotion Recognition using Physiological and Audio-Visual Features
    Matsuda, Yuki
    Fedotov, Dmitrii
    Takahashi, Yuta
    Arakawa, Yutaka
    Yasumo, Keiichi
    Minker, Wolfgang
    PROCEEDINGS OF THE 2018 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING AND PROCEEDINGS OF THE 2018 ACM INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (UBICOMP/ISWC'18 ADJUNCT), 2018, : 946 - 951