A robust visual feature extraction based BTSM-LDA for audio-visual speech recognition

被引:0
|
作者
Lv, Guoyun [1 ]
Zhao, Rongchun [1 ]
Jiang, Dongmei [1 ]
Li, Yan [1 ]
Sahli, H. [2 ]
机构
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Vrije Univ Brussel, Dept ETRO, B-1050 Brussels, Belgium
关键词
dynamic Bayesian networks; Bayesian tangent shape model; audio-visual; speech recognition;
D O I
暂无
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
the asynchrony for speech and lip movement is key problem of audio-visual speech recognition (AVSR) system. A Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is proposed for audio-visual speech recognition. Comparing with Multi-Stream HMM (MSE[MM), MS-ADBN model describes the asynchrony of audio stream and visual stream to the word level. Simultaneously, based on profile of lip implemented by using Bayesian Tangent Shape Model (BTSM), Linear Discrimination Analysis (LDA) is used for visual feature extraction which describes the dynamic feature of lip and removes the redundancy of lip geometrical feature. The experiments results on continuous digit audio-visual database show that Up dynamic feature based on BTSM and LDA is more stable and robust than direct lip geometrical feature. In the noisy environments with signal to noise ratios ranging from 0dB to 30dB, comparing with MSHMM, MS-ADBN model with MFCC and LDA visual features has an average improvement of 4.92% in speech recognition rate.
引用
收藏
页码:1044 / +
页数:2
相关论文
共 50 条
  • [41] Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition
    Wei, Liangfa
    Zhang, Jie
    Hou, Junfeng
    Dai, Lirong
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 638 - 643
  • [42] A ROBUST AUDIO-VISUAL SPEECH ENHANCEMENT MODEL
    Wang, Wupeng
    Xing, Chao
    Wang, Dong
    Chen, Xiao
    Sun, Fengyu
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7529 - 7533
  • [43] Cross-Domain Deep Visual Feature Generation for Mandarin Audio-Visual Speech Recognition
    Su, Rongfeng
    Liu, Xunying
    Wang, Lan
    Yang, Jingzhou
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 185 - 197
  • [44] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076
  • [45] A coupled HMM for audio-visual speech recognition
    Nefian, AV
    Liang, LH
    Pi, XB
    Xiaoxiang, L
    Mao, C
    Murphy, K
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 2013 - 2016
  • [46] Robust Audio-Visual Speech Recognition Under Noisy Audio-Video Conditions
    Stewart, Darryl
    Seymour, Rowan
    Pass, Adrian
    Ming, Ji
    IEEE TRANSACTIONS ON CYBERNETICS, 2014, 44 (02) : 175 - 184
  • [47] An asynchronous DBN for audio-visual speech recognition
    Saenko, Kate
    Livescu, Karen
    2006 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, 2006, : 154 - +
  • [48] Audio-visual modeling for bimodal speech recognition
    Kaynak, MN
    Zhi, Q
    Cheok, AD
    Sengupta, K
    Chung, KC
    2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5: E-SYSTEMS AND E-MAN FOR CYBERNETICS IN CYBERSPACE, 2002, : 181 - 186
  • [49] Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
    Hong, Joanna
    Kim, Minsu
    Choi, Jeongsoo
    Ro, Yong Man
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 18783 - 18794
  • [50] Bimodal fusion in audio-visual speech recognition
    Zhang, XZ
    Mersereau, RM
    Clements, M
    2002 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL I, PROCEEDINGS, 2002, : 964 - 967