A robust visual feature extraction based BTSM-LDA for audio-visual speech recognition

Cited by: 0
Authors
Lv, Guoyun [1 ]
Zhao, Rongchun [1 ]
Jiang, Dongmei [1 ]
Li, Yan [1 ]
Sahli, H. [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
[2] Vrije Univ Brussel, Dept ETRO, B-1050 Brussels, Belgium
Keywords
dynamic Bayesian networks; Bayesian tangent shape model; audio-visual; speech recognition;
DOI
Not available
CLC Classification Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
The asynchrony between speech and lip movement is a key problem in audio-visual speech recognition (AVSR) systems. A Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model is proposed for audio-visual speech recognition. Compared with the Multi-Stream HMM (MSHMM), the MS-ADBN model describes the asynchrony between the audio stream and the visual stream down to the word level. In addition, based on the lip contour obtained with a Bayesian Tangent Shape Model (BTSM), Linear Discriminant Analysis (LDA) is used for visual feature extraction, which captures the dynamics of the lip and removes the redundancy of the lip geometric features. Experimental results on a continuous-digit audio-visual database show that the lip dynamic feature based on BTSM and LDA is more stable and robust than the raw lip geometric feature. In noisy environments with signal-to-noise ratios ranging from 0 dB to 30 dB, the MS-ADBN model with MFCC and LDA visual features achieves an average improvement of 4.92% in speech recognition rate over the MSHMM.
Pages: 1044 - +
Page count: 2
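As a rough illustration of the visual front-end described in the abstract, the sketch below applies Linear Discriminant Analysis to lip geometric feature vectors, assuming the lip contour has already been fitted by a BTSM-style shape model. This is not the authors' implementation: the feature dimensions, class labels, and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's code): LDA-based
# dimensionality reduction of lip geometric features.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical training data: each row is a per-frame lip geometric
# feature vector (e.g., widths, heights, areas derived from BTSM
# landmarks); each label is the viseme/word class of that frame.
n_frames, n_geom_features, n_classes = 500, 20, 10
lip_features = rng.normal(size=(n_frames, n_geom_features))
frame_labels = rng.integers(0, n_classes, size=n_frames)

# LDA projects the redundant geometric features onto directions that
# best separate the classes, yielding a compact visual feature stream
# that can be combined with audio MFCCs in a multi-stream model.
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
visual_features = lda.fit_transform(lip_features, frame_labels)

print(visual_features.shape)  # (500, 9): low-dimensional visual features
```

In practice the class labels would come from forced alignment of the training utterances, and the projected features would form the visual stream fed to the MS-ADBN model.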