Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Cited by: 0

Authors:

Tamura, Satoshi [1]

Ninomiya, Hiroshi [2]

Kitaoka, Norihide [3]

Osuga, Shin [4]

Iribe, Yurie [5]

Takeda, Kazuya [2]

Hayamizu, Satoru [1]

Affiliations:

[1] Gifu Univ, Gifu, Japan

[2] Nagoya Univ, Nagoya, Aichi 4648601, Japan

[3] Tokushima Univ, Tokushima, Japan

[4] Aisin Seiki Co Ltd, Kariya, Aichi, Japan

[5] Aichi Prefectural Univ, Nagakute, Aichi, Japan

Source:

2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA) | 2015

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

This paper develops an Audio-Visual Speech Recognition (AVSR) method by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of Voice Activity Detection (VAD) in the visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features by deep learning. Using the proposed features, we achieved 73.66% lipreading accuracy in a speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. We find that VAD is useful in both audio and visual modalities for better lipreading and AVSR.

Pages: 575-582

Page count: 8
