TIME-DOMAIN AUDIO-VISUAL SPEECH SEPARATION ON LOW QUALITY VIDEOS

Cited by: 6
Authors
Wu, Yifei [1 ]
Li, Chenda [1 ]
Bai, Jinfeng [2 ]
Wu, Zhongqin [2 ]
Qian, Yanmin [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Shanghai, Peoples R China
[2] TAL Educ Grp, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Funding
National Key Research and Development Program of China
Keywords
Audio-visual; Speech Separation; Low Quality Video; Attention
DOI
10.1109/ICASSP43922.2022.9746866
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Incorporating visual information is a promising way to improve the performance of speech separation, and many related works have reported encouraging results. However, low-quality videos are common in real scenarios and can significantly degrade a typical audio-visual speech separation system. In this paper, we propose a new structure for fusing audio and visual features, in which the audio feature selects relevant visual features through an attention mechanism. A Conv-TasNet-based model is combined with the proposed attention-based multi-modal fusion, trained with suitable data augmentation, and evaluated on three categories of low-quality videos. Experimental results show that, whether trained on normal or low-quality data, our system outperforms a baseline that simply concatenates the audio and visual features, and it remains robust to low-quality video inputs at inference time.
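To make the fusion idea concrete, below is a minimal PyTorch sketch of the kind of audio-query attention the abstract describes: the audio feature sequence provides the queries, the visual feature sequence provides the keys and values, so frames from a degraded video can receive low attention weights. The module name, feature dimensions, and the final concatenation step are illustrative assumptions, not the authors' published architecture.

import torch
import torch.nn as nn

class AudioQueryVisualAttention(nn.Module):
    # Sketch of attention-based audio-visual fusion (assumed design,
    # not the paper's exact model): audio frames attend over visual
    # frames, so low-quality video frames can be down-weighted.
    def __init__(self, audio_dim=256, visual_dim=512, attn_dim=128):
        super().__init__()
        self.q = nn.Linear(audio_dim, attn_dim)    # queries from audio
        self.k = nn.Linear(visual_dim, attn_dim)   # keys from video
        self.v = nn.Linear(visual_dim, attn_dim)   # values from video
        self.scale = attn_dim ** -0.5

    def forward(self, audio, visual):
        # audio:  (B, T_a, audio_dim), e.g. Conv-TasNet bottleneck frames
        # visual: (B, T_v, visual_dim), e.g. a lip-embedding sequence
        q, k, v = self.q(audio), self.k(visual), self.v(visual)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        selected = attn @ v                        # (B, T_a, attn_dim)
        # Concatenate along the feature axis, mirroring the baseline's
        # concatenation but with attention-selected visual features.
        return torch.cat([audio, selected], dim=-1)

# Usage with hypothetical shapes (25 fps video vs. 100 Hz audio frames):
fusion = AudioQueryVisualAttention()
a = torch.randn(2, 400, 256)
v = torch.randn(2, 100, 512)
print(fusion(a, v).shape)   # torch.Size([2, 400, 384])

Because the softmax runs over the visual time axis, the separator can also interpolate across neighboring video frames, which is one plausible reason such a fusion degrades more gracefully than plain concatenation when the video input is unreliable.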
Pages: 256 - 260 (5 pages)