TIME-DOMAIN AUDIO-VISUAL SPEECH SEPARATION ON LOW QUALITY VIDEOS

Cited by: 6
Authors
Wu, Yifei [1 ]
Li, Chenda [1 ]
Bai, Jinfeng [2 ]
Wu, Zhongqin [2 ]
Qian, Yanmin [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Shanghai, Peoples R China
[2] TAL Educ Grp, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Funding
National Key Research and Development Program of China
Keywords
Audio-visual; Speech Separation; Low Quality Video; Attention
DOI
10.1109/ICASSP43922.2022.9746866
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Incorporating visual information is a promising way to improve the performance of speech separation, and many related works have reported encouraging results. However, low-quality videos are common in real scenarios and can significantly degrade a typical audio-visual speech separation system. In this paper, we propose a new structure for fusing audio and visual features, in which the audio feature selects relevant visual features through an attention mechanism. A Conv-TasNet-based model is combined with the proposed attention-based multi-modal fusion, trained with suitable data augmentation, and evaluated on three categories of low-quality videos. Experimental results show that, whether trained on normal or low-quality data, our system outperforms a baseline that simply concatenates the audio and visual features, and it remains robust to low-quality video inputs at inference time.
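To make the fusion idea concrete, below is a minimal PyTorch sketch of the kind of audio-query attention the abstract describes: the audio feature sequence provides the queries, the visual feature sequence provides the keys and values, so frames from a degraded video can receive low attention weights. The module name, feature dimensions, and the final concatenation step are illustrative assumptions, not the authors' published architecture.

import torch
import torch.nn as nn

class AudioQueryVisualAttention(nn.Module):
    # Sketch of attention-based audio-visual fusion (assumed design,
    # not the paper's exact model): audio frames attend over visual
    # frames, so low-quality video frames can be down-weighted.
    def __init__(self, audio_dim=256, visual_dim=512, attn_dim=128):
        super().__init__()
        self.q = nn.Linear(audio_dim, attn_dim)    # queries from audio
        self.k = nn.Linear(visual_dim, attn_dim)   # keys from video
        self.v = nn.Linear(visual_dim, attn_dim)   # values from video
        self.scale = attn_dim ** -0.5

    def forward(self, audio, visual):
        # audio:  (B, T_a, audio_dim), e.g. Conv-TasNet bottleneck frames
        # visual: (B, T_v, visual_dim), e.g. a lip-embedding sequence
        q, k, v = self.q(audio), self.k(visual), self.v(visual)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        selected = attn @ v                        # (B, T_a, attn_dim)
        # Concatenate along the feature axis, mirroring the baseline's
        # concatenation but with attention-selected visual features.
        return torch.cat([audio, selected], dim=-1)

# Usage with hypothetical shapes (25 fps video vs. 100 Hz audio frames):
fusion = AudioQueryVisualAttention()
a = torch.randn(2, 400, 256)
v = torch.randn(2, 100, 512)
print(fusion(a, v).shape)   # torch.Size([2, 400, 384])

Because the softmax runs over the visual time axis, the separator can also interpolate across neighboring video frames, which is one plausible reason such a fusion degrades more gracefully than plain concatenation when the video input is unreliable.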
Pages: 256 - 260 (5 pages)