AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS

Cited by: 1
Authors
Chern, I-Chun [1 ]
Hung, Kuo-Hsuan [2 ,3 ]
Chen, Yi-Ting [3 ]
Hussain, Tassadaq [4 ]
Gogate, Mandar [4 ]
Hussain, Amir [4 ]
Tsao, Yu [3 ]
Hou, Jen-Cheng [3 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Natl Taiwan Univ, Taipei, Taiwan
[3] Acad Sinica, Taipei, Taiwan
[4] Edinburgh Napier Univ, Edinburgh, Scotland
Keywords
Audio-Visual Speech Enhancement; Audio-Visual Speech Separation; AV-HuBERT
DOI
10.1109/ICASSPW59220.2023.10193049
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has proven effective for categorical problems such as automatic speech recognition and lip-reading, suggesting that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations generalize to real-world audio-visual regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leverage the pre-trained AV-HuBERT model followed by a speech enhancement (SE) module for AVSE and AVSS. Comparative experimental results demonstrate that the proposed model outperforms state-of-the-art AVSE models and traditional audio-only SE models. In summary, with proper fine-tuning strategies, our results confirm the effectiveness of the proposed model for the AVSS task, demonstrating that the multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
Pages: 5