AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS

Cited by: 1
Authors
Chern, I-Chun [1]
Hung, Kuo-Hsuan [2, 3]
Chen, Yi-Ting [3]
Hussain, Tassadaq [4]
Gogate, Mandar [4]
Hussain, Amir [4]
Tsao, Yu [3]
Hou, Jen-Cheng [3]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Natl Taiwan Univ, Taipei, Taiwan
[3] Acad Sinica, Taipei, Taiwan
[4] Edinburgh Napier Univ, Edinburgh, Scotland
Keywords
Audio-Visual Speech Enhancement; Audio-Visual Speech Separation; AV-HuBERT
DOI
10.1109/ICASSPW59220.2023.10193049
Chinese Library Classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained from multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations can be generalized to solve real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
Pages: 5
Related papers
50 in total
  • [21] Multi-modal temporal asynchronicity modeling by product HMMs for robust audio-visual speech recognition
    Nakamura, S
    Kumatani, K
    Tamura, S
    FOURTH IEEE INTERNATIONAL CONFERENCE ON MULTIMODAL INTERFACES, PROCEEDINGS, 2002: 305-309
  • [22] Single-modal Incremental Terrain Clustering from Self-Supervised Audio-Visual Feature Learning
    Ishikawa, Reina
    Hachiuma, Ryo
    Kurobe, Akiyoshi
    Saito, Hideo
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 9399-9406
  • [23] Self-Supervised Audio-Visual Feature Learning for Single-Modal Incremental Terrain Type Clustering
    Ishikawa, Reina
    Hachiuma, Ryo
    Saito, Hideo
    IEEE ACCESS, 2021, 9: 64346-64357
  • [24] INVESTIGATING SELF-SUPERVISED LEARNING FOR SPEECH ENHANCEMENT AND SEPARATION
    Huang, Zili
    Watanabe, Shinji
    Yang, Shu-wen
    Garcia, Paola
    Khudanpur, Sanjeev
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 6837-6841
  • [25] TOWARDS POSE-INVARIANT AUDIO-VISUAL SPEECH ENHANCEMENT IN THE WILD FOR NEXT-GENERATION MULTI-MODAL HEARING AIDS
    Gogate, Mandar
    Dashtipour, Kia
    Hussain, Amir
    2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW, 2023
  • [26] MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model With Multi-Modal Transformer
    Zhu, Dandan
    Zhu, Kun
    Ding, Weiping
    Zhang, Nana
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8(02): 1756-1771
  • [27] VISUALVOICE: Audio-Visual Speech Separation with Cross-Modal Consistency
    Gao, Ruohan
    Grauman, Kristen
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 15490-15500
  • [28] Self-Supervised Moving Vehicle Detection From Audio-Visual Cues
    Zuern, Jannik
    Burgard, Wolfram
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2022, 7(03): 7415-7422
  • [29] Comparing Learning Methodologies for Self-Supervised Audio-Visual Representation Learning
    Terbouche, Hacene
    Schoneveld, Liam
    Benson, Oisin
    Othmani, Alice
    IEEE ACCESS, 2022, 10: 41622-41638
  • [30] Self-Supervised Audio-Visual Representation Learning for in-the-wild Videos
    Feng, Zishun
    Tu, Ming
    Xia, Rui
    Wang, Yuxuan
    Krishnamurthy, Ashok
    2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2020: 5671-5672