AUDIO-VISUAL SPEECH ENHANCEMENT AND SEPARATION BY UTILIZING MULTI-MODAL SELF-SUPERVISED EMBEDDINGS

Cited by: 1

Authors:

Chern, I-Chun [1]

Hung, Kuo-Hsuan [2,3]

Chen, Yi-Ting [3]

Hussain, Tassadaq [4]

Gogate, Mandar [4]

Hussain, Amir [4]

Tsao, Yu [3]

Hou, Jen-Cheng [3]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Natl Taiwan Univ, Taipei, Taiwan

[3] Acad Sinica, Taipei, Taiwan

[4] Edinburgh Napier Univ, Edinburgh, Scotland

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023

Keywords:

Audio-Visual Speech Enhancement; Audio-Visual Speech Separation; AV-HuBERT

DOI:

10.1109/ICASSPW59220.2023.10193049

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that multi-modal self-supervised embeddings can provide useful audio-visual speech representations. Nevertheless, it is unclear whether such representations generalize to real-world multi-modal AV regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.


Pages: 5

