Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Cited by: 6
Authors
Shi, Bowen [1]
Mohamed, Abdelrahman [2]
Hsu, Wei-Ning [2]
Affiliations
[1] Toyota Technol Inst Chicago, Chicago, IL 60637 USA
[2] Meta AI, New York, NY USA
Source
INTERSPEECH 2022 | 2022
Keywords
audio-visual; speaker verification and recognition; representation learning; self-supervised pre-training
DOI
10.21437/Interspeech.2022-885
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing the equal error rate (EER) by 38% in the clean condition and 75% in noisy conditions.
Pages: 4785-4789
Page count: 5
Related papers
44 in total