Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Cited by: 6

Authors:

Shi, Bowen [1]

Mohamed, Abdelrahman [2]

Hsu, Wei-Ning [2]

Affiliations:

[1] Toyota Technol Inst Chicago, Chicago, IL 60637 USA

[2] Meta AI, New York, NY USA

Source:

INTERSPEECH 2022 | 2022

Keywords:

audio-visual; speaker verification and recognition; representation learning; self-supervised pre-training

DOI:

10.21437/Interspeech.2022-885

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

This paper investigates self-supervised pre-training for audio-visual speaker representation learning, where a visual stream showing the speaker's mouth area is used alongside speech as input. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and of the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency by roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.

Citation

Pages: 4785-4789

Page count: 5

References: 44 in total

