Target Active Speaker Detection with Audio-visual Cues

Cited by: 2

Authors:

Jiang, Yidi ^{[1]}

Tao, Ruijie ^{[1]}

Pan, Zexu ^{[1]}

Li, Haizhou ^{[1,2]}

Affiliations:

[1] National University of Singapore, Singapore

[2] The Chinese University of Hong Kong, School of Data Science, Shenzhen Research Institute of Big Data, Shenzhen, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;

DOI:

10.21437/Interspeech.2023-574

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the lip region of a speaker. In real-world applications, it is possible that we also have the reference speech of the on-screen speaker. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in terms of AP, AUC and EER on the ASW test set, respectively. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
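
The abstract reports results in terms of the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how that metric can be approximated from per-frame detection scores, the snippet below sweeps thresholds over the observed scores. The function name `equal_error_rate` and the toy scores are illustrative assumptions; this is not the evaluation code used by the authors.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: per-frame speaking scores (higher means more likely speaking).
    labels: 1 for a speaking frame, 0 for a non-speaking frame.
    Hypothetical helper for illustration only.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # False acceptance: a non-speaking frame scored at or above the threshold.
        fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
        # False rejection: a speaking frame scored below the threshold.
        fr = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
        far, frr = fa / negatives, fr / positives
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example with made-up scores (not data from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(equal_error_rate(scores, labels))  # 0.25
```

A lower EER is better, so the reported 0.8% improvement on the ASW test set corresponds to a reduction in this value.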


Pages: 3152-3156

Page count: 5

