Target Active Speaker Detection with Audio-visual Cues

Cited by: 2
Authors
Jiang, Yidi [1]
Tao, Ruijie [1]
Pan, Zexu [1]
Li, Haizhou [1, 2]
Affiliations
[1] National University of Singapore, Singapore
[2] The Chinese University of Hong Kong, School of Data Science, Shenzhen Research Institute of Big Data, Shenzhen, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
Active speaker detection; target speaker; audiovisual; speaker recognition; SPEECH;
DOI
10.21437/Interspeech.2023-574
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
In active speaker detection (ASD), the goal is to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6% in mAP on the AVA-ActiveSpeaker validation set, and 0.8%, 0.4%, and 0.8% in AP, AUC, and EER, respectively, on the ASW test set. Code is available at https://github.com/Jiang-Yidi/TS-TalkNet/.
Pages: 3152 - 3156
Number of pages: 5
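
The abstract's central idea, using a pre-enrolled speaker embedding to complement the audio-visual synchronization cue, can be illustrated with a toy score-fusion sketch. This is not the TS-TalkNet architecture, which the record does not describe in detail; the function names (`cosine_similarity`, `fuse_scores`, `detect`), the fixed `weight`, and the `threshold` are all hypothetical choices made only for illustration. It assumes each frame already has a synchronization score in [0, 1] and a frame-level speaker embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_scores(sync_scores, frame_embeddings, enrolled_embedding, weight=0.5):
    """Blend per-frame sync scores with similarity to the enrolled speaker.

    Assumption: a fixed linear blend stands in for whatever learned fusion
    a real model would use.
    """
    fused = []
    for sync, emb in zip(sync_scores, frame_embeddings):
        sim = cosine_similarity(emb, enrolled_embedding)
        sim01 = (sim + 1.0) / 2.0  # map [-1, 1] to [0, 1]
        fused.append((1.0 - weight) * sync + weight * sim01)
    return fused


def detect(fused_scores, threshold=0.5):
    """Per-frame decision: is the target speaker speaking?"""
    return [score >= threshold for score in fused_scores]


# Two frames: the first matches the enrolled speaker, the second does not.
enrolled = [1.0, 0.0]
frames = [[1.0, 0.0], [-1.0, 0.0]]
sync = [0.4, 0.6]
fused = fuse_scores(sync, frames, enrolled)
decisions = detect(fused)
```

In this toy case the speaker cue overrides a weak or misleading synchronization score: the first frame is accepted despite a low sync score, and the second is rejected despite a higher one.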