Rethinking the visual cues in audio-visual speaker extraction

Citations: 1
Authors
Li, Junjie [1]
Ge, Meng [2, 4]
Pan, Zexu [3]
Cao, Rui [1]
Wang, Longbiao [1]
Dang, Jianwu [1]
Zhang, Shiliang
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Natl Univ Singapore, Inst Data Sci, Singapore, Singapore
[4] Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
Visual cues; speaker extraction; identity; synchronization; decouple; SPEECH;
DOI
10.21437/Interspeech.2023-2545
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.
Pages: 3754-3758
Number of pages: 5
Related papers
50 in total
  • [1] Target Active Speaker Detection with Audio-visual Cues
    Jiang, Yidi
    Tao, Ruijie
    Pan, Zexu
    Li, Haizhou
    INTERSPEECH 2023, 2023, : 3152 - 3156
  • [2] RETHINKING AUDIO-VISUAL SYNCHRONIZATION FOR ACTIVE SPEAKER DETECTION
    Wuerkaixi, Abudukelimu
    Zhang, You
    Duan, Zhiyao
    Zhang, Changshui
    2022 IEEE 32ND INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2022,
  • [3] Dynamic visual features for audio-visual speaker verification
    Dean, David
    Sridharan, Sridha
    COMPUTER SPEECH AND LANGUAGE, 2010, 24 (02): : 136 - 149
  • [4] Audio-Visual Synchronisation for Speaker Diarisation
    Garau, Giulia
    Dielmann, Alfred
    Bourlard, Herve
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2662 - +
  • [5] PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
    Liu, Qinghua
    Ge, Meng
    Wu, Zhizheng
    Li, Haizhou
    INTERSPEECH 2023, 2023, : 3719 - 3723
  • [6] A Visual Signal Reliability for Robust Audio-Visual Speaker Identification
    Tariquzzaman, Md.
    Kim, Jin Young
    Na, Seung You
    Kim, Hyoung-Gook
    Har, Dongsoo
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2011, E94D (10): : 2052 - 2055
  • [7] Audio-visual speaker identification based on the use of dynamic audio and visual features
    Fox, N
    Reilly, RB
    AUDIO-BASED AND VIDEO-BASED BIOMETRIC PERSON AUTHENTICATION, PROCEEDINGS, 2003, 2688 : 743 - 751
  • [8] Multi-Speaker Audio-Visual Corpus RUSAVIC: Russian Audio-Visual Speech in Cars
    Ivanko, Denis
    Ryumin, Dmitry
    Axyonov, Alexandr
    Kashevnik, Alexey
    Karpov, Alexey
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1555 - 1559
  • [9] Non-invasive extraction of audio-visual cues for multimodal applications
    Kabre, H
    HYBRID IMAGE AND SIGNAL PROCESSING VI, 1998, 3389 : 133 - 138
  • [10] Speaker independent audio-visual speech recognition
    Zhang, Y
    Levinson, S
    Huang, T
    2000 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, PROCEEDINGS VOLS I-III, 2000, : 1073 - 1076