Integration of audio-visual information for multi-speaker multimedia speaker recognition

Cited: 1
Authors
Yang, Jichen [1 ]
Chen, Fangfan [1 ]
Cheng, Yu [2 ]
Lin, Pei [3 ]
Affiliations
[1] Guangdong Polytech Normal Univ, Sch Cyber Secur, Guangzhou, Peoples R China
[2] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore, Singapore
[3] Guangdong Polytech Normal Univ, Sch Elect & Informat, Guangzhou, Peoples R China
Keywords
Multi-speaker multimedia speaker recognition; Audio information; Visual information; FACE RECOGNITION; MODEL; DIARIZATION; TRACKING; FEATURES;
DOI
10.1016/j.dsp.2023.104315
Chinese Library Classification (CLC)
TM [Electrical technology]; TN [Electronic technology, communication technology];
Discipline codes
0808 ; 0809 ;
Abstract
Recently, multi-speaker multimedia speaker recognition (MMSR) has garnered significant attention. While prior research focused primarily on back-end score-level fusion of audio and visual information, this study explores techniques for integrating audio and visual cues at the front end, using representations of both the speaker's voice and face. The first method uses visual information to estimate the number of speakers, addressing the difficulty of speaker counting in multi-speaker conversations, especially in noisy environments. Agglomerative hierarchical clustering is then employed for speaker diarization, which proves beneficial for MMSR. This approach is termed video aiding audio fusion (VAAF). The second method introduces a ratio factor to construct a multimedia vector (M-vector) that concatenates face embeddings with the x-vector, encapsulating both audio and visual cues; the resulting M-vector is then used for MMSR. We name this method video interacting audio fusion (VIAF). Experimental results on the NIST SRE 2019 audio-visual corpus show that the VAAF-based MMSR achieves a 6.94% and 8.31% relative reduction in minDCF and actDCF, respectively, when benchmarked against zero-effort systems. The VIAF-based MMSR achieves a 12.08% and 12.99% relative reduction in minDCF and actDCF, respectively, compared with systems that use only face embeddings. Notably, when the two methods are combined, minDCF and actDCF improve further, reaching 0.098 and 0.102, respectively.
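The abstract does not give the exact formula for the ratio-factor concatenation behind the M-vector, but one plausible reading is a weighted concatenation of length-normalized face and voice embeddings. The sketch below illustrates that interpretation; the function name `m_vector`, the `ratio` parameter, and the normalization step are assumptions for illustration, not the authors' published implementation.

```python
import numpy as np

def m_vector(face_emb, x_vector, ratio=0.5):
    """Hypothetical M-vector sketch (VIAF): length-normalize the face
    embedding and the x-vector, scale them by a ratio factor, and
    concatenate into a single audio-visual representation."""
    f = face_emb / np.linalg.norm(face_emb)   # unit-norm face embedding
    x = x_vector / np.linalg.norm(x_vector)   # unit-norm x-vector
    # ratio controls the relative contribution of visual vs. audio cues
    return np.concatenate([ratio * f, (1.0 - ratio) * x])

rng = np.random.default_rng(0)
m = m_vector(rng.standard_normal(512), rng.standard_normal(512), ratio=0.6)
print(m.shape)  # (1024,)
```

Under this reading, the M-vector could be scored with any standard back end (e.g. cosine similarity or PLDA) exactly as a plain x-vector would be.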
Pages: 10