E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing

Citations: 0
Authors
Yu, Xiaojing [1 ]
Zhang, Lan
Li, Xiang-yang
Affiliations
[1] Univ Sci & Technol China, Dept Comp Sci, Hefei, Anhui, Peoples R China
Source
2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON | 2023
Funding
National Key R&D Program of China;
Keywords
active speaker detection; filtering; temporality-level stream; SELECTION;
DOI
10.1109/SECON58729.2023.10287518
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Active Speaker Detection (ASD) aims to enhance communication and interaction in various scenarios, including meetings, group discussions, and security surveillance systems. The primary objective of ASD is to identify and label the position of the main active speaker. In large-scale surveillance systems, real-time ASD can cause network congestion because of the extensive video data uploaded from numerous cameras. To address this challenge, we propose a collaborative edge-cloud solution for ASD called E-TALK. E-TALK leverages the relative simplicity of comparing and processing voiceprints, as opposed to analyzing full video sequences, and uses voiceprint consistency as the criterion for detecting a change of active speaker. Our research evaluates the performance and computational cost of different voiceprint features and recognition models on speaker identification tasks. Additionally, E-TALK introduces a potential-speaker tracking scheme for fixed-angle cameras, in conjunction with foreground extraction algorithms. Moreover, E-TALK incorporates a cloud-based high-precision facial ASD model, which uses historical information to determine the active speaker in real time. We conducted experiments to evaluate the performance of our proposed solution in various scenarios and settings. The results demonstrate the effectiveness of E-TALK in improving active speaker detection, highlighting its potential for practical deployment in surveillance systems.
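The voiceprint-consistency criterion described in the abstract can be sketched as a comparison between successive speaker embeddings. The sketch below is illustrative only: the cosine-similarity measure, the `speaker_changed` helper, and the threshold value are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def speaker_changed(prev_emb, curr_emb, threshold=0.7):
    """Flag a speaker change when voiceprint similarity drops below a
    threshold; only then would the costly video-based ASD be re-run."""
    return cosine_similarity(prev_emb, curr_emb) < threshold

# Toy embeddings: an identical voiceprint yields similarity 1.0 (no change),
# while a dissimilar one falls below the threshold (change detected).
v = np.array([0.2, 0.5, 0.1, 0.8])
u = np.array([0.9, -0.1, 0.4, 0.0])
print(speaker_changed(v, v))  # identical voiceprint: no change
print(speaker_changed(v, u))  # dissimilar voiceprint: change detected
```

Such a filter keeps per-frame work on the edge limited to a vector comparison, matching the record's claim that voiceprint processing is much cheaper than analyzing video sequences.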
Pages: 9