E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing

Times cited: 0
Authors
Yu, Xiaojing [1]
Zhang, Lan
Li, Xiang-yang
Affiliations
[1] Univ Sci & Technol China, Dept Comp Sci, Hefei, Anhui, Peoples R China
Source
2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON | 2023
Funding
National Key Research and Development Program of China
Keywords
active speaker detection; filtering; temporality-level stream; SELECTION
DOI
10.1109/SECON58729.2023.10287518
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification codes
0808; 0809
Abstract
Active Speaker Detection (ASD) aims to enhance communication and interaction in various scenarios, including meetings, group discussions, and security surveillance systems. The primary objective of ASD is to identify and label the position of the main active speaker. In large-scale surveillance systems, real-time ASD can cause network congestion because of the extensive video data uploaded from numerous cameras. To address this challenge, we propose a collaborative edge-cloud solution called E-TALK for ASD. E-TALK leverages the simplicity of voiceprint comparison and processing, as opposed to analyzing video sequences. It uses voiceprint consistency as the criterion for determining whether the active speaker has changed. Our research focuses on evaluating the performance and computational costs of different voiceprint features and recognition models in speaker identification tasks. Additionally, E-TALK introduces a potential speaker tracking scheme for fixed-angle cameras, in conjunction with foreground extraction algorithms. Moreover, E-TALK incorporates a cloud-based high-precision facial ASD model, which uses historical information to determine the active speaker in real time. We conducted experiments to evaluate the performance of our proposed solution in various scenarios and settings. The results demonstrate the effectiveness of the E-TALK approach in improving active speaker detection, highlighting its potential for practical application in surveillance systems.
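The abstract's idea of using voiceprint consistency to decide whether the active speaker has changed can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the plain-list embedding representation, and the 0.8 threshold are all assumptions chosen for illustration. The sketch assumes each audio segment has already been reduced to a fixed-length voiceprint vector on the edge device, and it flags a segment for cloud-side analysis only when its voiceprint differs from the last flagged one.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segments_to_upload(voiceprints, threshold=0.8):
    """Return indices of segments where the speaker appears to change.

    voiceprints: list of embedding vectors, one per audio segment
                 (hypothetical input; the extraction model is not shown).
    threshold:   assumed similarity cutoff below which two voiceprints
                 are treated as different speakers.
    """
    upload = []
    reference = None  # voiceprint of the last segment that was flagged
    for index, voiceprint in enumerate(voiceprints):
        if reference is None or cosine_similarity(reference, voiceprint) < threshold:
            upload.append(index)
            reference = voiceprint
    return upload


if __name__ == "__main__":
    # Two segments from one speaker followed by two from another.
    demo = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(segments_to_upload(demo))
```

In this sketch only the first segment of each speaker turn is flagged, which is one way an edge node could avoid uploading video while the voiceprint stays consistent.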
Pages: 9