E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing

Cited by: 0

Authors:

Yu, Xiaojing [1]

Zhang, Lan

Li, Xiang-yang

Affiliations:

[1] Univ Sci & Technol China, Dept Comp Sci, Hefei, Anhui, Peoples R China

Source:

2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON | 2023

Funding:

National Key Research and Development Program of China;

Keywords:

active speaker detection; filtering; temporality-level stream; SELECTION;

DOI:

10.1109/SECON58729.2023.10287518

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Active Speaker Detection (ASD) aims to enhance communication and interaction in scenarios such as meetings, group discussions, and security surveillance. Its primary objective is to identify and label the position of the main active speaker. In large-scale surveillance systems, real-time ASD can cause network congestion because numerous cameras upload large volumes of video data. To address this challenge, we propose E-TALK, a collaborative edge-cloud solution for ASD. E-TALK exploits the fact that comparing and processing voiceprints is simpler than analyzing video sequences, and it uses voiceprint consistency as the criterion for deciding whether the active speaker has changed. We evaluate the performance and computational cost of different voiceprint features and recognition models on speaker identification tasks. In addition, E-TALK introduces a potential-speaker tracking scheme for fixed-angle cameras that works in conjunction with foreground extraction algorithms. E-TALK also incorporates a cloud-based, high-precision facial ASD model that uses historical information to determine the active speaker in real time. We conducted experiments to evaluate the proposed solution in various scenarios and settings. The results demonstrate the effectiveness of E-TALK in improving active speaker detection and highlight its potential for practical use in surveillance systems.
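
The voiceprint-consistency criterion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the voiceprint features, the similarity measure, or any threshold. The sketch assumes each audio segment has already been reduced to a fixed-length embedding vector, uses cosine similarity as a stand-in consistency measure, and uses a hypothetical threshold of 0.75. The function names `cosine_similarity` and `segments_to_upload` are illustrative only.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two voiceprint embeddings (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as matching nothing
    return dot / (norm_a * norm_b)


def segments_to_upload(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.75
) -> List[int]:
    """Return indices of segments where the speaker appears to have changed.

    The first segment is always selected. A later segment is selected only
    when its similarity to the current reference voiceprint drops below the
    (hypothetical) threshold; the reference is then replaced by that segment.
    """
    selected: List[int] = []
    reference = None
    for index, embedding in enumerate(embeddings):
        if reference is None or cosine_similarity(reference, embedding) < threshold:
            selected.append(index)
            reference = embedding
    return selected


if __name__ == "__main__":
    # Toy 2-D "voiceprints": two segments of one speaker, two of another,
    # then a return to the first speaker.
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
    print(segments_to_upload(toy))
```

Under these assumptions, only the segments flagged by the edge-side check would trigger the more expensive cloud-side facial ASD model, which is the kind of traffic reduction the abstract motivates.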

Pages: 9