AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION

Cited by: 0

Authors:

Roth, Joseph [1]
Chaudhuri, Sourish [1]
Klejch, Ondrej [1]
Marvin, Radhika [1]
Gallagher, Andrew [1]
Kaver, Liat [1]
Ramaswamy, Sharadh [1]
Stopczynski, Arkadiusz [1]
Schmid, Cordelia [1]
Xi, Zhonghua [1]
Pantofaru, Caroline [1]

Affiliations:

[1] Google Research, Mountain View, CA 94043, USA

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

multimodal; audio-visual; active speaker detection; dataset

DOI:

10.1109/icassp40776.2020.9053900

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.

Pages: 4492-4496

Page count: 5
