AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION

Authors
Roth, Joseph [1]
Chaudhuri, Sourish [1]
Klejch, Ondrej [1]
Marvin, Radhika [1]
Gallagher, Andrew [1]
Kaver, Liat [1]
Ramaswamy, Sharadh [1]
Stopczynski, Arkadiusz [1]
Schmid, Cordelia [1]
Xi, Zhonghua [1]
Pantofaru, Caroline [1]
Affiliations
[1] Google Research, Mountain View, CA 94043 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
multimodal; audio-visual; active speaker detection; dataset
DOI
10.1109/icassp40776.2020.9053900
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and, if speaking, whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
Pages: 4492-4496
Page count: 5