AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION

Cited: 0
Authors
Roth, Joseph [1]
Chaudhuri, Sourish [1]
Klejch, Ondrej [1]
Marvin, Radhika [1]
Gallagher, Andrew [1]
Kaver, Liat [1]
Ramaswamy, Sharadh [1]
Stopczynski, Arkadiusz [1]
Schmid, Cordelia [1]
Xi, Zhonghua [1]
Pantofaru, Caroline [1]
Affiliations
[1] Google Research, Mountain View, CA 94043 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2020
Keywords
multimodal; audio-visual; active speaker detection; dataset;
DOI
10.1109/icassp40776.2020.9053900
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206; 082403;
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
Pages: 4492 - 4496
Number of pages: 5
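
The abstract describes per-frame labels (speaking or not, and whether the speech is audible) attached to face tracks. As a rough illustration of how such annotations might be consumed, the Python sketch below groups per-frame labels into face tracks and tallies the label distribution; the file name, column layout, and label strings are assumptions for illustration and are not taken from the official AVA-ActiveSpeaker release.

import csv
from collections import Counter, defaultdict

# Hypothetical annotation file and column layout (not the official release format).
ANNOTATION_CSV = "ava_activespeaker_sample.csv"
COLUMNS = [
    "video_id", "frame_timestamp",
    "box_x1", "box_y1", "box_x2", "box_y2",
    "label",          # assumed values, e.g. SPEAKING_AUDIBLE / SPEAKING_NOT_AUDIBLE / NOT_SPEAKING
    "face_track_id",
]

def load_face_tracks(path=ANNOTATION_CSV):
    """Group per-frame labels into face tracks keyed by (video_id, face_track_id)."""
    tracks = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=COLUMNS):
            key = (row["video_id"], row["face_track_id"])
            tracks[key].append((float(row["frame_timestamp"]), row["label"]))
    # Sort each track by timestamp so temporal models can consume frames in order.
    for frames in tracks.values():
        frames.sort()
    return tracks

if __name__ == "__main__":
    tracks = load_face_tracks()
    labels = Counter(label for frames in tracks.values() for _, label in frames)
    print(f"{len(tracks)} face tracks; label distribution: {dict(labels)}")

A temporal model of the kind evaluated in the paper would consume each sorted track as a sequence, pairing the per-frame face crops with the corresponding audio segment.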
Related Papers
50 records in total
  • [11] How to Design a Three-Stage Architecture for Audio-Visual Active Speaker Detection in the Wild
    Köpüklü, Okan
    Taseska, Maja
    Rigoll, Gerhard
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1173 - 1183
  • [12] Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
    Sharma, Rahul
    Narayanan, Shrikanth
    IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2023, 4 : 225 - 232
  • [13] Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
    Jiang, Hao
    Murdock, Calvin
    Ithapu, Vamsi Krishna
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 10534 - 10542
  • [14] Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
    Tao, Ruijie
    Pan, Zexu
    Das, Rohan Kumar
    Qian, Xinyuan
    Shou, Mike Zheng
    Li, Haizhou
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 3927 - 3935
  • [15] Active Speaker Detection Using Audio, Visual, and Depth Modalities: A Survey
    Robi, Siti Nur Aisyah Mohd
    Ariffin, Muhammad Atiff Zakwan Mohd
    Izhar, Mohd Azri Mohd
    Ahmad, Norulhusna
    Kaidi, Hazilah Mad
    IEEE ACCESS, 2024, 12 : 96617 - 96634
  • [16] Audio-Visual Synchronisation for Speaker Diarisation
    Garau, Giulia
    Dielmann, Alfred
    Bourlard, Herve
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2662 - +
  • [17] WASD: A Wilder Active Speaker Detection Dataset
    Roxo, Tiago
    Costa, Joana Cabral
    Inácio, Pedro R. M.
    Proença, Hugo
    IEEE TRANSACTIONS ON BIOMETRICS, BEHAVIOR, AND IDENTITY SCIENCE, 2025, 7 (01): 61 - 70
  • [18] BEST OF BOTH WORLDS: MULTI-TASK AUDIO-VISUAL AUTOMATIC SPEECH RECOGNITION AND ACTIVE SPEAKER DETECTION
    Braga, Otavio
    Siohan, Olivier
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6047 - 6051
  • [19] E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing
    Yu, Xiaojing
    Zhang, Lan
    Li, Xiang-yang
    2023 20TH ANNUAL IEEE INTERNATIONAL CONFERENCE ON SENSING, COMMUNICATION, AND NETWORKING, SECON, 2023,
  • [20] Speaker position detection system using audio-visual information
    Matsuo, N
    Kitagawa, H
    Nagata, S
    FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 1999, 35 (02): 212 - 220