AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION

Authors
Roth, Joseph [1]
Chaudhuri, Sourish [1]
Klejch, Ondrej [1]
Marvin, Radhika [1]
Gallagher, Andrew [1]
Kaver, Liat [1]
Ramaswamy, Sharadh [1]
Stopczynski, Arkadiusz [1]
Schmid, Cordelia [1]
Xi, Zhonghua [1]
Pantofaru, Caroline [1]
Affiliation
[1] Google Research, Mountain View, CA 94043, USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
multimodal; audio-visual; active speaker detection; dataset
DOI
10.1109/icassp40776.2020.9053900
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
Pages: 4492-4496
Page count: 5