AVA ACTIVE SPEAKER: AN AUDIO-VISUAL DATASET FOR ACTIVE SPEAKER DETECTION

Cited by: 0
Authors
Roth, Joseph [1]
Chaudhuri, Sourish [1]
Klejch, Ondrej [1]
Marvin, Radhika [1]
Gallagher, Andrew [1]
Kaver, Liat [1]
Ramaswamy, Sharadh [1]
Stopczynski, Arkadiusz [1]
Schmid, Cordelia [1]
Xi, Zhonghua [1]
Pantofaru, Caroline [1]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
multimodal; audio-visual; active speaker detection; dataset
DOI
10.1109/icassp40776.2020.9053900
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
Pages: 4492-4496
Number of pages: 5