AV16.3: An audio-visual corpus for speaker localization and tracking

Cited by: 0
Authors
Lathoud, G [1 ]
Odobez, JM
Gatica-Perez, D
Affiliations
[1] IDIAP Res Inst, CH-1920 Martigny, Switzerland
[2] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland
Source
MACHINE LEARNING FOR MULTIMODAL INTERACTION | 2005 / Vol. 3361
DOI
Not available
CLC classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the ground-truth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, video and audio-visual speaker localization and tracking. The desired location annotation can be either 2-dimensional (image plane) or 3-dimensional (physical space). This paper motivates and describes a corpus of audio-visual data called "AV16.3", along with a method for 3-D location annotation based on calibrated cameras. "16.3" stands for 16 microphones and 3 cameras, recorded in a fully synchronized manner, in a meeting room. Part of this corpus has already been successfully used to report research results.
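The 3-D location annotation described in the abstract relies on calibrated cameras, i.e. on recovering a speaker's position in physical space from its image coordinates in several synchronized views. The paper does not spell out the algorithm here, but a standard way to do this is linear (DLT) triangulation from known camera projection matrices. The sketch below is an illustrative assumption, not the authors' published method; the `triangulate` function and the synthetic camera matrices are hypothetical.

```python
import numpy as np

def triangulate(projections, image_points):
    """Linear (DLT) triangulation of one 3-D point from N calibrated views.

    projections  : list of 3x4 camera projection matrices P_i
    image_points : list of (u, v) pixel coordinates of the same point
    Returns the 3-D point in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(projections, image_points):
        # Each view contributes two linear constraints on the homogeneous
        # point X:  u * (P[2] @ X) = P[0] @ X  and  v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector of A with the
    # smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Hypothetical example: two unit-focal cameras, the second shifted 1 m along x.
P1 = np.array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.]])
P2 = np.array([[1., 0., 0., -1.], [0., 1., 0., 0.], [0., 0., 1., 0.]])
X_true = np.array([0.5, 0.2, 4.0])
pts = []
for P in (P1, P2):
    x = P @ np.append(X_true, 1.0)
    pts.append((x[0] / x[2], x[1] / x[2]))
X_est = triangulate([P1, P2], pts)
```

With noise-free synthetic observations, `X_est` recovers `X_true` to machine precision; with real calibrated cameras the same least-squares system averages out pixel-level annotation noise across views.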
Pages: 182-195
Page count: 14