Multimedia fusion in automatic extraction of studio speech segments for spoken document retrieval

Cited by: 0
Authors
Hui, PY [1]
Lo, WK [1]
Meng, HM [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Human Comp Commun Lab, Shatin, Hong Kong, Peoples R China
Source
2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL V, PROCEEDINGS: SENSOR ARRAY & MULTICHANNEL SIGNAL PROCESSING AUDIO AND ELECTROACOUSTICS MULTIMEDIA SIGNAL PROCESSING | 2003
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper describes our progress in Cantonese spoken document retrieval. Over 60 hours of Cantonese television news broadcasts have been collected as part of the AoE-IT Multimedia Repository. We have also developed the Multimedia Markup Language (MmML) for annotating the multimedia content in terms of anchor/field video frames and audio recordings. The audio tracks are indexed by a Cantonese syllable recognizer. Our investigation indicates that there is a large discrepancy in recognition performance: syllable accuracy (and the corresponding reliability of audio indexing) drops from 59% to 39% as we move from anchor speech recorded in the studio to reporter/interview speech recorded in the field. Hence we present several automatic methods to extract anchor/studio speech from the audio tracks for retrieval: (i) extraction based only on video information using a fuzzy c-means algorithm; (ii) extraction based only on audio information using Gaussian Mixture Models; and (iii) a fusion strategy that combines video- and audio-based extraction. This paper presents the performance of the various extraction techniques and the related retrieval performance in a known-item spoken document retrieval task.
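The abstract does not say how the video-based and audio-based decisions are combined, so the following is only a minimal sketch of one generic option, a weighted late fusion of two per-segment scores, and not the authors' method. The `Segment` class, the `video_score` and `audio_score` fields, the weight, and the threshold are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One candidate audio segment with hypothetical per-modality scores."""
    start_s: float      # segment start time in seconds
    end_s: float        # segment end time in seconds
    video_score: float  # assumed: degree of membership in an "anchor" frame cluster, 0..1
    audio_score: float  # assumed: posterior of a "studio speech" audio model, 0..1


def fuse(segment: Segment, video_weight: float = 0.5) -> float:
    """Weighted average of the two scores (a generic late-fusion rule)."""
    if not 0.0 <= video_weight <= 1.0:
        raise ValueError("video_weight must lie in [0, 1]")
    return (video_weight * segment.video_score
            + (1.0 - video_weight) * segment.audio_score)


def extract_studio_segments(segments, video_weight=0.5, threshold=0.5):
    """Keep the segments whose fused score reaches the threshold."""
    return [s for s in segments if fuse(s, video_weight) >= threshold]


# Illustrative values only; these are not data from the paper.
example_segments = [
    Segment(0.0, 10.0, video_score=0.9, audio_score=0.8),
    Segment(10.0, 20.0, video_score=0.2, audio_score=0.3),
    Segment(20.0, 30.0, video_score=0.8, audio_score=0.4),
]
studio = extract_studio_segments(example_segments)
```

Setting `video_weight` to 1.0 or 0.0 reduces this sketch to video-only or audio-only extraction, which mirrors the three configurations the abstract compares.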
Pages: 724-727
Page count: 4