SpeechFind: Advances in spoken document retrieval for a National Gallery of the Spoken Word

Cited by: 66
Authors
Hansen, JH [1]
Huang, RQ
Zhou, B
Seadle, M
Deller, JR
Gurijala, AR
Kurimo, M
Angkititrakul, P
Affiliations
[1] Univ Texas, Ctr Robust Speech Syst, Richardson, TX 75083 USA
[2] Univ Colorado, Ctr Spoken Language Res, Robust Speech Proc Grp, Boulder, CO 80302 USA
[3] Michigan State Univ, Main Lib, E Lansing, MI 48824 USA
[4] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY USA
Source
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING | 2005, Vol. 13, No. 5
Funding
U.S. National Science Foundation;
Keywords
accent classification; broadcast news; document expansion; environmental sniffing; fidelity; fused error score; information retrieval; language modeling; model adaptation; query expansion; robust speech recognition; robustness; security; speech segmentation; spoken document retrieval; watermarking;
DOI
10.1109/TSA.2005.852088
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Advances in formulating spoken document retrieval for a new National Gallery of the Spoken Word (NGSW) are addressed. NGSW is the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings from the 20th century. After presenting an overview of the audio stream content of the NGSW, with sample audio files from U.S. Presidents from 1893 to the present, an overall system diagram is proposed with a discussion of critical tasks associated with effective audio information retrieval. These include advanced audio segmentation, speech recognition model adaptation for acoustic background noise and speaker variability, and information retrieval using natural language processing for text query requests that include document and query expansion. For segmentation, a new evaluation criterion entitled fused error score (FES) is proposed, followed by application of the CompSeg segmentation scheme on DARPA Hub4 Broadcast News (30.5% relative improvement in FES) and NGSW data. Transcript generation is demonstrated for a six-decade portion of the NGSW corpus. Novel model adaptation using structure maximum likelihood eigenspace mapping shows a relative 21.7% improvement. Issues regarding copyright assessment and metadata construction are also addressed for the purposes of a sustainable audio collection of this magnitude. Advanced parameter-embedded watermarking is proposed with evaluations showing robustness to correlated noise attacks. Our experimental online system entitled "SpeechFind" is presented, which allows for audio retrieval from a portion of the NGSW corpus. Finally, a number of research challenges such as language modeling and lexicon for changing time periods, speaker trait and identification tracking, as well as new directions, are discussed in order to address the overall task of robust phrase searching in unrestricted audio corpora.
Pages: 712-730
Page count: 19