Contemporaneous text as side-information in statistical language modeling

Cited by: 2
Authors
Khudanpur, S
Kim, W
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Dept Elect & Comp Engn, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Dept Comp Sci, Baltimore, MD 21218 USA
Funding
National Science Foundation (USA);
Keywords
multi-lingual processing; statistical language modeling; automatic speech recognition; resource-deficient languages; lexical triggers; maximum entropy;
DOI
10.1016/j.csl.2003.09.001
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose new methods to exploit contemporaneous text, such as on-line news articles, to improve language models for automatic speech recognition and other natural language processing applications. In particular, we investigate the use of text from a resource-rich language to sharpen language models for processing a news story or article in a language with scarce linguistic resources. We demonstrate that even with fairly crude cross-language information retrieval and simple machine translation, one can construct story-specific Chinese language models which exploit cues from a side-corpus of English newswire to significantly improve the performance of language models estimated from a static Chinese corpus. Our investigations cover cases when the amount of available Chinese text is small, and a case when a large Chinese text corpus is available. We examine the effectiveness of our techniques both when the side-corpus contains English documents that are near-translations of the Chinese documents being processed, and when the English side-corpus is merely from contemporaneous and independent news sources. We present experimental results for automatic transcription of speech from the Mandarin Broadcast News corpus. (C) 2003 Elsevier Ltd. All rights reserved.
Pages: 143-162
Page count: 20