Adaptation of maximum entropy capitalizer: Little data can help a lot

Cited by: 77
Authors
Chelba, Ciprian [1]
Acero, Alex [1]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
Keywords
DOI
10.1016/j.csl.2005.05.005
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A novel technique for maximum "a posteriori" (MAP) adaptation of maximum entropy (MaxEnt) and maximum entropy Markov models (MEMM) is presented. The technique is applied to the problem of automatically capitalizing uniformly cased text. Automatic capitalization is a practically relevant problem: speech recognition output needs to be capitalized; also, modern word processors perform capitalization among other text proofing algorithms such as spelling correction and grammar checking. Capitalization can also be used as a preprocessing step in named entity extraction or machine translation. A "background" capitalizer trained on 20 M words of Wall Street Journal (WSJ) text from 1987 is adapted to two Broadcast News (BN) test sets - one containing ABC Primetime Live text and the other NPR Morning News/CNN Morning Edition text - from 1996. The "in-domain" performance of the WSJ capitalizer is 45% better relative to the 1-gram baseline, when evaluated on a test set drawn from WSJ 1994. When evaluated on the mismatched "out-of-domain" test data, the 1-gram baseline is outperformed by 60% relative; the improvement brought by the adaptation technique using a very small amount of matched BN data - 25-70k words - is about 20-25% relative. Overall, an automatic capitalization error rate of 1.4% is achieved on BN data. The performance gain obtained by employing our adaptation technique using a tiny amount of out-of-domain training data on top of the background data is striking: as little as 0.14 M words of in-domain data brings more improvement than using 10 times more background training data (from 2 M words to 20 M words). (c) 2005 Elsevier Ltd. All rights reserved.
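For a concrete picture of the adaptation step: in a common formulation of MAP adaptation for MaxEnt models, the background model's feature weights serve as the mean of a Gaussian prior, and the adapted weights maximize the in-domain log-likelihood minus a squared-distance penalty to the background weights. The sketch below illustrates only that general idea; it is not the authors' implementation, and the feature set, tag set, data, and hyperparameters (sigma2, learning rate, epochs) are placeholder assumptions.

```python
# Minimal sketch (not the authors' code) of MAP adaptation of a MaxEnt tagger:
# maximize in-domain log-likelihood plus a Gaussian log-prior centred on the
# background weights. Features, tags, data, and hyperparameters are made up.
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def map_adapt(X, y, lam_bg, sigma2=0.5, lr=0.1, epochs=200):
    """Adapt background weights lam_bg (num_feats x num_tags) on a small
    in-domain set (X: n x num_feats, y: n integer tag ids) by gradient ascent
    on  sum_i log P(y_i | x_i; lam)  -  ||lam - lam_bg||^2 / (2 * sigma2)."""
    lam = lam_bg.copy()
    Y = np.eye(lam.shape[1])[y]                         # one-hot tag targets
    for _ in range(epochs):
        p = softmax(X @ lam)                            # P(tag | features) under current lam
        grad = X.T @ (Y - p) - (lam - lam_bg) / sigma2  # d/dlam of log-lik + log-prior
        lam += lr * grad / len(X)                       # scaled gradient-ascent step
    return lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_feats, num_tags = 50, 5   # e.g. capitalization tags (lowercase, capitalized, all-caps, mixed, punctuation)
    lam_bg = rng.normal(scale=0.1, size=(num_feats, num_tags))  # stand-in "background" (WSJ-like) weights
    X_bn = (rng.random((300, num_feats)) < 0.1).astype(float)   # stand-in small "in-domain" (BN-like) sample
    y_bn = rng.integers(0, num_tags, size=300)
    lam_adapted = map_adapt(X_bn, y_bn, lam_bg)
    print("mean |lam - lam_bg| =", float(np.abs(lam_adapted - lam_bg).mean()))
```

In this formulation, a small prior variance keeps the adapted weights close to the background model, while a large one lets the small adaptation sample dominate; tuning that variance is what trades the background data off against the 25-70k words of matched BN data mentioned in the abstract.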
Pages: 382-399
Number of pages: 18