Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

Cited by: 421
Authors
Toutanova, K [1]
Manning, CD [1]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE 2000 JOINT SIGDAT CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND VERY LARGE CORPORA | 2000
DOI
10.3115/1117794.1117802
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract
This paper presents results for a maximum-entropy-based part-of-speech tagger, which achieves superior performance principally by enriching the information sources used for tagging. In particular, we get improved results by incorporating these features: (i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs. The best resulting accuracy for the tagger on the Penn Treebank is 96.86% overall, and 86.91% on previously unseen words.
Pages: 63-70
Page count: 8