Enriching the knowledge sources used in a maximum entropy part-of-speech tagger

Cited by: 415
Authors
Toutanova, K [1]
Manning, CD [1]
Affiliation
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE 2000 JOINT SIGDAT CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND VERY LARGE CORPORA | 2000
DOI
10.3115/1117794.1117802
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents results for a maximum-entropy-based part-of-speech tagger, which achieves superior performance principally by enriching the information sources used for tagging. In particular, we get improved results by incorporating these features: (i) more extensive treatment of capitalization for unknown words; (ii) features for the disambiguation of the tense forms of verbs; (iii) features for disambiguating particles from prepositions and adverbs. The best resulting accuracy for the tagger on the Penn Treebank is 96.86% overall, and 86.91% on previously unseen words.
Pages: 63-70
Page count: 8