A HMM POS Tagger for Micro-blogging Type Texts

Cited by: 0
Authors
Nand, Parma [1]
Perera, Rivindu [1]
Lal, Ramesh [1]
Affiliations
[1] Auckland Univ Technol, Sch Comp & Math Sci, Auckland 1010, New Zealand
Source
PRICAI 2014: TRENDS IN ARTIFICIAL INTELLIGENCE | 2014 / Vol. 8862
Keywords
INFORMATION EXTRACTION;
DOI
Not available
Chinese Library Classification number
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The high volume of communication via micro-blogging type messages has created an increased demand for text processing tools customised for the unstructured text genre. The available text processing tools developed on structured texts have been shown to deteriorate significantly when used on unstructured, micro-blogging type texts. In this paper, we present the results of testing an HMM (Hidden Markov Model) based POS (Part-Of-Speech) tagging model customised for unstructured texts. We also evaluated the tagger against published CRF based state-of-the-art POS tagging models customised for Tweet messages using three publicly available Tweet corpora. Finally, we did cross-validation tests with both taggers by training them on one Tweet corpus and testing them on another. The results show that the CRF-based POS tagger from GATE performed approximately 8% better than the HMM model at the token level; at the sentence level, however, the performances were approximately the same. The cross-validation experiments showed that both taggers' results deteriorated by approximately 25% at the token level and a massive 80% at the sentence level. A detailed analysis of this deterioration is presented, and the trained HMM model, including the data, has also been made available for research purposes. Since HMM training is orders of magnitude faster than CRF training, we conclude that the HMM model, despite trailing by about 8% in token accuracy, is still a viable alternative for real-time applications which demand rapid as well as progressive learning.
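The abstract reports accuracy at both the token level and the sentence level but does not define either metric. The following is a minimal sketch of one common reading, assuming that token accuracy is the fraction of correctly tagged tokens and that a sentence counts as correct only when every one of its tokens is tagged correctly. The function names, tag labels, and toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence

TaggedCorpus = Sequence[Sequence[str]]  # one inner sequence of tags per sentence


def token_accuracy(gold: TaggedCorpus, pred: TaggedCorpus) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    total = sum(len(g) for g in gold)
    correct = sum(
        g_tag == p_tag
        for g_sent, p_sent in zip(gold, pred)
        for g_tag, p_tag in zip(g_sent, p_sent)
    )
    return correct / total if total else 0.0


def sentence_accuracy(gold: TaggedCorpus, pred: TaggedCorpus) -> float:
    """Fraction of sentences in which every token is tagged correctly
    (assumed definition; the abstract does not spell it out)."""
    if not gold:
        return 0.0
    exact = sum(list(g) == list(p) for g, p in zip(gold, pred))
    return exact / len(gold)


# Toy example (hypothetical tags): one wrong tag in each of two sentences.
gold = [["PRP", "VBP", "NN"], ["UH", "NNP"], ["DT", "NN", "VBZ", "JJ"]]
pred = [["PRP", "VBP", "NN"], ["UH", "NN"], ["DT", "NN", "VBZ", "RB"]]

tok = token_accuracy(gold, pred)      # 7 of 9 tokens correct
sent = sentence_accuracy(gold, pred)  # 1 of 3 sentences fully correct
print(f"token accuracy: {tok:.3f}, sentence accuracy: {sent:.3f}")
```

Under this reading, a single wrong tag fails the whole sentence, so sentence accuracy can fall much more steeply than token accuracy. That would be consistent with the much larger sentence-level deterioration reported for the cross-corpus experiments.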
Pages: 157-169
Number of pages: 13