A web-based Bengali news corpus for named entity recognition

Cited by: 28
Authors
Ekbal, Asif [1]
Bandyopadhyay, Sivaji [1]
Affiliations
[1] Jadavpur Univ, Dept Comp Sci & Engn, Kolkata 700032, India
Keywords
web as corpus; news corpus; web-based tagged Bengali news corpus; named entity; named entity recognition
DOI
10.1007/s10579-008-9064-x
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The rapid development of language resources and tools using machine learning techniques for less computerized languages requires appropriately tagged corpora. A tagged Bengali news corpus has been developed from the web archive of a widely read Bengali newspaper. A web crawler retrieves the web pages in HyperText Markup Language (HTML) format from the news archive. At present, the corpus contains approximately 34 million wordforms. Named Entity Recognition (NER) systems based on pattern-based shallow parsing, with and without linguistic knowledge, have been developed using a part of this corpus. The NER system that uses linguistic knowledge performed better, yielding highest F-Score values of 75.40%, 72.30%, 71.37%, and 70.13% for person, location, organization, and miscellaneous names, respectively.
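The abstract reports F-Score values per entity type but does not say how they are computed. As a minimal sketch, assuming the conventional balanced F-score (the harmonic mean of precision and recall), the calculation might look like the following. The function names and the counts in the usage note are hypothetical illustrations, not figures or code from the paper.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from entity-level counts.

    A ratio with an empty denominator is reported as 0.0 (an assumption
    made here to avoid division by zero).
    """
    predicted = true_pos + false_pos   # entities the system tagged
    actual = true_pos + false_neg      # entities in the reference annotation
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced when beta == 1)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator
```

For instance, with hypothetical counts of 80 correctly tagged names, 20 spurious ones, and 20 missed ones, `precision_recall(80, 20, 20)` gives a precision and recall of 0.8 each, and `f_score` of those two values is also 0.8.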
Pages: 173-182
Page count: 10