Thai Named-Entity Recognition Using Class-based Language Modeling on Multiple-sized Subword Units

Cited by: 0
Authors
Saykhum, Kwanchiva [1, 2]
Boonpiam, Vataya [1]
Thatphithakkul, Nattanun [1]
Wutiwiwatchai, Chai [1]
Natthee, Cholwich [2]
Affiliations
[1] Natl Elect & Comp Technol Ctr, Human Language Technol Lab, Pathum Thani 12120, Thailand
[2] Thammasat Univ, Sch Informat & Comp Technol, Sirindhorn Int Inst Technol, Bangkok 12000, Thailand
Source
INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5 | 2008
Keywords
named-entity recognition; subword unit; language modeling
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This article presents early work on speech recognition of Thai named entities, a crucial out-of-vocabulary word problem in broadcast news transcription. Motivated by an analysis of Thai name structure, a statistical class-based language model is applied to multiple-sized subword units with a constraint on subword positions. The subwords can be defined automatically from their statistics. The proposed model is evaluated on Thai person name recognition in broadcast news data. With a subword inventory built from a very large training set of Thai names, only 0.7% of the subwords in the test set are out of vocabulary. The best-configured system, which incorporates both syllable merging and subword clustering algorithms, achieves approximately 40% syllable accuracy, with 25% of names fully discovered.
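To make the idea of a class-based language model over subword units concrete, the following is a minimal sketch of the standard class-based bigram factorization, P(u_i | u_{i-1}) ≈ P(u_i | c_i) · P(c_i | c_{i-1}), estimated by relative frequency. It is an illustration only, not the authors' system: the function name `train_class_bigram`, the romanized toy subwords, and the two position-style classes `INIT` and `FINAL` are all assumptions made for this example, and no smoothing, syllable merging, or automatic clustering is included.

```python
from collections import Counter

def train_class_bigram(sequences, unit_class):
    """Estimate P(unit | class) and P(class | previous class) by relative frequency."""
    emit = Counter()         # counts of (class, unit)
    class_count = Counter()  # counts of each class
    trans = Counter()        # counts of (previous class, class)
    prev_count = Counter()   # counts of each class used as a history
    for seq in sequences:
        prev = "<s>"  # hypothetical start-of-name marker
        for unit in seq:
            c = unit_class[unit]
            emit[(c, unit)] += 1
            class_count[c] += 1
            trans[(prev, c)] += 1
            prev_count[prev] += 1
            prev = c

    def prob(unit, prev_unit=None):
        """Class-based bigram probability of `unit` given `prev_unit`."""
        c = unit_class[unit]
        prev = unit_class[prev_unit] if prev_unit is not None else "<s>"
        if class_count[c] == 0 or prev_count[prev] == 0:
            return 0.0
        return (emit[(c, unit)] / class_count[c]) * (trans[(prev, c)] / prev_count[prev])

    return prob

# Toy data (assumed): each name is a sequence of subword units,
# and each unit is assigned to a class reflecting its position in the name.
names = [["som", "chai"], ["som", "sak"], ["wi", "chai"]]
unit_class = {"som": "INIT", "wi": "INIT", "chai": "FINAL", "sak": "FINAL"}

prob = train_class_bigram(names, unit_class)
# The pair ("wi", "sak") never occurs in the toy data, yet it receives a
# nonzero probability because the class pair (INIT, FINAL) was observed.
p_unseen = prob("sak", "wi")
```

Sharing statistics through classes in this way is what lets such a model assign probability to subword combinations absent from the training names, which is the property that matters for out-of-vocabulary names.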
Pages: 1586 / +
Page count: 2