A Corpus-Based Approach for Automatic Thai Unknown Word Recognition Using Boosting Techniques

Cited by: 3
Authors
Techo, Jakkrit [1]
Nattee, Cholwich [1]
Theeramunkong, Thanaruk [1]
Affiliations
[1] Thammasat Univ, Informat Comp & Commun Technol Sch, Sirindhorn Int Inst Technol, Bangkok, Thailand
Source
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2009 / Vol. E92D / No. 12
Keywords
unknown word recognition; word boundary detection; data mining; machine learning; ensemble learning;
DOI
10.1587/transinf.E92.D.2321
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
While classification techniques can be applied to automatic unknown word recognition in a language without word boundaries, they face the problem of unbalanced datasets, where the number of positive unknown word candidates is far smaller than that of negative candidates. To solve this problem, this paper presents a corpus-based approach that introduces a so-called group-based ranking evaluation technique into ensemble learning in order to generate a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. Given a classification model, the group-based ranking evaluation (GRE) is applied to construct a training dataset for learning the succeeding model, by weighting each of its candidates according to its rank and correctness when the candidates of an unknown word are considered as one group. A number of experiments have been conducted on a large Thai medical text to evaluate the performance of the proposed group-based ranking evaluation approach, namely V-GRE, compared to the conventional naive Bayes classifier and our vanilla version without ensemble learning. As a result, the proposed method achieves an accuracy of 90.93 +/- 0.50% when the first rank is selected, while it gains 97.26 +/- 0.26% when the top-ten candidates are considered, that is, 8.45% and 6.79% improvement over the conventional record-based naive Bayes classifier and the vanilla version. Another result, obtained by applying only the best features, shows 93.93 +/- 0.22% and up to 98.85 +/- 0.15% accuracy for top-1 and top-10, respectively. These are 3.97% and 9.78% improvements over naive Bayes and the vanilla version. Finally, an error analysis is given.
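The group-based evaluation described in the abstract can be illustrated with a minimal Python sketch. It treats all candidates of one unknown word as a group, ranks them by classifier score, and measures top-k accuracy per group. The abstract does not state the actual weighting formula, so `gre_weights` is a hypothetical stand-in and not the authors' V-GRE rule; the `(group_id, score, is_correct)` record layout and all function names are likewise assumptions made for illustration.

```python
from collections import defaultdict


def rank_groups(candidates):
    """Group candidates by unknown word and sort each group by score, best first.

    Each candidate is a (group_id, score, is_correct) tuple (assumed layout).
    """
    groups = defaultdict(list)
    for group_id, score, is_correct in candidates:
        groups[group_id].append((score, is_correct))
    return {
        g: sorted(items, key=lambda c: c[0], reverse=True)
        for g, items in groups.items()
    }


def top_k_accuracy(ranked, k):
    """Fraction of groups whose correct candidate appears within the top k ranks."""
    hits = sum(any(ok for _, ok in items[:k]) for items in ranked.values())
    return hits / len(ranked)


def gre_weights(ranked):
    """Hypothetical rank-and-correctness weights for training the next model.

    Assumed rule (NOT taken from the paper): groups already ranked correctly
    keep weight 1.0; a misranked positive is boosted by how far it fell; each
    negative that outranked the first positive gets weight 2.0.
    """
    weights = {}
    for g, items in ranked.items():
        # 0-based rank of the first correct candidate, or None if there is none.
        correct_rank = next((i for i, (_, ok) in enumerate(items) if ok), None)
        group_weights = []
        for i, (_, ok) in enumerate(items):
            if correct_rank is None or correct_rank == 0:
                group_weights.append(1.0)
            elif ok:
                group_weights.append(1.0 + correct_rank)
            elif i < correct_rank:
                group_weights.append(2.0)
            else:
                group_weights.append(1.0)
        weights[g] = group_weights
    return weights
```

Under these assumptions, a boosting loop would call `rank_groups` on the current model's scores, report `top_k_accuracy` for k = 1 and k = 10, and pass the output of `gre_weights` as sample weights when fitting the succeeding model.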
Pages: 2321-2333
Page count: 13