Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field

Cited by: 4
Authors
Paripremkul, Kannikar [1]
Sornil, Ohm [1]
Affiliations
[1] National Institute of Development Administration (NIDA), Graduate School of Applied Statistics, Bangkok, Thailand
Keywords
word segmentation; syllable segmentation; minimum text unit; conditional random field; segmentation
DOI
10.12720/jait.12.2.135-141
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Word segmentation is important to natural language processing tasks. Thai, like many Asian languages, has no word delimiters. Thai word segmentation requires not only dividing a sequence of characters into meaningful words, but also ensuring that the division is correct and relevant to the context of the sentence. With the popularity of social media, unknown, informal, and slang words are widely used, in addition to words adopted from other languages. Word segmentation methods, which are generally trained on formal corpora or dictionaries, do not yield good performance on such text. This research proposes a novel technique for Thai word segmentation in which the smallest units constituting words are first extracted and then combined into syllables using a Conditional Random Field. Words are then segmented by merging the syllables according to a set of rules learned from language characteristics. The technique is evaluated on both formal and informal datasets against a method based on a convolutional neural network, which currently gives the best performance for Thai word segmentation. The results show that the proposed method outperforms the compared system, achieving F-scores of 0.9965 and 0.9857 on formal and informal text, respectively.
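For readers unfamiliar with the metric, the sketch below shows one common way an F-score for word segmentation can be computed: a predicted word counts as correct only if its character span matches a reference word exactly. This is an illustrative assumption, not the evaluation protocol from the paper, which the abstract does not specify. The function names `word_spans` and `segmentation_f_score` and the toy input are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f_score(gold, predicted):
    """Word-level F-score: a predicted word is correct only if its span
    matches a gold word exactly (assumed protocol, not the paper's)."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example: both segmentations cover the same string "abcde".
gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
score = segmentation_f_score(gold, pred)  # only "ab" matches exactly
```

Matching on spans rather than on word strings keeps a word that appears at the wrong position from being counted as correct.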
Pages: 135-141
Number of pages: 7