Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field

Cited by: 5
Authors
Paripremkul, Kannikar [1]
Sornil, Ohm [1]
Affiliation
[1] Natl Inst Dev Adm NIDA, Grad Sch Appl Stat, Bangkok, Thailand
Keywords
word segmentation; syllable segmentation; minimum text unit; conditional random field; SEGMENTATION
DOI
10.12720/jait.12.2.135-141
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Word segmentation is important to natural language processing tasks. Thai, like many other Asian languages, has no word delimiters. Thai word segmentation requires not only dividing a sequence of characters into meaningful words, but also dividing them correctly with respect to the context of the sentence. With the popularity of social media, unknown, informal, and slang words are widely used, in addition to words adopted from other languages. Word segmentation methods, which are generally trained on formal corpora or dictionaries, do not perform well on such text. This research proposes a novel technique for Thai word segmentation in which the smallest units constituting words are first extracted and then combined into syllables using a Conditional Random Field. Words are then segmented by merging the syllables according to a set of rules learned from language characteristics. The technique is evaluated on both formal and informal datasets against a method based on a convolutional neural network, which currently gives the best performance for Thai word segmentation. The results show that the proposed method outperforms the comparison system, giving F-scores of 0.9965 and 0.9857 for formal and informal text, respectively.
Pages: 135-141
Number of pages: 7
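
The abstract describes merging smaller units into larger ones (text units into syllables, syllables into words) and reports results as F-scores. The following is a minimal sketch of those two ideas only, not the authors' implementation. The "B"/"I" label scheme, the function names `merge_units`, `to_spans`, and `f_score`, and the span-based definition of the F-score are all assumptions for illustration; the abstract does not specify them. The labels here are supplied by hand, whereas in the paper they would come from a trained Conditional Random Field.

```python
from typing import List, Set, Tuple


def merge_units(units: List[str], labels: List[str]) -> List[str]:
    """Merge units into segments using hypothetical B/I labels.

    'B' starts a new segment; any other label continues the current one.
    """
    if len(units) != len(labels):
        raise ValueError("units and labels must have the same length")
    segments: List[str] = []
    for unit, label in zip(units, labels):
        if label == "B" or not segments:
            segments.append(unit)   # open a new segment
        else:
            segments[-1] += unit    # extend the current segment
    return segments


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmentation into a set of (start, end) character offsets."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(predicted: List[str], gold: List[str]) -> float:
    """Span-level F-score: a word counts as correct only if both of its
    boundaries match the gold segmentation (assumed definition)."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

For example, `merge_units(["a", "b", "c", "d"], ["B", "I", "B", "I"])` groups the four placeholder units into two segments, and `f_score` then compares such a segmentation with a reference one over character spans.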