Enhancing Korean Named Entity Recognition With Linguistic Tokenization Strategies

Cited: 7
Authors
Kim, Gyeongmin [1 ]
Son, Junyoung [1 ]
Kim, Jinsung [1 ]
Lee, Hyunhee [1 ]
Lim, Heuiseok [1 ]
Affiliations
[1] Korea Univ, Dept Comp Sci & Engn, Seoul 02841, South Korea
Keywords
Tokenization; Task analysis; Linguistics; Hidden Markov models; Semantics; Syntactics; Solid modeling; Named entity recognition; Korean pre-trained language model; natural language processing; tokenization; linguistic segmentation; agglutinative language; REPRESENTATION;
DOI
10.1109/ACCESS.2021.3126882
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Tokenization is a significant first step in training a Pre-trained Language Model (PLM) and alleviates the challenging out-of-vocabulary problem in Natural Language Processing. Because the tokenization strategy shapes linguistic understanding, the composition of input features must reflect the characteristics of the language to achieve good model performance. This study answers the question "Which tokenization strategy best captures the characteristics of the Korean language for the Named Entity Recognition (NER) task with a language model?", focusing on tokenization because it strongly affects the quality of input features. We first present two significant challenges that the agglutinative characteristics of the Korean language pose for the NER task. We then quantitatively and qualitatively analyze how each tokenization strategy copes with these challenges. By adopting various linguistic segmentations such as morpheme, syllable, and subcharacter, we demonstrate the effectiveness of each strategy and compare the performance of the PLMs built on each one. We find that the strategy most consistent across the challenges of the Korean language is syllable-level tokenization based on SentencePiece.
Pages: 151814-151823 (10 pages)
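Two of the segmentation granularities the abstract compares, syllable and subcharacter (jamo), can be sketched with standard Unicode arithmetic over the precomposed Hangul syllable block (U+AC00–U+D7A3); morpheme segmentation requires a morphological analyzer and is omitted. This is a minimal illustrative sketch, not the authors' implementation, and the function names are assumptions.

```python
# Hangul syllable blocks are composed algorithmically: each block encodes a
# leading consonant, a vowel, and an optional trailing consonant.
HANGUL_BASE = 0xAC00                                   # first syllable block, '가'
LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0x1100, 0x1161, 0x11A7
NUM_VOWELS, NUM_TAILS = 21, 28

def syllable_tokens(text):
    """Syllable-level tokenization: each Hangul block is one token."""
    return list(text)

def jamo_tokens(text):
    """Subcharacter-level tokenization: decompose each syllable block into
    its leading consonant, vowel, and (if present) trailing consonant."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:                   # precomposed Hangul syllable
            idx = code - HANGUL_BASE
            lead = idx // (NUM_VOWELS * NUM_TAILS)
            vowel = (idx % (NUM_VOWELS * NUM_TAILS)) // NUM_TAILS
            tail = idx % NUM_TAILS
            out.append(chr(LEAD_BASE + lead))
            out.append(chr(VOWEL_BASE + vowel))
            if tail:                                   # tail index 0 = no final consonant
                out.append(chr(TAIL_BASE + tail))
        else:
            out.append(ch)                             # non-Hangul characters pass through
    return out
```

For example, the word 한국 ("Korea") yields two syllable tokens but five jamo tokens, since 한 decomposes into ᄒ + ᅡ + ᆫ and 국 into ᄀ + ᅮ + ᆨ.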