Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability

Cited by: 0
Authors
Huang, Kaiyu [1]
Liu, Junpeng [1]
Huang, Degen [1]
Xiong, Deyi [2, 3]
Liu, Zhuang [4]
Su, Jinsong [5]
Affiliations
[1] Dalian Univ Technol, Dalian, Peoples R China
[2] Tianjin Univ, Tianjin, Peoples R China
[3] Global Tone Commun Technol Co Ltd, Beijing, Peoples R China
[4] Dongbei Univ Finance & Econ, Dalian, Peoples R China
[5] Xiamen Univ, Xiamen, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021 | 2021
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Pre-trained language models (e.g., BERT) significantly alleviate two traditional challenging problems for Chinese word segmentation (CWS): segmentation ambiguity and out-of-vocabulary (OOV) words. However, such improvements are usually achieved on traditional benchmark datasets and not close to an important goal of CWS: practicability (i.e., low complexity as a standalone task and high beneficiality to downstream tasks). To make a trade-off between traditional evaluation and practicability for CWS, we propose a semi-supervised neural method via pseudo labels. The neural method consists of a teacher model and a student model, which distills knowledge from unlabeled data to the student model so as to improve both in-domain and out-of-domain CWS. Experiments show that our proposed method can not only keep the practicability of the lightweight student model but also improve the performance of segmentation effectively. We also evaluate a range of heterogeneous neural architectures of CWS on downstream Chinese NLP tasks. Results of further experiments demonstrate that our proposed segmenter is reliable and practical as a pre-processing step of the downstream NLP tasks at the minimum cost.
Pages: 4369-4381
Page count: 13