Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

Cited by: 1
Authors
Di, Donglin [1 ]
Song, Xianyang [2 ]
Zhang, Weinan [3 ]
Zhang, Yue [4 ]
Wang, Fanglin [1 ]
Affiliations
[1] Adv AI, Res & Dev, 80 Robinson Rd, Singapore 068898, Singapore
[2] Northeast Forestry Univ, Harbin 150040, Peoples R China
[3] Harbin Inst Technol, 92 West Dazhi St, Harbin, Heilongjiang, Peoples R China
[4] Westlake Univ, Hangzhou 310024, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Dialogue datasets; intent classification; slot-filling; Indonesian
DOI
10.1145/3575803
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Using off-the-shelf resources from resource-rich languages to transfer knowledge to low-resource languages has received considerable attention. However, the requirements for achieving reliable performance, including the scale of annotated data needed and an effective framework, remain poorly understood. To address the first question, we empirically investigate the cost-effectiveness of several methods for training intent classification and slot-filling models for Indonesian (ID) from scratch using English data. To address the second, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), consisting of "BiCF Mixing", "Latent Space Refinement", and a "Joint Decoder", to overcome the scarcity of dialogue data in the low-resource language. BiCF Mixing generates code-mixed data through a word-level alignment strategy that exploits word importance frequency and translation confidence. Latent Space Refinement then trains a new dialogue understanding model on the code-mixed data together with word embedding models. Finally, the Joint Decoder, built on a Bidirectional LSTM (BiLSTM) and a Conditional Random Field (CRF), produces the intent classification and slot-filling outputs. We also release a large-scale, finely labeled Indonesian dialogue dataset (ID-WOZ) and ID-BERT for experiments. BiCF achieves F1 scores of 93.56% on intent classification and 85.17% on slot filling. Extensive experiments demonstrate that our framework performs reliably and cost-efficiently across different scales of manually annotated Indonesian data.
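The "BiCF Mixing" step described in the abstract lends itself to a small illustration. The sketch below is a hypothetical reconstruction, not the authors' implementation: the EN-to-ID lexicon, frequency table, scoring function, and threshold are invented placeholders, meant only to show how an importance-frequency signal and a translation-confidence signal could jointly decide which English tokens get swapped for Indonesian translations when generating code-mixed training data.

```python
# Minimal, hypothetical sketch of BiCF-style code-mixing: replace English
# tokens with Indonesian translations when a combined score of corpus
# frequency ("importance-frequency") and translation confidence is high.
# Lexicon entries, frequencies, and the threshold are illustrative only.

from dataclasses import dataclass


@dataclass
class LexiconEntry:
    translation: str   # candidate Indonesian word
    confidence: float  # translation confidence in [0, 1]


# Toy word-level EN->ID lexicon with made-up confidence scores.
LEXICON = {
    "book": LexiconEntry("memesan", 0.90),
    "table": LexiconEntry("meja", 0.95),
    "two": LexiconEntry("dua", 0.99),
    "people": LexiconEntry("orang", 0.90),
}

# Toy corpus frequencies standing in for the importance-frequency signal.
FREQ = {"book": 120, "table": 80, "for": 500, "two": 300, "people": 150, "a": 900}
MAX_FREQ = max(FREQ.values())


def mixing_score(word: str) -> float:
    """Combine normalized corpus frequency with translation confidence."""
    entry = LEXICON.get(word)
    if entry is None:
        return 0.0
    freq = FREQ.get(word, 0) / MAX_FREQ
    return freq * entry.confidence


def bicf_mix(tokens: list[str], threshold: float = 0.08) -> list[str]:
    """Swap in the Indonesian translation when the score clears the threshold."""
    mixed = []
    for tok in tokens:
        if mixing_score(tok.lower()) >= threshold:
            mixed.append(LEXICON[tok.lower()].translation)
        else:
            mixed.append(tok)
    return mixed


if __name__ == "__main__":
    sentence = "book a table for two people".split()
    print(bicf_mix(sentence))
    # With these toy scores: ['memesan', 'a', 'meja', 'for', 'dua', 'orang']
```

Running the script prints the code-mixed token list; in a real pipeline the lexicon and confidence scores would come from the word-level alignment strategy over parallel EN-ID data that the abstract describes, rather than from hand-written tables.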
Pages: 20