Learning bilingual word embedding for automatic text summarization in low resource language

Cited: 5
Authors
Wijayanti, Rini [1 ,3 ]
Khodra, Masayu Leylia [1 ,2 ]
Surendro, Kridanto [1 ]
Widyantoro, Dwi H. [1 ,2 ]
Affiliations
[1] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung, Indonesia
[2] Univ Ctr Excellence Artificial Intelligence Vis, Inst Teknol Bandung, Nat Language Proc & Big Data Analyt U CoE AI VLB, Bandung, Indonesia
[3] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung 40132, Indonesia
Keywords
Bilingual word embedding; Cross-lingual transfer learning; Extractive summarization; Low-resource language;
DOI
10.1016/j.jksuci.2023.03.015
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Studies in low-resource languages have become more challenging with the increasing volume of texts in today's digital era. Moreover, the lack of labeled data and text-processing libraries further widens the research gap between high- and low-resource languages, such as English and Indonesian. This has led to the use of transfer learning, which applies pre-trained models to similar problems, even across languages, by using bilingual or cross-lingual word embedding. This study therefore investigates two bilingual word embedding methods, VecMap and BiVec, for the Indonesian-English language pair and evaluates them on bilingual lexicon induction and text summarization tasks. The generated bilingual embeddings were compared with MUSE (Multilingual Unsupervised and Supervised Embeddings), an existing multilingual word embedding created with the generative adversarial network method. Furthermore, VecMap was improved by creating shared vocabulary spaces and mapping the unshared ones between languages. The results showed that the embedding produced by the joint BiVec method performed better in intrinsic evaluation, especially with CSLS (Cross-Domain Similarity Local Scaling) retrieval. Meanwhile, the improved VecMap outperformed the regular variant by 16.6% without surpassing the BiVec evaluation score. These methods enabled model transfer between languages when applied to cross-lingual text summarization. Moreover, the ROUGE score outperformed classical text summarization after adding only 10% of the training dataset of the target language. (c) 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article.
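The CSLS retrieval criterion named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the standard CSLS formula (2·cos minus the mean cosine of each word to its k nearest cross-lingual neighbors), not the paper's own code; the function name and the choice k=10 are assumptions for the example.

```python
import numpy as np

def csls_scores(src_emb, tgt_emb, k=10):
    """CSLS: cosine similarity corrected for hubness in dense regions.

    src_emb: (n_src, d) source-language embeddings
    tgt_emb: (n_tgt, d) target-language embeddings
    Both are assumed already mapped into a shared bilingual space
    (e.g. by a method such as VecMap or BiVec).
    """
    # L2-normalize rows so that dot products equal cosine similarities
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = src @ tgt.T                                      # (n_src, n_tgt)

    # r_T(x): mean cosine of each source word to its k nearest target words
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)      # (n_src,)
    # r_S(y): mean cosine of each target word to its k nearest source words
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)      # (n_tgt,)

    # CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

For bilingual lexicon induction, each source word is then translated as the target word with the highest CSLS score in its row (`scores.argmax(axis=1)`), which reduces the "hub" words that plain nearest-neighbor retrieval over-selects.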
Pages: 224-235
Page count: 12
Related papers
31 records in total
  • [21] Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning
    Huang, Yuxin
    Gu, Huailing
    Yu, Zhengtao
    Gao, Yumeng
    Pan, Tong
    Xu, Jialong
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2024, 25 (01) : 121 - 134
  • [22] Contrastive Learning for Morphological Disambiguation Using Large Language Models in Low-Resource Settings
    Tolegen, Gulmira
    Toleu, Alymzhan
    Mussabayev, Rustam
    APPLIED SCIENCES-BASEL, 2024, 14 (21):
  • [23] Towards Two-Step Fine-Tuned Abstractive Summarization for Low-Resource Language Using Transformer T5
    Nasution, Salhazan
    Ferdiana, Ridi
    Hartanto, Rudy
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2025, 16 (02) : 1220 - 1230
  • [24] Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language
    Marreddy, Mounika
    Oota, Subba Reddy
    Vakada, Lakshmi Sireesha
    Chinni, Venkata Charan
    Mamidi, Radhika
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [25] Transfer Learning for End-to-End ASR to Deal with Low-Resource Problem in Persian Language
    Kermanshahi, Maryam Asadolahzade
    Akbari, Ahmad
    Nasersharif, Babak
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021,
  • [26] Sentiment analysis on a low-resource language dataset using multimodal representation learning and cross-lingual transfer learning
    Gladys, A. Aruna
    Vetriselvi, V.
    APPLIED SOFT COMPUTING, 2024, 157
  • [27] DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks
    Zaikis, Dimitrios
    Kokkas, Stylianos
    Vlahavas, Ioannis
    24TH INTERNATIONAL CONFERENCE ON ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EAAAI/EANN 2023, 2023, 1826 : 585 - 598
  • [28] Semi-supervised and active-learning scenarios: Efficient acoustic model refinement for a low resource Indian language
    Chellapriyadharshini, Maharajan
    Toffy, Anoop
    Raghavan, Srinivasa K. M.
    Ramasubramanian, V.
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1041 - 1045
  • [29] DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks
    Zaikis, Dimitrios
    Vlahavas, Ioannis
    NEURAL COMPUTING AND APPLICATIONS, 2025, 37 (17) : 10577 - 10590
  • [30] SEMI-SUPERVISED TRANSFER LEARNING FOR LANGUAGE EXPANSION OF END-TO-END SPEECH RECOGNITION MODELS TO LOW-RESOURCE LANGUAGES
    Kim, Jiyeon
    Kumar, Mehul
    Gowda, Dhananjaya
    Garg, Abhinav
    Kim, Chanwoo
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 984 - 988