Learning bilingual word embedding for automatic text summarization in low resource language

Cited by: 5
Authors
Wijayanti, Rini [1 ,3 ]
Khodra, Masayu Leylia [1 ,2 ]
Surendro, Kridanto [1 ]
Widyantoro, Dwi H. [1 ,2 ]
Affiliations
[1] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung, Indonesia
[2] Univ Ctr Excellence Artificial Intelligence Vis, Inst Teknol Bandung, Nat Language Proc & Big Data Analyt U CoE AI VLB, Bandung, Indonesia
[3] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung 40132, Indonesia
Keywords
Bilingual word embedding; Cross-lingual transfer learning; Extractive summarization; Low-resource language;
DOI
10.1016/j.jksuci.2023.03.015
CLC number
TP [Automation & Computer Technology];
Subject classification code
0812 ;
Abstract
Studies in low-resource languages have become more challenging with the increasing volume of texts in today's digital era. Moreover, the lack of labeled data and text-processing libraries further widens the research gap between high- and low-resource languages, such as English and Indonesian. This has motivated transfer learning, which applies pre-trained models to similar problems, even across languages, by using bilingual or cross-lingual word embeddings. This study therefore investigates two bilingual word embedding methods, VecMap and BiVec, for the Indonesian-English language pair and evaluates them on bilingual lexicon induction and text summarization tasks. The generated bilingual embeddings were compared with MUSE (Multilingual Unsupervised and Supervised Embeddings), an existing multilingual word embedding created with a generative adversarial network. Furthermore, VecMap was improved by creating a shared vocabulary space and mapping the vocabulary not shared between the languages. The results showed that the embedding produced by the joint BiVec method performed better in intrinsic evaluation, especially with CSLS (Cross-Domain Similarity Local Scaling) retrieval. Meanwhile, the improved VecMap outperformed the regular variant by 16.6% without surpassing the BiVec evaluation score. These methods enabled model transfer between languages when applied to cross-lingual text summarization. Moreover, the ROUGE score outperformed classical text summarization with only 10% of the target-language training dataset added. (c) 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article.
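The CSLS retrieval mentioned in the abstract is a standard nearest-neighbor criterion for bilingual lexicon induction: it penalizes "hub" vectors by subtracting each word's mean cosine similarity to its k nearest cross-lingual neighbors from twice the pairwise cosine similarity. A minimal NumPy sketch of that criterion follows; the function name, matrix shapes, and the toy random data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS score matrix between two embedding matrices (rows = word vectors).

    CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T(x) is the mean
    cosine similarity of x to its k nearest neighbors in the target space,
    and r_S(y) is the analogous mean for y in the source space.
    """
    # Normalize rows so dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                      # pairwise cosine
    # Mean similarity to the k nearest cross-lingual neighbors on each side.
    r_src = np.mean(np.sort(cos, axis=1)[:, -k:], axis=1)  # r_T(x), one per source word
    r_tgt = np.mean(np.sort(cos, axis=0)[-k:, :], axis=0)  # r_S(y), one per target word
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

# Toy usage: retrieve the best translation candidate for each source word.
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(5, 8)), rng.normal(size=(7, 8))
scores = csls_scores(src, tgt, k=3)
best = scores.argmax(axis=1)  # index of the top CSLS candidate per source word
```

In lexicon-induction evaluation, `best` would be compared against a gold bilingual dictionary; the neighborhood size k (10 in common practice) controls how strongly hubness is penalized.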
Pages: 224-235
Page count: 12