Learning bilingual word embedding for automatic text summarization in low resource language

Cited: 5
Authors
Wijayanti, Rini [1 ,3 ]
Khodra, Masayu Leylia [1 ,2 ]
Surendro, Kridanto [1 ]
Widyantoro, Dwi H. [1 ,2 ]
Affiliations
[1] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung, Indonesia
[2] Univ Ctr Excellence Artificial Intelligence Vis, Inst Teknol Bandung, Nat Language Proc & Big Data Analyt U CoE AI VLB, Bandung, Indonesia
[3] Inst Teknol Bandung, Sch Elect Engn & Informat, Bandung 40132, Indonesia
Keywords
Bilingual word embedding; Cross-lingual transfer learning; Extractive summarization; Low-resource language;
DOI
10.1016/j.jksuci.2023.03.015
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Studies in low-resource languages have become more challenging with the increasing volume of texts in today's digital era. Moreover, the lack of labeled data and text-processing libraries further widens the research gap between high- and low-resource languages, such as English and Indonesian. This has led to the use of transfer learning, which applies pre-trained models to similar problems, even across languages, by using bilingual or cross-lingual word embedding. This study therefore investigates two bilingual word embedding methods, VecMap and BiVec, for the Indonesian-English language pair and evaluates them on bilingual lexicon induction and text summarization tasks. The generated bilingual embeddings were compared with MUSE (Multilingual Unsupervised and Supervised Embeddings), an existing multilingual word embedding created with the generative adversarial network method. Furthermore, VecMap was improved by creating shared vocabulary spaces and mapping the unshared ones between languages. The results showed that the embedding produced by the joint BiVec method performed better in intrinsic evaluation, especially with CSLS (Cross-Domain Similarity Local Scaling) retrieval. Meanwhile, the improved VecMap outperformed the regular variant by 16.6% without surpassing the BiVec evaluation score. These methods enabled model transfer between languages when applied to cross-lingual text summarization. Moreover, the ROUGE score outperformed classical text summarization after adding only 10% of the training dataset of the target language. (c) 2023 The Author(s). Published by Elsevier B.V. on behalf of King Saud University. This is an open access article.
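The CSLS retrieval criterion named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the standard CSLS formula (2·cos minus the mean cosine of each word to its k nearest cross-lingual neighbors), not the paper's own code; the function name and the choice k=10 are assumptions for the example.

```python
import numpy as np

def csls_scores(src_emb, tgt_emb, k=10):
    """CSLS: cosine similarity corrected for hubness in dense regions.

    src_emb: (n_src, d) source-language embeddings
    tgt_emb: (n_tgt, d) target-language embeddings
    Both are assumed already mapped into a shared bilingual space
    (e.g. by a method such as VecMap or BiVec).
    """
    # L2-normalize rows so that dot products equal cosine similarities
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    cos = src @ tgt.T                                      # (n_src, n_tgt)

    # r_T(x): mean cosine of each source word to its k nearest target words
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)      # (n_src,)
    # r_S(y): mean cosine of each target word to its k nearest source words
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)      # (n_tgt,)

    # CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

For bilingual lexicon induction, each source word is then translated as the target word with the highest CSLS score in its row (`scores.argmax(axis=1)`), which reduces the "hub" words that plain nearest-neighbor retrieval over-selects.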
Pages: 224-235
Page count: 12
Related papers
31 records in total
  • [21] Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning
    Huang, Yuxin
    Gu, Huailing
    Yu, Zhengtao
    Gao, Yumeng
    Pan, Tong
    Xu, Jialong
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2024, 25 (01) : 121 - 134
  • [22] Contrastive Learning for Morphological Disambiguation Using Large Language Models in Low-Resource Settings
    Tolegen, Gulmira
    Toleu, Alymzhan
    Mussabayev, Rustam
    APPLIED SCIENCES-BASEL, 2024, 14 (21):
  • [23] Towards Two-Step Fine-Tuned Abstractive Summarization for Low-Resource Language Using Transformer T5
    Nasution, Salhazan
    Ferdiana, Ridi
    Hartanto, Rudy
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2025, 16 (02) : 1220 - 1230
  • [24] Multi-Task Text Classification using Graph Convolutional Networks for Large-Scale Low Resource Language
    Marreddy, Mounika
    Oota, Subba Reddy
    Vakada, Lakshmi Sireesha
    Chinni, Venkata Charan
    Mamidi, Radhika
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [25] Transfer Learning for End-to-End ASR to Deal with Low-Resource Problem in Persian Language
    Kermanshahi, Maryam Asadolahzade
    Akbari, Ahmad
    Nasersharif, Babak
    2021 26TH INTERNATIONAL COMPUTER CONFERENCE, COMPUTER SOCIETY OF IRAN (CSICC), 2021,
  • [26] Sentiment analysis on a low-resource language dataset using multimodal representation learning and cross-lingual transfer learning
    Gladys, A. Aruna
    Vetriselvi, V.
    APPLIED SOFT COMPUTING, 2024, 157
  • [27] DACL: A Domain-Adapted Contrastive Learning Approach to Low Resource Language Representations for Document Clustering Tasks
    Zaikis, Dimitrios
    Kokkas, Stylianos
    Vlahavas, Ioannis
    24TH INTERNATIONAL CONFERENCE ON ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EAAAI/EANN 2023, 2023, 1826 : 585 - 598
  • [28] Semi-supervised and active-learning scenarios: Efficient acoustic model refinement for a low resource Indian language
    Chellapriyadharshini, Maharajan
    Toffy, Anoop
    Raghavan, Srinivasa K. M.
    Ramasubramanian, V.
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1041 - 1045
  • [29] DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks
    Zaikis, Dimitrios
    Vlahavas, Ioannis
    NEURAL COMPUTING AND APPLICATIONS, 2025, 37 (17) : 10577 - 10590
  • [30] SEMI-SUPERVISED TRANSFER LEARNING FOR LANGUAGE EXPANSION OF END-TO-END SPEECH RECOGNITION MODELS TO LOW-RESOURCE LANGUAGES
    Kim, Jiyeon
    Kumar, Mehul
    Gowda, Dhananjaya
    Garg, Abhinav
    Kim, Chanwoo
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 984 - 988