Unifying Cross-lingual Summarization and Machine Translation with Compression Rate

Cited by: 5
Authors
Bai, Yu [1, 2]
Huang, Heyan [1, 3]
Fan, Kai [4]
Gao, Yang [1]
Zhu, Yiming [1]
Zhan, Jiaao [1]
Chi, Zewen [1]
Chen, Boxing [4]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci, Beijing, Peoples R China
[2] Beijing Engn Res Ctr High Volume Language Informa, Beijing, Peoples R China
[3] Southeast Acad Informat Technol, Putian, Fujian, Peoples R China
[4] Alibaba DAMO Acad, Machine Intelligence Technol Lab, Hangzhou, Peoples R China
Source
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22) | 2022
Funding
National Natural Science Foundation of China;
Keywords
Cross-lingual Summarization; Machine Translation; Compression Rate;
DOI
10.1145/3477495.3532071
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Cross-Lingual Summarization (CLS) is the task of extracting the important information from a source document and summarizing it in another language. It is challenging because a system must understand, summarize, and translate at the same time, which makes it closely related to Monolingual Summarization (MS) and Machine Translation (MT). In practice, training resources for MT are far more abundant than those for cross-lingual and monolingual summarization, so incorporating MT corpora into CLS should benefit its performance. However, existing work only uses a simple multi-task framework to bring MT in and lacks deeper exploration. In this paper, we propose a novel task, Cross-lingual Summarization with Compression rate (CSC), to benefit CLS with large-scale MT corpora. By introducing the compression rate, the information ratio between the source and the target text, we regard MT as a special case of CLS with a compression rate of 100%. The two can therefore be trained as a unified task, sharing knowledge more effectively. However, a large gap exists between the MT task and the CLS task: samples with compression rates between 30% and 90% are extremely rare. To bridge the two tasks smoothly, we propose an effective data augmentation method that produces document-summary pairs with different compression rates. The proposed method not only improves performance on the CLS task but also makes it possible to control the length of generated summaries. Experiments demonstrate that our method outperforms various strong baselines on three cross-lingual summarization datasets. We released our code and data at https://github.com/ybai-nlp/CLS_CIR.
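The idea of treating MT as CLS with a compression rate of 100% can be illustrated with a small sketch. The abstract defines the compression rate only as an information ratio between source and target text and does not say how it is measured, so the code below assumes a plain token-count ratio as a stand-in. The function names and the `<cr_N>` control tag are hypothetical and are not taken from the released code.

```python
from typing import List, Tuple


def compression_rate(source_tokens: List[str], target_tokens: List[str]) -> float:
    """Token-count ratio used as an assumed proxy for the compression rate."""
    if not source_tokens:
        raise ValueError("source must contain at least one token")
    return len(target_tokens) / len(source_tokens)


def rate_bucket(rate: float, width: int = 10) -> str:
    """Map a rate to a hypothetical control tag such as '<cr_30>'.

    Rates are clipped to [0, 1] so that a translation slightly longer
    than its source still falls into the 100% bucket.
    """
    clipped = min(max(rate, 0.0), 1.0)
    percent = int(round(clipped * 100 / width)) * width
    return f"<cr_{percent}>"


def tag_example(source_tokens: List[str], target_tokens: List[str]) -> Tuple[List[str], float]:
    """Prepend the bucket tag to the source so one model can see both tasks."""
    rate = compression_rate(source_tokens, target_tokens)
    return [rate_bucket(rate)] + source_tokens, rate


# A translation pair: target roughly as long as the source.
mt_source, mt_rate = tag_example(["w"] * 8, ["t"] * 8)
# A summarization pair: target much shorter than the source.
cls_source, cls_rate = tag_example(["w"] * 10, ["t"] * 3)
```

Under this assumption a translation pair lands in the 100% bucket and a typical summary lands near 30%, which is the range gap the abstract says the data augmentation method is meant to fill.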
Pages: 1087-1097
Page count: 11