Attention Analysis and Calibration for Transformer in Natural Language Generation

Cited by: 3
Authors
Lu, Yu [1 ,2 ]
Zhang, Jiajun [1 ,2 ]
Zeng, Jiali [3 ]
Wu, Shuangzhi [3 ]
Zong, Chengqing [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China
[3] Dept Tencent Cloud Xiaowei, Beijing 100089, Peoples R China
Keywords
Attention mechanism; interpretability; Transformer; attention calibration
DOI
10.1109/TASLP.2022.3180678
Chinese Library Classification
O42 [Acoustics];
Discipline Code
070206; 082403;
Abstract
The attention mechanism has become ubiquitous in neural machine translation, dynamically selecting relevant contexts for different translations. Apart from performance gains, the attention weights assigned to input tokens are often used to explain that high-attention tokens contribute more to the prediction. However, many works question whether this assumption holds in text classification by manually manipulating attention weights and observing decision flips. This article extends the question to Transformer-based neural machine translation, which relies heavily on cross-lingual attention to produce accurate translations but is relatively understudied in this respect. We first design a mask perturbation model that automatically assesses each input token's contribution to the model output. We then test whether the token contributing most to the current translation receives the highest attention weight. We find that it sometimes does not, and that whether it does depends closely on the entropy of the attention weights, the syntactic role of the current generation, and the language pair. We also rethink the discrepancy between attention weights and word alignments from the view of unreliable attention weights. These observations further motivate us to calibrate the cross-lingual multi-head attention by attaching more attention to indispensable tokens, i.e., tokens whose removal leads to a dramatic performance drop. Empirical experiments on translation tasks of different scales and on text summarization demonstrate that our calibration methods significantly outperform strong baselines.
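The mask-perturbation analysis described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal illustration assuming a generic scoring function (score_fn, e.g. the log-probability a model assigns to the reference output), a hypothetical mask symbol, and made-up attention weights. Each source token's contribution is taken as the score drop when that token is masked, and the resulting ranking is compared against cross-attention weights.

```python
# Minimal sketch of a mask-perturbation contribution check (illustrative only,
# not the paper's implementation). All names and the toy model are assumptions.
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical mask symbol

def token_contributions(src: Sequence[str],
                        score_fn: Callable[[List[str]], float]) -> List[float]:
    """Contribution of token i = score(full input) - score(input with token i masked)."""
    base = score_fn(list(src))
    contribs = []
    for i in range(len(src)):
        perturbed = list(src)
        perturbed[i] = MASK
        contribs.append(base - score_fn(perturbed))
    return contribs

def highest_attention_is_most_contributive(contribs: List[float],
                                           attn: List[float]) -> bool:
    """Does the token with the highest attention weight also contribute most?"""
    return max(range(len(attn)), key=attn.__getitem__) == \
           max(range(len(contribs)), key=contribs.__getitem__)

if __name__ == "__main__":
    # Toy stand-in for a real model score (e.g. log-probability of the reference translation).
    src = ["the", "cat", "sat"]
    weights = {"the": 0.1, "cat": 0.7, "sat": 0.2}

    def toy_score(tokens: List[str]) -> float:
        return sum(0.0 if t == MASK else weights[t] for t in tokens)

    contribs = token_contributions(src, toy_score)
    attn = [0.5, 0.3, 0.2]  # hypothetical cross-attention weights at one decoding step
    print(contribs)  # roughly [0.1, 0.7, 0.2]: "cat" contributes most
    print(highest_attention_is_most_contributive(contribs, attn))  # False: attention peaks on "the"
```

In this toy example the highest-attention token is not the most contributive one; the paper's calibration idea is to push extra attention onto such indispensable tokens, whose removal causes the largest score drop.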
Pages: 1927-1938
Page count: 12