Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Source
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE | 2019
Funding
US National Institutes of Health
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention through the lens of kernels. More precisely, attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This formulation gives us a better way to understand the individual components of the Transformer's attention, such as how to better integrate positional embeddings. Another important advantage of the kernel-based formulation is that it opens up a larger design space for composing the Transformer's attention. As an example, we propose a new variant of the Transformer's attention that models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model while requiring less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
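To make the kernel-smoother reading of attention concrete, below is a minimal NumPy sketch, not taken from the paper's released code; the function names kernel_smoother_attention and exp_dot_product_kernel are illustrative assumptions, and the example covers a single unmasked attention head. Each output vector is a normalized, kernel-weighted average of the value vectors, and choosing the exponential of the scaled dot product as the kernel recovers standard softmax attention.

import numpy as np

def kernel_smoother_attention(queries, keys, values, kernel):
    # scores[i, j]: kernel similarity between query i and key j.
    scores = np.array([[kernel(q, k) for k in keys] for q in queries])
    # Normalize per query so each row of weights sums to 1, then take the
    # weighted average of the value vectors (a kernel smoother over inputs).
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ values

def exp_dot_product_kernel(q, k):
    # Exponential of the scaled dot product; with this kernel the smoother
    # coincides with standard scaled dot-product (softmax) attention.
    return np.exp(q @ k / np.sqrt(len(q)))

rng = np.random.default_rng(0)
d, n = 8, 5
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

out_kernel = kernel_smoother_attention(Q, K, V, exp_dot_product_kernel)

# Reference computation: softmax(Q K^T / sqrt(d)) V
logits = Q @ K.T / np.sqrt(d)
softmax = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
out_softmax = softmax @ V

print(np.allclose(out_kernel, out_softmax))  # True

Other kernel choices, such as the product of symmetric kernels proposed in the abstract, could in principle be explored by swapping in a different kernel function in this smoother.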
Pages: 4344-4353
Number of pages: 10