Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel

Cited by: 0
Authors
Tsai, Yao-Hung Hubert [1 ]
Bai, Shaojie [1 ]
Yamada, Makoto [3 ,4 ]
Morency, Louis-Philippe [2 ]
Salakhutdinov, Ruslan [1 ]
Affiliations
[1] Carnegie Mellon Univ, Machine Learning Dept, Pittsburgh, PA 15213 USA
[2] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA
[3] Kyoto Univ, Kyoto, Japan
[4] RIKEN AIP, Wako, Saitama, Japan
Source
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE | 2019
Funding
US National Institutes of Health
Keywords
DOI
Not available
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the stream. In this paper, we present a new formulation of attention through the lens of kernels. More precisely, attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This formulation gives us a better way to understand the individual components of the Transformer's attention, such as how to better integrate positional embeddings. Another important advantage of the kernel-based formulation is that it opens up a larger design space for composing the Transformer's attention. As an example, we propose a new variant of the Transformer's attention that models the input as a product of symmetric kernels. This approach achieves performance competitive with the current state-of-the-art model while requiring less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
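To make the kernel-smoother reading of attention concrete, below is a minimal NumPy sketch, not taken from the paper's released code; the function names kernel_smoother_attention and exp_dot_product_kernel are illustrative assumptions, and the example covers a single unmasked attention head. Each output vector is a normalized, kernel-weighted average of the value vectors, and choosing the exponential of the scaled dot product as the kernel recovers standard softmax attention.

import numpy as np

def kernel_smoother_attention(queries, keys, values, kernel):
    # scores[i, j]: kernel similarity between query i and key j.
    scores = np.array([[kernel(q, k) for k in keys] for q in queries])
    # Normalize per query so each row of weights sums to 1, then take the
    # weighted average of the value vectors (a kernel smoother over inputs).
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ values

def exp_dot_product_kernel(q, k):
    # Exponential of the scaled dot product; with this kernel the smoother
    # coincides with standard scaled dot-product (softmax) attention.
    return np.exp(q @ k / np.sqrt(len(q)))

rng = np.random.default_rng(0)
d, n = 8, 5
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))

out_kernel = kernel_smoother_attention(Q, K, V, exp_dot_product_kernel)

# Reference computation: softmax(Q K^T / sqrt(d)) V
logits = Q @ K.T / np.sqrt(d)
softmax = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
out_softmax = softmax @ V

print(np.allclose(out_kernel, out_softmax))  # True

Other kernel choices, such as the product of symmetric kernels proposed in the abstract, could in principle be explored by swapping in a different kernel function in this smoother.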
Pages: 4344-4353
Number of pages: 10