From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers

Cited by: 0
Authors
Choromanski, Krzysztof [1 ,2 ]
Lin, Han [2 ]
Chen, Haoxian [2 ]
Zhang, Tianyi [2 ]
Sehanobish, Arijit
Likhosherstov, Valerii [3 ]
Parker-Holder, Jack [4 ]
Sarlos, Tamas [5 ]
Weller, Adrian [3 ,6 ]
Weingarten, Thomas [7 ]
Affiliations
[1] Google Brain Robotics, Mountain View, CA 94043 USA
[2] Columbia Univ, New York, NY 10027 USA
[3] Univ Cambridge, Cambridge, England
[4] Univ Oxford, Oxford, England
[5] Google Res, Mountain View, CA USA
[6] Alan Turing Inst, London, England
[7] Google, Mountain View, CA USA
Source
INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022
Keywords
DOI: not available
Chinese Library Classification
TP18 [Theory of artificial intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformer architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. Moreover, by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several previously unknown results, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques, ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
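As a rough illustration of the kind of mechanism the abstract alludes to (a minimal sketch, not the authors' implementation), the snippet below shows how a Toeplitz relative-positional-encoding (RPE) mask can be folded into linear (kernelized) attention using FFT-based Toeplitz matrix-vector products, so the masked attention costs O(L log L) per feature dimension instead of O(L^2). The feature map, the function names (`toeplitz_matvec`, `rpe_masked_linear_attention`), and the Gaussian RPE profile are all hypothetical choices made for the example.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, v):
    """Multiply the L x L Toeplitz matrix T (T[i, 0] = first_col[i],
    T[0, j] = first_row[j]) by v in O(L log L) via circulant embedding + FFT."""
    L = len(v)
    circ = np.concatenate([first_col, [0.0], first_row[1:][::-1]])   # first column of the 2L circulant
    prod = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(np.concatenate([v, np.zeros(L)])))
    return prod.real[:L]

def rpe_masked_linear_attention(Q, K, V, rpe):
    """Compute the unnormalized masked attention (phi(Q) phi(K)^T  (Hadamard)  M) V
    without materializing the L x L matrix, where M[i, j] = rpe[L - 1 + i - j]
    is a Toeplitz RPE mask.  Uses the identity
    (A (Hadamard) M) V = sum_m diag(phiQ[:, m]) M diag(phiK[:, m]) V."""
    phiQ, phiK = np.exp(Q), np.exp(K)      # toy positive feature map (stand-in for a kernel feature map)
    L, d = phiQ.shape
    idx = np.arange(L)
    first_col = rpe[L - 1 + idx]           # mask values for offsets 0 .. L-1
    first_row = rpe[L - 1 - idx]           # mask values for offsets 0 .. -(L-1)
    out = np.zeros_like(V, dtype=float)
    for m in range(d):
        weighted_V = phiK[:, m:m + 1] * V  # diag(phiK[:, m]) V
        for c in range(V.shape[1]):
            out[:, c] += phiQ[:, m] * toeplitz_matvec(first_col, first_row, weighted_V[:, c])
    return out

# Toy usage: L = 8 tokens, head dimension 4, Gaussian mask over relative offsets.
L, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, d))
offsets = np.arange(-(L - 1), L)
rpe = np.exp(-0.1 * offsets**2)
print(rpe_masked_linear_attention(Q, K, V, rpe).shape)   # (8, 4)
```

The same Hadamard-product decomposition is what makes structured (e.g., block-Toeplitz or graph-induced) masks compatible with sub-quadratic attention: only fast matrix-vector products with the mask are required, never the full L x L mask itself.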
Pages: 22