DCT-Former: Efficient Self-Attention with Discrete Cosine Transform

Authors
Carmelo Scribano
Giorgia Franchini
Marco Prato
Marko Bertogna
Affiliations
[1] University of Modena and Reggio Emilia, Department of Physics, Informatics and Mathematics
[2] University of Parma, Department of Mathematical, Physical and Computer Sciences
Source
Journal of Scientific Computing | 2023, Vol. 94
Keywords
Transformers; Self-attention; Natural language processing; Deep learning; Discrete cosine transform; Frequency domain
DOI
Not available
Abstract
Since their introduction, Transformer architectures have emerged as the dominant choice for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, whose memory consumption and number of operations both grow as $O(n^2)$, where n is the input sequence length, thus limiting applications that require modeling very long sequences. Several approaches have been proposed in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. An extensive set of experiments shows that our method uses less memory for the same performance, while also drastically reducing inference time. Moreover, we believe that the results of our research might serve as a starting point for a broader family of deep neural models with a reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public.
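The abstract does not spell out the paper's exact formulation, but the general idea it describes — using a truncated Discrete Cosine Transform as a lossy compression step to shrink the attention computation — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`dct_matrix`, `dct_compressed_attention`) and the choice to compress keys and values along the sequence axis are assumptions for the example.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix (n x n); row k holds the k-th frequency.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0] *= np.sqrt(0.5)  # scale the DC row so C @ C.T == I
    return C

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dct_compressed_attention(Q, K, V, m):
    # Illustrative sketch: compress K and V along the sequence axis to their
    # m lowest-frequency DCT coefficients, so the attention score matrix is
    # n x m instead of n x n (memory and FLOPs scale as O(n*m)).
    n, d = K.shape
    C = dct_matrix(n)[:m]              # (m, n): keep the m lowest frequencies
    K_c, V_c = C @ K, C @ V            # (m, d): compressed keys and values
    scores = Q @ K_c.T / np.sqrt(d)    # (n, m) instead of (n, n)
    return softmax(scores) @ V_c       # (n, d) output, same shape as usual
```

With m fixed (or growing slowly with n), the dominant cost drops from quadratic to linear in the sequence length, at the price of discarding the high-frequency content of the keys and values — the same trade-off JPEG makes when it quantizes high-frequency DCT coefficients.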