[1]
Radford A, Wu J, Child R, Luan D, Amodei D and Sutskever I(2019)Language models are unsupervised multitask learners OpenAI Blog 1 9
[2]
Brown T, Mann B, Ryder N, Subbiah M et al(2020)Language models are few-shot learners Adv. Neural. Inf. Process. Syst. 33 1877-1901
[3]
Ahmed N, Natarajan T and Rao K R(1974)Discrete cosine transform IEEE Trans. Comput. 100 90-93
[4]
Hochreiter S and Schmidhuber J(1997)Long short-term memory Neural Comput. 9 1735-1780
[5]
Shao X and Johnson S G(2008)Type-II/III DCT/DST algorithms with reduced number of arithmetic operations Signal Process. 88 1553-1564
[6]
Lu J et al(2021)SOFT: Softmax-free transformer with linear complexity Adv. Neural. Inf. Process. Syst. 34 21297-21309
[7]
Ren H et al(2021)Combiner: Full attention transformer with sparse computation cost Adv. Neural. Inf. Process. Syst. 34 22470-22482
[8]
Nguyen T et al(2021)FMMformer: Efficient and flexible transformer via decomposed near-field and far-field attention Adv. Neural. Inf. Process. Syst. 34 29449-29463
[9]
Jaszczur S et al(2021)Sparse is enough in scaling transformers Adv. Neural. Inf. Process. Syst. 34 9895-9907
[10]
Zhu C et al(2021)Long-short transformer: Efficient transformers for language and vision Adv. Neural. Inf. Process. Syst. 34 17723-17736