Anchor Attention, Small Cache: Code Generation With Large Language Models

Cited by: 0
Authors
Zhang, Xiangyu [1 ]
Zhou, Yu [1 ]
Yang, Guang [1 ]
Gall, Harald C. [2 ]
Chen, Taolue [3 ]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 211106, Peoples R China
[2] Univ Zurich, CH-8050 Zurich, Switzerland
[3] Birkbeck Univ London, Sch Comp & Math Sci, London WC1E 7HX, England
Funding
National Natural Science Foundation of China;
Keywords
Codes; Attention mechanisms; Semantics; Decoding; Context modeling; Transformers; Needles; Large language models; Data mining; Computational modeling; Code generation; attention mechanism; transformers; large language models;
DOI
10.1109/TSE.2025.3570680
CLC Number
TP31 [Computer Software];
Discipline Code
081202; 0835;
Abstract
The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand for computational resources has hindered broader deployment and raised environmental concerns. A common strategy for reducing computational cost is to cache the Key-Value (KV) states of the attention mechanism adopted by mainstream LLMs; this avoids repeated attention computations but incurs significant memory overhead. Current practice in NLP often resorts to sparse attention, which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we empirically analyze the distribution of attention weights within code generation models and uncover a sparsity pattern, namely the aggregation of information at specific anchor points. Based on this observation, we propose AnchorCoder, a novel approach featuring token-wise anchor attention, which extracts and compresses contextual information, and layer-wise anchor attention, which enables cross-layer communication to mitigate the excessive superposition caused by the compression. Extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which consistently achieves a significant (at least 70%) reduction in KV cache requirements while preserving most of the model's performance.
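To make the core idea concrete, the sketch below illustrates KV-cache compression around anchor positions where contextual information aggregates. It is a minimal, hypothetical illustration only: the stride-based anchor rule, the recent-window heuristic, and all function names here are assumptions for exposition, not the authors' actual AnchorCoder mechanism, whose token-wise and layer-wise anchor attention are learned components.

# Hypothetical sketch: cache K/V only at anchor positions (plus a recent
# window) instead of every past token. Anchor selection here is a simple
# stride rule chosen for illustration; it is NOT the AnchorCoder design.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Single-query scaled dot-product attention over cached K/V."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)      # (T,)
    return softmax(scores) @ V       # (d,)

def compress_kv(K, V, anchor_stride=8, recent_window=16):
    """Keep every `anchor_stride`-th position plus the most recent tokens;
    all other K/V entries are dropped from the cache."""
    T = K.shape[0]
    keep = sorted(set(range(0, T, anchor_stride)) |
                  set(range(max(0, T - recent_window), T)))
    return K[keep], V[keep]

rng = np.random.default_rng(0)
T, d = 256, 64
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))
q = rng.normal(size=d)

K_small, V_small = compress_kv(K, V)
print("full cache:", K.shape[0], "entries; compressed:", K_small.shape[0])
out_full, out_small = attention(q, K, V), attention(q, K_small, V_small)
print("cosine similarity:", float(out_full @ out_small /
      (np.linalg.norm(out_full) * np.linalg.norm(out_small))))

With random K/V the compressed output is only a rough approximation; the paper's point is that in trained code-generation models attention mass concentrates at anchor points, so discarding non-anchor KV states costs little accuracy while cutting cache size by 70% or more.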
Pages: 1866-1881
Page count: 16