20 entries in total
[1]
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale [C]. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022.
[2]
[Anonymous]. Huggingface: RoBERTa. 2024.
[3]
[Anonymous]. vllm-project: vLLM: Easy, fast, and cheap LLM serving for everyone. 2024.
[4]
Bai Junjie. ONNX: Open Neural Network Exchange. 2019.
[5]
Dao T. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[6]
Devlin J. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, 2019: 4171.
[7]
Gerganov G. GGML: tensor library for machine learning. 2023.
[8]
Gerganov G. Inference of LLaMA model in pure C/C++. 2023.
[9]
Efficient Memory Management for Large Language Model Serving with PagedAttention [C]. Proceedings of the Twenty-Ninth ACM Symposium on Operating Systems Principles (SOSP 2023), 2023: 611-626.
[10]
Li S G. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Vol. 1, 2023: 2391.