Towards Efficient Inference on Mobile Device via Pruning

Cited by: 0
Authors
Wang, Zhiyuan [1 ]
Tan, Haisheng [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China
Source
2024 10th International Conference on Big Data Computing and Communications (BIGCOM) | 2024
Keywords
inference; edge intelligence; mobile device; pruning;
DOI
10.1109/BIGCOM65357.2024.00013
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Code
081202;
Abstract
Mobile edge intelligence requires deploying deep neural network models on mobile devices. Model sparsity techniques, including low-precision quantization and pruning, have been proposed to achieve higher compression rates at a small cost in model accuracy. However, deploying these sparsity techniques on mobile devices remains challenging: because mobile inference frameworks cannot fine-tune model weights, any pruning must be applied carefully. To address these challenges, we propose PoI (Pruning on Inference), an inference framework that integrates low-precision quantization and pruning optimization into inference computation, enabling efficient deployment of deep learning models on mobile devices. The design is based on the observation that memory access patterns have a crucial impact on inference latency during model execution, which creates an opportunity to perform efficient pruning operations at inference time. Specifically, PoI supports multiple quantization bit-widths to fully exploit the computational capabilities of different hardware. After the user specifies an inference model, the framework checks which quantization bit-widths the current device supports and sets the corresponding compilation optimization options. At execution time, it selects an appropriate pruning ratio based on the user-specified tolerance for accuracy loss, and within each operator it prunes the corresponding fraction of intra-kernel computations. This pruning also accounts for its effect on memory access patterns to minimize the additional overhead caused by cache misses. We evaluated our implementation on commonly used computer vision models and popular real-world mobile devices. Compared with current mainstream solutions, our implementation reduces average inference latency, and it can be easily ported to other popular mobile devices.
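To make the workflow described in the abstract concrete, the following Python sketch illustrates (under stated assumptions, not the authors' actual implementation) the three steps PoI describes: picking a quantization bit-width from the device's capabilities, mapping a user-specified accuracy-loss tolerance to a pruning ratio, and pruning intra-kernel computation in contiguous blocks so the remaining memory accesses stay cache-friendly. All names (DEVICE_BITWIDTHS, select_bitwidth, select_pruning_ratio, pruned_matmul) and the linear tolerance-to-ratio mapping are hypothetical placeholders introduced for illustration.

import numpy as np

# Hypothetical device capability table; a real framework would query the
# hardware/compiler (e.g. for low-precision SIMD support) at model-load time.
DEVICE_BITWIDTHS = {"cpu-armv8": [8, 16, 32], "cpu-armv7": [16, 32]}

def select_bitwidth(device: str) -> int:
    """Pick the lowest quantization bit-width the device supports."""
    return min(DEVICE_BITWIDTHS.get(device, [32]))

def select_pruning_ratio(accuracy_tolerance: float) -> float:
    """Map an accuracy-loss tolerance to a pruning ratio.
    The linear mapping here is a placeholder; the paper chooses the ratio
    based on the user-specified tolerance, not this formula."""
    return min(0.9, accuracy_tolerance * 10.0)

def pruned_matmul(a: np.ndarray, b: np.ndarray, ratio: float,
                  block: int = 16) -> np.ndarray:
    """Dense matmul that skips a fraction of the reduction dimension.

    Whole contiguous blocks of `block` elements are skipped rather than
    scattered individual elements, so the surviving memory accesses remain
    sequential and cache-friendly -- a simplified stand-in for the
    memory-access-aware intra-kernel pruning the abstract describes.
    """
    k = a.shape[1]
    n_blocks = max(1, k // block)
    keep_blocks = max(1, int(round(n_blocks * (1.0 - ratio))))
    # Keep the leading blocks for simplicity; a real implementation would
    # rank blocks by importance before dropping any.
    keep = keep_blocks * block
    return a[:, :keep] @ b[:keep, :]

if __name__ == "__main__":
    bits = select_bitwidth("cpu-armv8")
    ratio = select_pruning_ratio(accuracy_tolerance=0.03)
    a, b = np.random.rand(64, 256), np.random.rand(256, 64)
    approx = pruned_matmul(a, b, ratio)
    exact = a @ b
    rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
    print(f"bit-width={bits}, pruning ratio={ratio:.2f}, rel. error={rel_err:.3f}")

Block-granular skipping is used here only to show why pruning decisions and memory layout interact: dropping whole contiguous blocks preserves sequential access and avoids the cache misses that element-wise sparsity would introduce.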
Pages: 26-33
Page count: 8