CU Partition Mode Decision for HEVC Hardwired Intra Encoder Using Convolution Neural Network

Cited by: 151
Authors
Liu, Zhenyu [1]
Yu, Xianyu [2]
Gao, Yuan [3]
Chen, Shaolin [5]
Ji, Xiangyang [4]
Wang, Dongsheng [1]
Affiliations
[1] Tsinghua Univ, Tsinghua Natl Lab Informat Sci & Technol, Res Inst Informat Technol, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Inst Microelect, Beijing 100084, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci, Beijing 100084, Peoples R China
[4] Tsinghua Univ, Dept Automat, Beijing 100084, Peoples R China
[5] Huawei Technol Co Ltd, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
HEVC; fast CU/PU mode decision; CNN; VLSI; intra encoding; RATE-DISTORTION OPTIMIZATION; SIZE DECISION; ARCHITECTURE DESIGN; EFFICIENCY; ALGORITHM;
DOI
10.1109/TIP.2016.2601264
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The intensive computation of High Efficiency Video Coding (HEVC) poses challenges for hardwired encoders in terms of hardware overhead and power dissipation. Moreover, the constraints of hardwired encoder design seriously degrade the efficiency of software-oriented fast coding unit (CU) partition mode decision algorithms. A fast algorithm is considered VLSI friendly when it has the following properties. First, it reduces the maximum complexity of encoding a coding tree unit (CTU). Second, it does not degrade the parallelism of the hardwired encoder. Third, its processing engine has low hardware and power overhead. In this paper, we devise a convolutional neural network based fast algorithm that removes no fewer than two CU partition modes in each CTU from full rate-distortion optimization (RDO) processing, thereby reducing the encoder's hardware complexity. Because our algorithm does not depend on correlations among CU depths or spatially neighboring CUs, it is friendly to parallel processing and does not disturb the rhythm of RDO pipelining. Experiments show that an average of 61.1% of intra-encoding time was saved, while the Bjøntegaard-Delta bit-rate increase is 2.67%. Capitalizing on the optimal arithmetic representation, we developed a high-speed [714 MHz under worst-case conditions (125 °C, 0.9 V)] and low-cost (42.5k gates) accelerator for our fast algorithm using TSMC 65-nm CMOS technology. One accelerator can support real-time HD1080p encoding at 55 frames/s. The corresponding power dissipation is 16.2 mW at 714 MHz. Finally, our accelerator scales well: four accelerators fulfill the throughput requirements of UltraHD-4K at 55 frames/s.
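The scaling claim at the end of the abstract can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: it assumes 64x64 CTUs (the largest size HEVC permits; the abstract does not state the CTU size), and the function names are hypothetical. The result is an upper bound on the per-CTU cycle budget implied by the quoted clock rate and frame rate, not the accelerator's actual cycle count.

```python
import math

# Figures quoted in the abstract.
CLOCK_HZ = 714e6   # worst-case accelerator clock frequency
FPS = 55           # target real-time frame rate

# Assumption (not stated in the abstract): 64x64 CTUs.
CTU_SIZE = 64


def ctus_per_frame(width, height, ctu=CTU_SIZE):
    """Number of CTUs covering a frame; partial CTUs at the edges count as whole."""
    return math.ceil(width / ctu) * math.ceil(height / ctu)


def cycle_budget_per_ctu(width, height, accelerators=1):
    """Upper bound on clock cycles available per CTU at the target frame rate."""
    ctus_per_second = ctus_per_frame(width, height) * FPS
    return CLOCK_HZ * accelerators / ctus_per_second


hd_budget = cycle_budget_per_ctu(1920, 1080, accelerators=1)
uhd_budget = cycle_budget_per_ctu(3840, 2160, accelerators=4)

print("CTUs per 1080p frame:", ctus_per_frame(1920, 1080))
print("CTUs per 4K frame:   ", ctus_per_frame(3840, 2160))
print("Cycle budget per CTU, 1 accelerator at 1080p:", round(hd_budget))
print("Cycle budget per CTU, 4 accelerators at 4K:  ", round(uhd_budget))
```

Under this assumption a 4K frame contains exactly four times as many CTUs as a 1080p frame, so four accelerators face the same per-CTU budget as one accelerator at 1080p, which is consistent with the abstract's scalability statement.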
Pages: 5088-5103
Number of pages: 16