VNVC: A Versatile Neural Video Coding Framework for Efficient Human-Machine Vision

Cited by: 9
Authors
Sheng, Xihua [1]
Li, Li [1]
Liu, Dong [1]
Li, Houqiang [1]
Affiliations
[1] Univ Sci & Technol China, CAS Key Lab Technol Geospatial Informat Proc & App, Hefei 230027, Peoples R China
Keywords
Streaming media; Task analysis; Machine vision; Image reconstruction; Video coding; Decoding; Video codecs; Deep neural network; human and machine vision; neural video coding; video enhancement; video analysis;
DOI
10.1109/TPAMI.2024.3356548
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Almost all digital videos are coded into compact representations before being transmitted. Such compact representations need to be decoded back to pixels before being displayed to humans and - as usual - before being enhanced/analyzed by machine vision algorithms. Intuitively, it is more efficient to enhance/analyze the coded representations directly without decoding them into pixels. Therefore, we propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis, thereby being versatile for both human and machine vision. Our VNVC framework has a feature-based compression loop. In the loop, one frame is encoded into compact representations and decoded to an intermediate feature that is obtained before performing reconstruction. The intermediate feature can be used as reference in motion compensation and motion estimation through feature-based temporal context mining and cross-domain motion encoder-decoder to compress the following frames. The intermediate feature is directly fed into video reconstruction, video enhancement, and video analysis networks to evaluate its effectiveness. The evaluation shows that our framework with the intermediate feature achieves high compression efficiency for video reconstruction and satisfactory task performances with lower complexities.
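The feature-based compression loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned networks are replaced by trivial arithmetic stand-ins, motion estimation and compensation are omitted entirely, and every name here (`STEP`, `extract_feature`, `encode`, `decode_feature`, `reconstruct`, `analyze`, `code_sequence`) is hypothetical. It only shows the data flow: each frame is coded against the previous decoded feature, and that same decoded feature feeds both a reconstruction head and an analysis head without going through pixels first.

```python
from typing import Dict, List

Feature = List[float]

STEP = 0.5  # hypothetical quantization step, chosen only for illustration


def extract_feature(frame: List[float]) -> Feature:
    """Toy stand-in for a learned transform from pixels to a feature."""
    return [2.0 * p for p in frame]


def encode(frame: List[float], ref: Feature) -> List[int]:
    """Compact representation: quantized residual against the reference feature."""
    feat = extract_feature(frame)
    return [round((f - r) / STEP) for f, r in zip(feat, ref)]


def decode_feature(symbols: List[int], ref: Feature) -> Feature:
    """Decode to the intermediate feature (no pixels produced yet)."""
    return [r + s * STEP for r, s in zip(ref, symbols)]


def reconstruct(feature: Feature) -> List[float]:
    """Toy reconstruction head: intermediate feature to pixels."""
    return [f / 2.0 for f in feature]


def analyze(feature: Feature) -> float:
    """Toy analysis head that consumes the feature directly (mean activation)."""
    return sum(feature) / len(feature)


def code_sequence(frames: List[List[float]]) -> List[Dict]:
    """Run the loop: the decoded feature of one frame is the next reference."""
    ref: Feature = [0.0] * len(frames[0])
    outputs = []
    for frame in frames:
        symbols = encode(frame, ref)
        feature = decode_feature(symbols, ref)
        outputs.append({
            "symbols": symbols,
            "pixels": reconstruct(feature),
            "score": analyze(feature),
        })
        ref = feature  # the feature, not the pixels, closes the loop
    return outputs


if __name__ == "__main__":
    for out in code_sequence([[1.0, 2.0], [1.1, 2.1], [1.6, 2.4]]):
        print(out)
```

Because the encoder predicts from the decoded feature rather than the original one, quantization error does not accumulate across frames in this sketch; the per-pixel reconstruction error stays within `STEP / 4`.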
Pages: 4579-4596
Page count: 18