Neural Video Compression with Spatio-Temporal Cross-Covariance Transformers

Cited by: 3
Authors
Chen, Zhenghao [1, 3]
Relic, Lucas [2]
Azevedo, Roberto [3]
Zhang, Yang [3]
Gross, Markus [2]
Xu, Dong [4]
Zhou, Luping [1]
Schroers, Christopher [3]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
[2] Swiss Fed Inst Technol, Zurich, Switzerland
[3] DisneyRes Studios, Zurich, Switzerland
[4] Univ Hong Kong, Hong Kong, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
Video compression; neural network; transformer;
DOI
10.1145/3581783.3611960
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although existing neural video compression (NVC) methods have achieved significant success, most of them focus on improving either temporal or spatial information separately. They generally use simple operations such as concatenation or subtraction to utilize this information, but such operations only partially exploit spatio-temporal redundancies. This work aims to effectively and jointly leverage robust temporal and spatial information by proposing a new 3D-based transformer module: Spatio-Temporal Cross-Covariance Transformer (ST-XCT). The ST-XCT module combines two individually extracted features into a joint spatio-temporal feature, followed by 3D convolutional operations and a novel spatio-temporal-aware cross-covariance attention mechanism. Unlike conventional transformers, the cross-covariance attention mechanism is applied across the feature channels without breaking down the spatio-temporal features into local tokens. This design allows for modeling global cross-channel correlations of the spatio-temporal context while lowering the computational requirement. Based on ST-XCT, we introduce a novel transformer-based end-to-end optimized NVC framework. ST-XCT-based modules are integrated into various key coding components of NVC, such as feature extraction, frame reconstruction, and entropy modeling, demonstrating its generalizability. Extensive experiments show that our ST-XCT-based NVC proposal achieves state-of-the-art compression performance on various standard video benchmark datasets.
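The channel-wise attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the generic cross-covariance attention form (L2-normalized queries and keys, a softmax over a C x C channel map), uses the input feature itself as query, key, and value in place of learned projections, and omits the 3D convolutions and the feature-combination step. All names (`flatten_feature`, `cross_covariance_attention`, `tau`) are hypothetical.

```python
import math


def l2_normalize(row, eps=1e-12):
    """Scale one channel row to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / (norm + eps) for x in row]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_feature(feat):
    """Flatten a [C][T][H][W] feature into [C][T*H*W] (one row per channel)."""
    return [[x for frame in ch for line in frame for x in line] for ch in feat]


def cross_covariance_attention(q, k, v, tau=1.0):
    """Attend across channels: q, k, v are C rows of length N = T*H*W.

    Returns the C x C attention map and the C x N output.
    tau is an assumed temperature parameter.
    """
    q_hat = [l2_normalize(r) for r in q]
    k_hat = [l2_normalize(r) for r in k]
    # Each entry compares two whole channels over every spatio-temporal position.
    attn = [
        softmax([sum(a * b for a, b in zip(qi, kj)) / tau for kj in k_hat])
        for qi in q_hat
    ]
    n = len(v[0])
    out = [
        [sum(attn[i][j] * v[j][p] for j in range(len(v))) for p in range(n)]
        for i in range(len(attn))
    ]
    return attn, out


# Toy spatio-temporal feature: C=3 channels, T=2 frames, H=W=2.
feat = [
    [[[float((c + 1) * (t + 1) + h * w) for w in range(2)] for h in range(2)]
     for t in range(2)]
    for c in range(3)
]
x = flatten_feature(feat)
# Identity projections (assumption): the feature serves as q, k, and v.
attn, out = cross_covariance_attention(x, x, x)
```

The attention map here has size C x C regardless of the number of spatio-temporal positions N, so the cost of this sketch grows linearly in N, whereas token-to-token attention builds an N x N map.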
Pages: 8543-8551
Number of pages: 9