DIFFERENTIABLE RESOLUTION COMPRESSION AND ALIGNMENT FOR EFFICIENT VIDEO CLASSIFICATION AND RETRIEVAL

Times cited: 2
Authors
Deng, Rui [1]
Wu, Qian [1]
Li, Yuke [1]
Fu, Haoran [2]
Affiliations
[1] NetEase Yidun AI Lab, Hangzhou, People's Republic of China
[2] Zhejiang University, Hangzhou, People's Republic of China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024 | 2024
Keywords
Dynamic Video Inference
DOI
10.1109/ICASSP48485.2024.10446442
Abstract
Optimizing video inference efficiency has become increasingly important with the growing demand for video analysis in various fields. Some existing methods achieve high efficiency by explicitly discarding spatial or temporal information, which poses challenges in fast-changing and fine-grained scenarios. To address these issues, we propose an efficient video representation network with Differentiable Resolution Compression and Alignment, which compresses non-essential information in the early stage of the network to reduce computational costs while maintaining consistent temporal correlations. Specifically, we leverage a Differentiable Context-aware Compression Module to encode the saliency and non-saliency frame features, refining and updating the features into a high-low resolution video sequence. To process the new sequence, we introduce a new Resolution-Align Transformer Layer to capture global temporal correlations among frame features with different resolutions, while reducing spatial computation costs quadratically by utilizing fewer spatial tokens in low-resolution non-saliency frames. The entire network can be optimized end-to-end through the integration of the differentiable compression module. Experimental results show that our method achieves the best trade-off between efficiency and performance on near-duplicate video retrieval and competitive results on dynamic video classification compared to state-of-the-art methods. Code: https://github.com/dun-research/DRCA
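The quadratic saving mentioned in the abstract can be illustrated with simple token arithmetic. The sketch below is not the authors' implementation; it only counts pairwise token interactions in per-frame spatial self-attention. The frame count, number of salient frames, resolutions, and patch size are all assumed values chosen for illustration, and the helper names are hypothetical.

```python
def spatial_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for one frame."""
    return (height // patch) * (width // patch)


def attention_pairs(num_tokens):
    """Pairwise token interactions in one self-attention pass (quadratic in tokens)."""
    return num_tokens * num_tokens


def spatial_attention_cost(num_salient, num_non_salient, high_res, low_res, patch):
    """Total per-frame spatial attention pairs for a mixed high/low resolution clip."""
    high = attention_pairs(spatial_tokens(high_res, high_res, patch))
    low = attention_pairs(spatial_tokens(low_res, low_res, patch))
    return num_salient * high + num_non_salient * low


# Assumed setup (illustrative only): 8 frames, 2 kept at 224x224,
# 6 compressed to 112x112, patch size 16.
baseline = spatial_attention_cost(8, 0, 224, 112, 16)  # all frames at high resolution
mixed = spatial_attention_cost(2, 6, 224, 112, 16)     # high-low resolution sequence

print(baseline, mixed, round(mixed / baseline, 3))
```

Under these assumptions, halving the side length of a frame cuts its token count to one quarter and its spatial attention pairs to one sixteenth, which is why compressing non-salient frames dominates the saving.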
Pages: 3200-3204
Page count: 5