Multi-granularity transformer fusion for temporal action localization

Cited by: 0
Authors
Zhang M. [1]
Hu H. [2]
Li Z. [2]
Affiliations
[1] Department of Design and Art, Zhejiang Industry Polytechnic College, Shaoxing
[2] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multi-granularity fusion; Temporal action localization; Transformer;
DOI
10.1007/s00500-024-09955-x
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Temporal action localization plays a significant role in video understanding; it aims to recognize the action category as well as the temporal interval of each action in untrimmed videos. Most previous transformer-based methods employ a feature space of a single temporal granularity. However, low-level temporal features cannot provide enough semantic information for action recognition, while high-level temporal features lack the rich detail needed for boundary localization. To address this issue, we propose a multi-granularity transformer fusion framework (MGTF) to localize temporal actions in videos. Specifically, MGTF builds a transformer-based multi-granularity feature fusion pipeline and uses a direct set prediction strategy to generate action instances. Through top-down cross-granularity attention interaction, low-level boundary details and high-level semantic information are combined to improve feature discrimination. To reduce the computation cost, we design a temporal shift attention that adaptively focuses on a sparse set of key segments. In addition, an actionness regression head is used to refine the confidence scores of the candidate instances. As a self-contained system, MGTF achieves state-of-the-art performance on THUMOS'14 and comparable performance on ActivityNet-1.3. Ablation studies and qualitative visualizations further demonstrate the effectiveness of the proposed approach. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
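Since the record gives only a prose description of the method, the following Python sketch illustrates the kind of top-down cross-granularity attention interaction the abstract describes: fine-granularity segments query coarse-granularity features through standard cross-attention. It is a minimal illustration, not the authors' implementation; the module name CrossGranularityFusion, the feature dimensions, and the residual design are assumptions made for this example.

# Minimal, hypothetical sketch of top-down cross-granularity attention fusion.
# Not the authors' released code; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class CrossGranularityFusion(nn.Module):
    """Fuse high-level (coarse) temporal features into low-level (fine)
    features via cross-attention, with the fine-granularity tokens as queries."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, T_fine, dim)   low-level features carrying boundary detail
        # coarse: (B, T_coarse, dim) high-level features carrying semantic context
        fused, _ = self.cross_attn(query=fine, key=coarse, value=coarse)
        # Residual connection keeps the fine-grained boundary information intact.
        return self.norm(fine + fused)

if __name__ == "__main__":
    fusion = CrossGranularityFusion()
    fine = torch.randn(2, 256, 256)   # e.g. 256 fine-grained temporal segments
    coarse = torch.randn(2, 64, 256)  # e.g. 64 coarse temporal segments
    print(fusion(fine, coarse).shape)  # torch.Size([2, 256, 256])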
Pages: 12377-12388
Number of pages: 11