Multi-granularity transformer fusion for temporal action localization

Cited by: 0
Authors
Zhang M. [1]
Hu H. [2]
Li Z. [2]
Affiliations
[1] Department of Design and Art, Zhejiang Industry Polytechnic College, Shaoxing
[2] School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multi-granularity fusion; Temporal action localization; Transformer;
DOI
10.1007/s00500-024-09955-x
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Temporal action localization plays a significant role in video understanding; it aims to recognize the action category as well as the temporal interval of each action in untrimmed videos. Most previous transformer-based methods employ a feature space of a single temporal granularity. However, low-level temporal features cannot provide enough semantic information for action recognition, while high-level temporal features lack the rich detail needed for boundary localization. To address this issue, we propose a multi-granularity transformer fusion framework (MGTF) to localize temporal actions in videos. Specifically, MGTF builds a transformer-based multi-granularity feature fusion pipeline and uses a direct set prediction strategy to generate action instances. Through top-down cross-granularity attention interaction, low-level boundary details and high-level semantic information are combined to improve feature discrimination. To reduce the computation cost, we design a temporal shift attention that adaptively focuses on a sparse set of key segments. In addition, an actionness regression head is used to refine the confidence scores of the candidate instances. As a self-contained system, MGTF achieves state-of-the-art performance on THUMOS'14 and comparable performance on ActivityNet-1.3. Ablation studies and qualitative visualizations further demonstrate the effectiveness of the proposed approach. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
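Since the record gives only a prose description of the method, the following Python sketch illustrates the kind of top-down cross-granularity attention interaction the abstract describes: fine-granularity segments query coarse-granularity features through standard cross-attention. It is a minimal illustration, not the authors' implementation; the module name CrossGranularityFusion, the feature dimensions, and the residual design are assumptions made for this example.

# Minimal, hypothetical sketch of top-down cross-granularity attention fusion.
# Not the authors' released code; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class CrossGranularityFusion(nn.Module):
    """Fuse high-level (coarse) temporal features into low-level (fine)
    features via cross-attention, with the fine-granularity tokens as queries."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine:   (B, T_fine, dim)   low-level features carrying boundary detail
        # coarse: (B, T_coarse, dim) high-level features carrying semantic context
        fused, _ = self.cross_attn(query=fine, key=coarse, value=coarse)
        # Residual connection keeps the fine-grained boundary information intact.
        return self.norm(fine + fused)

if __name__ == "__main__":
    fusion = CrossGranularityFusion()
    fine = torch.randn(2, 256, 256)   # e.g. 256 fine-grained temporal segments
    coarse = torch.randn(2, 64, 256)  # e.g. 64 coarse temporal segments
    print(fusion(fine, coarse).shape)  # torch.Size([2, 256, 256])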
Pages: 12377-12388
Number of pages: 11