Integration of Global and Local Knowledge for Foreground Enhancing in Weakly Supervised Temporal Action Localization

Cited by: 3

Authors:

Zhang, Tianyi [1]

Li, Ronglu [2]

Feng, Pengming [3]

Zhang, Rubo [2]

Affiliations:

[1] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China

[2] Dalian Minzu Univ, Coll Mech & Elect Engn, Dalian 116600, Peoples R China

[3] CAST, State Key Lab Space Ground Integrated Informat Tec, Beijing 100095, Peoples R China

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2024 / Vol. 26

Funding:

National Natural Science Foundation of China;

Keywords:

Weakly supervised learning; temporal action localization; video content analysis; EVENT DETECTION;

DOI:

10.1109/TMM.2024.3379887

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Weakly Supervised Temporal Action Localization (WTAL) aims to identify the temporal extent of actions and classify their categories using only video-level labels during training. Motivated by the intuition that attention maps generated from different views help enhance the foreground action segments, in this paper we propose a WTAL pipeline based on a novel attention mechanism that integrates global and local knowledge. The attention mechanism consists mainly of a global attention branch and a local attention branch. Specifically, the global attention branch is built on inter-segment similarity to sparsely mine correlation knowledge across the entire video, while the local attention branch is built on a convolutional structure to densely aggregate information within a fixed local receptive field. Experiments on the THUMOS14 and ActivityNet v1.3 datasets demonstrate the effectiveness of the proposed WTAL pipeline compared to state-of-the-art methods.
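The two branches described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the cosine similarity, the top-k sparsification, the softmax weighting, the sigmoid scoring, the averaging used for fusion, and every name here (`global_attention`, `local_attention`, `fuse`, `top_k`, `w`, `kernel`) are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def global_attention(feats, w, top_k=3):
    """Sparse, similarity-based attention over the whole video (assumed form).

    feats: (T, D) segment features; w: (D,) hypothetical scoring vector.
    Requires top_k <= T.
    """
    # Cosine similarity between every pair of segments: (T, T).
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    # Keep the top_k most similar segments per row (ties may keep more).
    kth = np.sort(sim, axis=1)[:, -top_k][:, None]
    masked = np.where(sim >= kth, sim, -np.inf)
    # Softmax over the surviving entries of each row.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    # Aggregate features of the selected segments, then score each segment.
    context = weights @ feats
    return sigmoid(context @ w)


def local_attention(feats, kernel):
    """Dense aggregation in a fixed temporal window via a 1-D convolution.

    kernel: (k, D) hypothetical filter with odd k; zero padding keeps length T.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)))
    scores = np.array(
        [np.sum(padded[t:t + k] * kernel) for t in range(feats.shape[0])]
    )
    return sigmoid(scores)


def fuse(global_scores, local_scores):
    """Assumed fusion: plain average of the two attention maps."""
    return 0.5 * (global_scores + local_scores)
```

With random inputs of shape (T, D), each function returns one foreground score in [0, 1] per segment. In a trained model the scoring vector and the kernel would be learned; here they are placeholders.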

Citation

Pages: 8476-8487

Number of pages: 12
