Exploiting semantic-level affinities with a mask-guided network for temporal action proposal in videos

Cited: 1
Authors
Yang, Yu [1 ]
Wang, Mengmeng [1 ]
Mei, Jianbiao [1 ]
Liu, Yong [1 ]
Affiliations
[1] Zhejiang Univ, Inst Cyber Syst & Control, Hangzhou 310027, Peoples R China
Keywords
Temporal action proposal generation; Temporal action localization; Attention; Transformer
DOI
10.1007/s10489-022-04261-1
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Temporal action proposal (TAP) aims to detect the starting and ending times of action instances in untrimmed videos, which is fundamental and critical for large-scale video analysis and human action understanding. The main challenge of temporal action proposal lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global-level methods fail to focus on action frames and are prone to background distractions. In this paper, we propose that learning semantic-level affinities captures more useful information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn to discriminate themselves from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates a foreground mask representing the locations of action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that exploits the foreground mask to guide the self-attention mechanism to focus on, and compute semantic affinities with, the foreground frames. Finally, the two modules are jointly optimized in a unified framework. MGNet models the intra-semantic similarities of foregrounds, extracting supportive action cues for boundary refinement; it also builds the inter-semantic distances of backgrounds, providing the semantic gaps needed to suppress false positives and distractions. Extensive experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, demonstrate that our method achieves superior performance.
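The abstract describes the mask-guided mechanism only at a high level. As a minimal sketch of the general idea, assuming (as an illustration, not the authors' implementation) that the predicted foreground mask acts as an additive log-probability bias on the self-attention logits so that queries attend mainly to foreground frames, one could write in PyTorch:

# Minimal sketch of mask-guided self-attention in the spirit of MGNet.
# All module names, layer sizes, and the log-mask bias are illustrative
# assumptions; the actual FMG/MGT designs in the paper differ in detail.
import torch
import torch.nn as nn

class ForegroundMaskGenerator(nn.Module):
    """Hypothetical FMG stand-in: predicts a per-frame foreground probability."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(dim, 1, kernel_size=1),
        )

    def forward(self, x):                                  # x: (B, T, D)
        logits = self.score(x.transpose(1, 2)).squeeze(1)  # (B, T)
        return torch.sigmoid(logits)                       # foreground prob per frame

class MaskGuidedAttention(nn.Module):
    """Self-attention whose logits are biased toward foreground key frames."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, fg_mask):                         # fg_mask: (B, T) in [0, 1]
        B, T, _ = x.shape
        # log(mask) -> large negative bias on background keys, near 0 on foreground.
        bias = torch.log(fg_mask.clamp_min(1e-6))          # (B, T)
        # nn.MultiheadAttention accepts a float attn_mask of shape
        # (B * num_heads, T, T) that is added to the attention logits.
        attn_bias = bias[:, None, :].expand(B, T, T)
        attn_bias = attn_bias.repeat_interleave(self.attn.num_heads, dim=0)
        out, _ = self.attn(x, x, x, attn_mask=attn_bias)
        return out

# Usage: chain the two modules on a sequence of frame features.
# fmg, mga = ForegroundMaskGenerator(256), MaskGuidedAttention(256)
# x = torch.randn(2, 100, 256)       # 2 videos, 100 frames, 256-d features
# y = mga(x, fmg(x))

Soft masking via an additive bias (rather than hard zeroing of background positions) keeps the attention differentiable with respect to the mask, which is what allows the two modules to be trained jointly as the abstract describes.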
Pages: 15516-15536
Page count: 21