Exploiting semantic-level affinities with a mask-guided network for temporal action proposal in videos

Cited by: 0
Authors
Yu Yang
Mengmeng Wang
Jianbiao Mei
Yong Liu
Affiliations
[1] Zhejiang University, Institute of Cyber
Source
Applied Intelligence | 2023 / Volume 53
Keywords
Temporal action proposal generation; Temporal action localization; Attention; Transformer
DOI
Not available
Abstract
Temporal action proposal (TAP) generation aims to detect the starting and ending times of action instances in untrimmed videos, which is fundamental for large-scale video analysis and human action understanding. The main challenge of TAP lies in modeling representative temporal relations in long untrimmed videos. Existing state-of-the-art methods achieve temporal modeling by building local-level, proposal-level, or global-level temporal dependencies. Local methods lack a wide receptive field, while proposal-level and global-level methods lack focus on action frames and suffer from background distractions. In this paper, we propose that learning semantic-level affinities captures more useful information. Specifically, by modeling semantic associations between frames and action units, action segments (foregrounds) can aggregate supportive cues from other co-occurring actions, and non-action clips (backgrounds) can learn to discriminate themselves from action frames. To this end, we propose a novel framework named the Mask-Guided Network (MGNet) to build semantic-level temporal associations for the TAP task. First, we propose a Foreground Mask Generation (FMG) module that adaptively generates a foreground mask representing the locations of action units throughout the video. Second, we design a Mask-Guided Transformer (MGT) that exploits the foreground mask to guide the self-attention mechanism to focus on foreground frames when computing semantic affinities. Finally, these two modules are jointly explored in a unified framework. MGNet models intra-semantic similarities for foregrounds, extracting supportive action cues for boundary refinement; it also builds inter-semantic distances for backgrounds, providing semantic gaps that suppress false positives and distractions. Extensive experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, demonstrate that our method achieves superior performance.
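The mask-guided attention described in the abstract can be made concrete with a short sketch. The following is a minimal, illustrative Python (PyTorch) implementation, not the authors' code: it assumes a soft per-frame foreground mask in [0, 1], as an FMG-like module might produce, and adds its logarithm to the attention logits so every query attends mainly to likely-foreground key frames. All names here (mask_guided_attention, fg_mask) are hypothetical.

    import torch
    import torch.nn.functional as F

    def mask_guided_attention(x, fg_mask):
        """x: (B, T, D) frame features; fg_mask: (B, T) soft foreground probabilities."""
        d = x.size(-1)
        q = k = v = x  # a full model would apply learned Q/K/V projections
        scores = q @ k.transpose(-2, -1) / d ** 0.5       # (B, T, T) similarity logits
        bias = torch.log(fg_mask.clamp(min=1e-6))         # background keys -> large negative bias
        attn = F.softmax(scores + bias.unsqueeze(1), dim=-1)  # mask applied over the key axis
        return attn @ v                                   # aggregate cues from foreground frames

    x = torch.randn(2, 100, 256)   # e.g., 100 frames with 256-d features
    fg_mask = torch.rand(2, 100)   # stand-in for an FMG-style mask
    print(mask_guided_attention(x, fg_mask).shape)  # torch.Size([2, 100, 256])

Using the log of a soft mask as an additive bias, rather than a hard binary mask, keeps the operation differentiable in the mask, which is one plausible way the mask generator and the transformer could be trained jointly in a unified framework.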
Pages: 15516-15536
Page count: 20