Atomic-action-based Contrastive Network for Weakly Supervised Temporal Language Grounding

Cited by: 4
Authors
Wu, Hongzhou [1 ]
Lyu, Yifan [2 ]
Shen, Xingyu [1 ]
Zhao, Xuechen [3 ]
Wang, Mengzhu [1 ]
Zhang, Xiang [4 ,5 ]
Luo, Zhigang [1 ]
Affiliations
[1] Natl Univ Def Technol, Parallel & Distributed Proc Lab, Changsha, Peoples R China
[2] Univ Chinese Acad Sci, Inst Software, Chinese Acad Sci, Beijing, Peoples R China
[3] Natl Univ Def Technol, Sch Comp, Changsha, Peoples R China
[4] Natl Univ Def Technol, Inst Quantum Informat, Changsha, Peoples R China
[5] Natl Univ Def Technol, State Key Lab High Performance Comp, Changsha, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Keywords
weakly supervised temporal language grounding; cross-modal interaction; contrastive learning; atomic action; discriminative word;
DOI
10.1109/ICME55011.2023.00263
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As is well known, an event often consists of several actions, each of which is atomic. Inspired by this insight, we propose a novel framework, the Atomic-action-based Contrastive Network (ACN), for the weakly supervised temporal language grounding task, which localizes the query-related event moment in an untrimmed video without access to any temporal annotations. Specifically, ACN first determines the accurate moment boundary of each action in a query-agnostic way. This fully exploits homogeneous visual cues while preventing the heterogeneity of the query from corrupting the atomicity of each visual action, i.e., its action boundary. To effectively localize the query-related event, we identify the discriminative words in the given query and explore a composite-grained contrastive module that retrieves the corresponding atomic actions in a common latent space shared across modalities. This boosts the feature discrimination of the visual event segment and suppresses irrelevant action segments. Experiments on two popular datasets show the efficacy of our model.
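The record gives only a high-level description of the contrastive module. As a minimal sketch, not the authors' implementation, the snippet below illustrates the kind of cross-modal contrastive objective the abstract alludes to: pooled atomic-action segment features and pooled discriminative-word features are projected into a common latent space and trained with a symmetric InfoNCE-style loss. All class and variable names, feature dimensions, and the temperature value are assumptions for illustration only.

```python
# Illustrative sketch only (not the ACN implementation): a symmetric
# InfoNCE-style contrastive loss between atomic-action segment features
# and discriminative-word features in a shared latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastiveLoss(nn.Module):
    def __init__(self, visual_dim=1024, text_dim=300, embed_dim=256, temperature=0.07):
        super().__init__()
        # Separate projections map each modality into the common latent space.
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, action_feats, word_feats):
        # action_feats: (B, visual_dim) pooled features of atomic-action segments
        # word_feats:   (B, text_dim)   pooled features of discriminative query words
        v = F.normalize(self.visual_proj(action_feats), dim=-1)
        t = F.normalize(self.text_proj(word_feats), dim=-1)
        logits = v @ t.t() / self.temperature  # (B, B) cross-modal similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        # Matched (segment, query) pairs lie on the diagonal; contrast both directions.
        loss_v2t = F.cross_entropy(logits, targets)
        loss_t2v = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    loss_fn = CrossModalContrastiveLoss()
    actions = torch.randn(8, 1024)  # e.g. pooled I3D segment features (assumed)
    words = torch.randn(8, 300)     # e.g. pooled GloVe word features (assumed)
    print(loss_fn(actions, words).item())
```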
Pages: 1523-1528
Number of pages: 6