Relational graph neural network for situation recognition

Cited by: 13
Authors
Jing, Ya [1 ,2 ,3 ]
Wang, Junbo [1 ,2 ,3 ,4 ]
Wang, Wei [1 ,2 ,3 ]
Wang, Liang [1 ,2 ,3 ]
Tan, Tieniu [1 ,2 ,3 ]
Affiliations
[1] CASIA, Ctr Res Intelligent Percept & Comp, Beijing 100190, Peoples R China
[2] CASIA, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[4] Tencent Games, Shenzhen, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Situation recognition; Relationship modeling; Graph neural network; Reinforcement learning;
DOI
10.1016/j.patcog.2020.107544
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Situation recognition has recently gained great attention as a new, challenging task for image understanding: it requires simultaneously predicting the main activity (verb) and its associated objects (noun entities) in a structured, detailed way. Several methods have been proposed for this task, but they usually cannot effectively model the relationships between the activity and the objects. In this paper, we propose a Relational Graph Neural Network (RGNN) for situation recognition, which builds a neural graph over the activity and the objects and models the triplet relationships between the activity and pairs of objects through message passing between graph nodes. Moreover, we propose a two-stage training strategy to optimize the model. Progressive supervised learning is first adopted to obtain initial predictions for the activity and the objects. These initial predictions are then refined with a policy-gradient method that directly optimizes the non-differentiable value-all metric. To verify the effectiveness of our method, we perform extensive experiments on the imSitu dataset, currently the only available dataset for situation recognition. Experimental results show that our approach outperforms state-of-the-art methods on the verb and value metrics and better captures the relationships between the activity and the objects. (C) 2020 Elsevier Ltd. All rights reserved.
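The abstract describes message passing over a graph whose nodes are the activity (verb) and its objects (nouns). The following is a minimal illustrative sketch only, not the authors' architecture: the fully connected adjacency, weight shapes, and tanh update rule are assumptions chosen to show how neighbor states could be aggregated and used to update each node.

```python
import numpy as np

def message_passing(node_states, adjacency, W_msg, W_upd, steps=2):
    """GNN-style message passing over a small situation graph.

    node_states: (N, D) array; row 0 is the verb node, rows 1..N-1 are noun nodes.
    adjacency:   (N, N) 0/1 matrix (here fully connected, no self-loops).
    W_msg, W_upd: (D, D) weight matrices (hypothetical parameterization).
    """
    h = node_states
    for _ in range(steps):
        messages = adjacency @ (h @ W_msg)   # sum transformed neighbor states
        h = np.tanh(h @ W_upd + messages)    # update each node with its messages
    return h

# Tiny example: 1 verb node + 2 noun nodes, 4-dimensional states.
rng = np.random.default_rng(0)
N, D = 3, 4
h0 = rng.normal(size=(N, D))
A = np.ones((N, N)) - np.eye(N)              # fully connected, no self-loops
W_msg = rng.normal(scale=0.1, size=(D, D))
W_upd = rng.normal(scale=0.1, size=(D, D))
h = message_passing(h0, A, W_msg, W_upd)
print(h.shape)  # (3, 4)
```

After the final step, the verb node's state could be fed to a verb classifier and each noun node's state to a noun classifier, which matches the joint prediction the abstract describes at a high level.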
Pages: 11