Multimodal graph inference network for scene graph generation

Cited by: 4

Authors:

Duan, Jingwen [1]

Min, Weidong [2,3]

Lin, Deyu [2]

Xu, Jianfeng [2]

Xiong, Xin [1]

Affiliations:

[1] Nanchang Univ, Sch Informat Engn, Nanchang 330031, Jiangxi, Peoples R China

[2] Nanchang Univ, Sch Software, Nanchang 330047, Jiangxi, Peoples R China

[3] Jiangxi Key Lab Smart City, Nanchang 330047, Jiangxi, Peoples R China

Source:

APPLIED INTELLIGENCE | 2021, Vol. 51, Issue 12

Funding:

National Natural Science Foundation of China

Keywords:

Scene graph generation; Visual relationship detection; Image understanding; Semantic analysis;

DOI:

10.1007/s10489-021-02304-7

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

A scene graph describes an image concisely and structurally. However, existing scene graph generation methods are weak at inferring certain relationships because they lack semantic information and depend heavily on the statistical distribution of the training set. To alleviate these problems, this study proposes a Multimodal Graph Inference Network (MGIN) consisting of two modules: Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI). MGIN improves triplet inference, especially for uncommon samples. In the MIE module, prior statistical knowledge of the training set is incorporated into the network through reprocessing to reduce overfitting to the training set. Visual and semantic features are extracted in the MIE module and fused into unified multimodal features in the TMFI module. These features help the inference module improve the prediction capability of MGIN, especially for uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, improvements of 0.48% and 0.50%, respectively, over state-of-the-art methods. It also increases the average recall of the 20 relationships with the lowest probability by 4.91%.
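The abstract reports both recall and mean recall. As a rough illustration of why the two can differ on rare relationships, the sketch below uses the definitions common in the scene graph literature: recall counts hits over all ground-truth triplets, while mean recall averages per-predicate recall so that rare predicates weigh as much as frequent ones. The function name, the toy triplets, and the exact evaluation protocol are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_triplets, predicted_triplets):
    """Return (recall, mean_recall, per_predicate_recall).

    gt_triplets: list of (subject, predicate, object) ground-truth tuples.
    predicted_triplets: set of predicted tuples (e.g. the top-K predictions).
    Hypothetical helper, not the paper's evaluation code.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triplet in gt_triplets:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in predicted_triplets:
            hits[predicate] += 1

    # Recall: fraction of all ground-truth triplets that were recovered.
    recall = sum(hits.values()) / sum(totals.values())

    # Mean recall: average of the per-predicate recalls.
    per_predicate = {p: hits[p] / totals[p] for p in totals}
    mean_recall = sum(per_predicate.values()) / len(per_predicate)
    return recall, mean_recall, per_predicate


# Toy data: "on" is frequent, "riding" is rare and is missed by the predictor.
gt = [
    ("cup", "on", "table"),
    ("book", "on", "table"),
    ("lamp", "on", "desk"),
    ("plate", "on", "table"),
    ("man", "riding", "horse"),
]
predicted = {
    ("cup", "on", "table"),
    ("book", "on", "table"),
    ("lamp", "on", "desk"),
    ("man", "near", "horse"),
}

recall, mean_recall, per_predicate = recall_and_mean_recall(gt, predicted)
print(recall, mean_recall, per_predicate)
```

In this toy case, missing the single rare "riding" triplet costs little in plain recall but halves the mean recall, which is why mean recall is often reported alongside recall when uncommon relationships matter.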


Pages: 8768-8783

Page count: 16

