Adaptive Visual Memory Network for Visual Dialog

Cited by: 0

Authors:

Zhao L. [1]

Gao L. [1]

Song J. [1]

Affiliations:

[1] School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu

Source:

Gao, Lianli (juana.alian@gmail.com) | 1600 (year) / Univ. of Electronic Science and Technology of China (volume) / 50 (issue)

Keywords:

Adaptive; Attention mechanism; Memory network; Visual dialog

DOI:

10.12178/1001-0548.2021057

Abstract:

A key challenge in visual dialog is visual co-reference resolution. This paper proposes an adaptive visual memory network (AVMN), which uses an external memory bank to store grounded visual information directly. The textual and visual grounding processes are integrated, which reduces the errors that each process could otherwise introduce. Moreover, in many cases the answer can be produced from the question and the image alone, and the dialog history then tends to introduce unnecessary errors, so the external visual memory is read adaptively. In addition, a residual queried image is fused with the attended memory. Experiments indicate that the proposed method outperforms recent approaches on the evaluation metrics. © 2021, Editorial Board of Journal of the University of Electronic Science and Technology of China. All rights reserved.
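The abstract gives no equations, so the following is only a minimal sketch of what an adaptive read from a visual memory bank with residual fusion could look like. It is not the authors' formulation. Every name here (`adaptive_memory_read`, `gate_weights`, `gate_bias`), the dot-product attention, and the scalar sigmoid gate are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def adaptive_memory_read(query, queried_image, memory, gate_weights, gate_bias=0.0):
    """Hypothetical adaptive read of an external visual memory.

    query          -- question feature vector (length d)
    queried_image  -- question-attended image feature (length d)
    memory         -- list of stored grounded visual features (each length d)
    gate_weights   -- assumed learned vector (length d) scoring how much
                      dialog history the current question needs
    """
    if not memory:
        # Nothing stored yet: answer from question and image alone.
        return list(queried_image)

    # Attend over memory slots with dot-product scores (an assumption).
    weights = softmax([dot(query, slot) for slot in memory])
    attended = [
        sum(w * slot[i] for w, slot in zip(weights, memory))
        for i in range(len(query))
    ]

    # Scalar gate in (0, 1): near 0 ignores history, near 1 uses it fully.
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, query) + gate_bias)))

    # Residual fusion: the queried image is kept and the gated memory is added.
    return [img + gate * att for img, att in zip(queried_image, attended)]
```

In this sketch a gate close to zero reduces the output to the queried image feature, which corresponds to the case the abstract describes where the question and image suffice and the history would only add noise.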

Pages: 749-753

Page count: 4
