Adaptive Visual Memory Network for Visual Dialog

Cited by: 0
Authors
Zhao L. [1 ]
Gao L. [1 ]
Song J. [1 ]
Affiliations
[1] School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu
Source
Journal of University of Electronic Science and Technology of China, Vol. 50, 2021 | Corresponding author: Gao, Lianli (juana.alian@gmail.com)
Keywords
Adaptive; Attention mechanism; Memory network; Visual dialog
DOI
10.12178/1001-0548.2021057
Abstract
The key challenge in visual dialog is visual co-reference resolution. This paper proposes an adaptive visual memory network (AVMN) that applies an external memory bank to directly store grounded visual information. The textual and visual grounding processes are integrated so that potential errors in the two processes are effectively mitigated. Moreover, in many cases the answer can be produced from the question and image alone, and historical information can introduce unnecessary errors, so the external visual memory is read adaptively. Furthermore, a residual queried image is fused with the attended memory. Experiments indicate that the proposed method outperforms recent approaches on the evaluation metrics. © 2021, Editorial Board of Journal of the University of Electronic Science and Technology of China. All rights reserved.
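The abstract outlines three mechanisms: an attention read over an external visual memory, an adaptive gate controlling how much history-grounded memory is used, and residual fusion of the queried image with the attended memory. The following is a minimal PyTorch sketch of one plausible form of that forward pass; the class name AVMNSketch, all layer shapes, the scaled dot-product read, and the sigmoid-gate formulation are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVMNSketch(nn.Module):
    """Illustrative sketch of an adaptive visual memory read as described
    in the abstract. All names, sizes, and the gating form are assumed."""

    def __init__(self, dim=512):
        super().__init__()
        self.key_proj = nn.Linear(dim, dim)    # project memory slots to keys
        self.query_proj = nn.Linear(dim, dim)  # project question to a query
        self.gate = nn.Linear(2 * dim, 1)      # adaptive read gate (assumed form)
        self.fuse = nn.Linear(2 * dim, dim)    # fuse memory with queried image

    def forward(self, question, image, memory):
        # question: (B, D) encoded current question
        # image:    (B, D) question-attended ("queried") image feature
        # memory:   (B, T, D) external bank of grounded visual features
        q = self.query_proj(question)                              # (B, D)
        keys = self.key_proj(memory)                               # (B, T, D)
        scores = keys @ q.unsqueeze(-1) / q.size(-1) ** 0.5        # (B, T, 1)
        attn = F.softmax(scores, dim=1)
        read = (attn * memory).sum(dim=1)                          # (B, D)

        # Adaptive read: a scalar gate decides how much history-grounded
        # memory to use versus answering from question and image alone.
        g = torch.sigmoid(self.gate(torch.cat([question, image], dim=-1)))
        read = g * read

        # Residual fusion of the queried image with the attended memory.
        fused = image + self.fuse(torch.cat([image, read], dim=-1))
        return fused
```

Under these assumptions, when the gate saturates near zero the model answers from the question and queried image alone, which matches the abstract's observation that dialog history is often unnecessary and can introduce errors.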
Pages: 749-753
Page count: 4