RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1 ]
Wu, Heran [1 ]
Yang, Chun [1 ]
Zhou, Fang [1 ]
Qin, Jingyan [1 ,2 ]
Xiao, Lei [3 ]
Yin, Xu-Cheng [1 ,4 ,5 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering;
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains the text and scene objects. Then, it understands the question, the OCRed text and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the given question with the related text through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine stable relationships between the text and objects.
Pages: 1 - 12
Number of pages: 12
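
As a rough illustration of the text-centered pipeline described in the abstract, the following minimal PyTorch sketch encodes the question, the OCR tokens and the detected object labels, relates the OCR tokens to scene objects with cross-attention, and scores each contextualized OCR token against the question as an answer candidate. All module and variable names are hypothetical assumptions for illustration only; this is not the authors' implementation of RUArt.

# Hypothetical sketch of a text-centered answering pipeline (not the authors' code).
import torch
import torch.nn as nn

class TextCenteredVQASketch(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.question_enc = nn.GRU(dim, dim, batch_first=True)
        # Cross-attention lets each OCR token gather context from scene objects.
        self.text_object_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question_ids, ocr_ids, object_ids):
        # Encode the question; use the final hidden state as a summary vector.
        q_emb = self.embed(question_ids)                    # (B, Lq, D)
        _, q_state = self.question_enc(q_emb)               # (1, B, D)
        q_vec = q_state.squeeze(0)                          # (B, D)

        ocr_emb = self.embed(ocr_ids)                       # (B, Lt, D)
        obj_emb = self.embed(object_ids)                    # (B, Lo, D)

        # Relate OCR tokens to scene objects (the "mining relationships" step).
        ocr_ctx, _ = self.text_object_attn(ocr_emb, obj_emb, obj_emb)  # (B, Lt, D)

        # Semantic matching: score each contextualized OCR token against the question.
        match = ocr_ctx * q_vec.unsqueeze(1)                # (B, Lt, D)
        logits = self.score(match).squeeze(-1)              # (B, Lt)
        return logits                                       # argmax picks the answer token

# Toy usage with made-up token ids.
model = TextCenteredVQASketch()
q = torch.randint(0, 1000, (2, 6))    # question tokens
t = torch.randint(0, 1000, (2, 8))    # OCR tokens read from the image
o = torch.randint(0, 1000, (2, 5))    # detected object labels
print(model(q, t, o).shape)           # torch.Size([2, 8])

In this sketch the answer is restricted to the OCR tokens themselves, which mirrors the text-centered framing; a real system would also contextualize the OCR tokens with their surrounding text and support a fixed answer vocabulary.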