RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering

Cited by: 20
Authors
Jin, Zan-Xia [1 ]
Wu, Heran [1 ]
Yang, Chun [1 ]
Zhou, Fang [1 ]
Qin, Jingyan [1 ,2 ]
Xiao, Lei [3 ]
Yin, Xu-Cheng [1 ,4 ,5 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Dept Comp Sci & Technol, Beijing 100083, Peoples R China
[2] Univ Sci & Technol Beijing, Sch Mech Engn, Dept Ind Design, Beijing 100083, Peoples R China
[3] Tencent Technol Shenzhen Co Ltd, Shenzhen 518057, Peoples R China
[4] Univ Sci & Technol Beijing, Inst Artificial Intelligence, Beijing 100083, Peoples R China
[5] Univ Sci & Technol Beijing, USTB EEasy Tech Joint Lab Artificial Intelligence, Beijing 100083, Peoples R China
Keywords
Attention mechanism; computer vision; machine reading comprehension; natural language processing; visual question answering;
DOI
10.1109/TMM.2021.3120194
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812 ;
Abstract
Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model, without considering the contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains the text and scene objects. Then, it understands the question, the OCRed text and the objects in the context of the scene, and further mines the relationships among them. Finally, it answers the given question with the related text through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine stable relationships between the text and objects.
Pages: 1 - 12
Number of pages: 12
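
As a rough illustration of the text-centered pipeline described in the abstract, the following minimal PyTorch sketch encodes the question, the OCR tokens and the detected object labels, relates the OCR tokens to scene objects with cross-attention, and scores each contextualized OCR token against the question as an answer candidate. All module and variable names are hypothetical assumptions for illustration only; this is not the authors' implementation of RUArt.

# Hypothetical sketch of a text-centered answering pipeline (not the authors' code).
import torch
import torch.nn as nn

class TextCenteredVQASketch(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.question_enc = nn.GRU(dim, dim, batch_first=True)
        # Cross-attention lets each OCR token gather context from scene objects.
        self.text_object_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, question_ids, ocr_ids, object_ids):
        # Encode the question; use the final hidden state as a summary vector.
        q_emb = self.embed(question_ids)                    # (B, Lq, D)
        _, q_state = self.question_enc(q_emb)               # (1, B, D)
        q_vec = q_state.squeeze(0)                          # (B, D)

        ocr_emb = self.embed(ocr_ids)                       # (B, Lt, D)
        obj_emb = self.embed(object_ids)                    # (B, Lo, D)

        # Relate OCR tokens to scene objects (the "mining relationships" step).
        ocr_ctx, _ = self.text_object_attn(ocr_emb, obj_emb, obj_emb)  # (B, Lt, D)

        # Semantic matching: score each contextualized OCR token against the question.
        match = ocr_ctx * q_vec.unsqueeze(1)                # (B, Lt, D)
        logits = self.score(match).squeeze(-1)              # (B, Lt)
        return logits                                       # argmax picks the answer token

# Toy usage with made-up token ids.
model = TextCenteredVQASketch()
q = torch.randint(0, 1000, (2, 6))    # question tokens
t = torch.randint(0, 1000, (2, 8))    # OCR tokens read from the image
o = torch.randint(0, 1000, (2, 5))    # detected object labels
print(model(q, t, o).shape)           # torch.Size([2, 8])

In this sketch the answer is restricted to the OCR tokens themselves, which mirrors the text-centered framing; a real system would also contextualize the OCR tokens with their surrounding text and support a fixed answer vocabulary.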