Scene Text Visual Question Answering

Cited by: 145
Authors
Biten, Ali Furkan [1]
Tito, Ruben [1]
Mafla, Andres [1]
Gomez, Lluis [1]
Rusinol, Marcal [1]
Valveny, Ernest [1]
Jawahar, C. V. [2]
Karatzas, Dimosthenis [1]
Affiliations
[1] UAB, Comp Vis Ctr, Barcelona, Spain
[2] IIIT Hyderabad, CVIT, Hyderabad, India
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.00439
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
Pages: 4290-4300
Page count: 11