Scene-Text Aware Image and Text Retrieval with Dual-Encoder

Cited by: 0
Authors
Miyawaki, Shumpei [1]
Hasegawa, Taku [2]
Nishida, Kyosuke [2]
Kato, Takuma [1]
Suzuki, Jun [1]
Affiliations
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] NTT Human Informat Labs, Tokyo, Japan
Source
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): STUDENT RESEARCH WORKSHOP | 2022
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
We tackle the tasks of image and text retrieval using a dual-encoder model in which images and text are encoded independently. This model has attracted attention as an approach that enables efficient offline inferences by connecting both vision and language in the same semantic space. However, whether an image encoder as part of a dual-encoder model can interpret scene-text, i.e., the textual information in images, is unclear. We propose pre-training methods that encourage a joint understanding of the scene-text and surrounding visual information. The experimental results demonstrate that our methods improve the retrieval performances of the dual-encoder models.
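The offline-inference property mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are random stand-ins for the outputs of trained image and text encoders, and the names `l2_normalize` and `retrieve` are hypothetical. The sketch only shows that, because the two modalities are encoded independently, image embeddings can be computed once and stored, and each text query then needs a single matrix-vector product.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query_emb: np.ndarray, index_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k stored embeddings most similar to the query."""
    scores = index_embs @ query_emb   # one similarity score per stored item
    return np.argsort(-scores)[:k]    # highest scores first


# Stand-in embeddings (assumption): a real system would obtain these from
# a trained image encoder and a trained text encoder.
rng = np.random.default_rng(0)

# Offline step: encode every image once and keep the normalized vectors.
image_embs = l2_normalize(rng.normal(size=(5, 8)))

# Online step: encode the text query alone, then compare against the stored index.
# Here the query is constructed to lie close to image 2 for illustration.
text_emb = l2_normalize(image_embs[2] + 0.01 * rng.normal(size=8))

top = retrieve(text_emb, image_embs, k=3)
print(top)
```

The same index can serve image-to-text retrieval by storing text embeddings and querying with an image embedding.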
Pages: 422-433
Page count: 12