Describing and Localizing Multiple Changes with Transformers

Cited by: 42

Authors:

Qiu, Yue [1]

Yamamoto, Shintaro [1,2]

Nakashima, Kodai [1]

Suzuki, Ryota [1]

Iwata, Kenji [1]

Kataoka, Hirokatsu [1]

Satoh, Yutaka [1]

Affiliations:

[1] Natl Inst Adv Ind Sci & Technol, Tokyo, Japan

[2] Waseda Univ, Tokyo, Japan

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Keywords:

ATTENTION;

DOI:

10.1109/ICCV48922.2021.00198

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and to generate a natural language description of those changes. Existing change captioning studies have mainly focused on a single change. However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We address this issue from three aspects: (i) we propose a simulation-based multi-change captioning dataset; (ii) we benchmark existing state-of-the-art single-change captioning methods on multi-change captioning; (iii) we further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determine the change regions related to the words in sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. Additionally, the proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores), indicating its general ability in change captioning tasks. The code and dataset are available at the project page.


Pages: 1951-1960

Page count: 10
