CaMEL: Mean Teacher Learning for Image Captioning

Cited by: 26
Authors
Barraco, Manuele [1 ]
Stefanini, Matteo [1 ]
Cornia, Marcella [1 ]
Cascianelli, Silvia [1 ]
Baraldi, Lorenzo [1 ]
Cucchiara, Rita [1 ]
Affiliations
[1] Univ Modena & Reggio Emilia, Modena, Italy
Source
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2022
Keywords
DOI
10.1109/ICPR56361.2022.9955644
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors. When comparing with existing proposals, we demonstrate that our model provides state-of-the-art caption quality with a significantly reduced number of parameters. According to the CIDEr metric, we obtain a new state of the art on COCO when training without using external data. The source code and trained models will be made publicly available at: https://github.com/aimagelab/camel.
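The abstract only summarizes the training scheme, so the following PyTorch snippet is a minimal sketch of the two standard ingredients of the mean teacher paradigm with knowledge distillation it refers to: a teacher whose weights track an exponential moving average (EMA) of the student's, and a student loss mixing ground-truth cross-entropy with a KL term towards the teacher's softened predictions. The `TinyCaptioner` model, EMA decay, temperature, and loss weights are illustrative assumptions, and the conditioning on visual features is omitted; none of this is taken from CaMEL's actual architecture or hyperparameters.

```python
# Minimal sketch of mean-teacher training with knowledge distillation.
# The tiny language model, EMA decay, temperature, and loss weights below are
# illustrative assumptions; they are NOT taken from the CaMEL implementation.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCaptioner(nn.Module):
    """Toy autoregressive language model standing in for a captioning decoder."""

    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        return self.head(self.encoder(self.embed(tokens), mask=causal_mask))


@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Mean-teacher step: teacher weights track an exponential moving average
    of the student weights instead of receiving gradient updates."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)


def distillation_loss(student_logits, teacher_logits, targets,
                      alpha=0.5, temperature=2.0):
    """Cross-entropy on ground-truth tokens plus KL divergence towards the
    teacher's softened predictions (a standard knowledge-distillation recipe)."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), targets.flatten())
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ce + alpha * kd


student = TinyCaptioner()
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)               # the teacher never receives gradients

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

tokens = torch.randint(0, 1000, (8, 20))  # dummy caption tokens (batch, length)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

student_logits = student(inputs)
with torch.no_grad():
    teacher_logits = teacher(inputs)

loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
ema_update(teacher, student)              # EMA update of the teacher after each step
```

In this sketch the teacher is never optimized directly; it only averages the student's weights over time, which is what makes its soft targets a stable distillation signal during training.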
Pages: 4087-4094
Page count: 8