ArtEmis: Affective Language for Visual Art

Cited by: 57
Authors
Achlioptas, Panos [1]
Ovsjanikov, Maks [2]
Haydarov, Kilichbek [3]
Elhoseiny, Mohamed [1,3]
Guibas, Leonidas [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] IP Paris, Ecole Polytech, LIX, Paris, France
[3] King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.01140
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
We present a novel large-scale dataset and accompanying machine learning models aimed at providing a detailed understanding of the interplay between visual content, its emotional effect, and explanations for the latter in language. In contrast to most existing annotation datasets in computer vision, we focus on the affective experience triggered by visual artworks and ask the annotators to indicate the dominant emotion they feel for a given image and, crucially, to also provide a grounded verbal explanation for their emotion choice. As we demonstrate below, this leads to a rich set of signals for both the objective content and the affective impact of an image, creating associations with abstract concepts (e.g., "freedom" or "love"), or references that go beyond what is directly visible, including visual similes and metaphors, or subjective references to personal experiences. We focus on visual art (e.g., paintings, artistic photographs) as it is a prime example of imagery created to elicit emotional responses from its viewers. Our dataset, termed ArtEmis, contains 455K emotion attributions and explanations from humans, on 80K artworks from WikiArt. Building on this data, we train and demonstrate a series of captioning systems capable of expressing and explaining emotions from visual stimuli. Remarkably, the captions produced by these systems often succeed in reflecting the semantic and abstract content of the image, going well beyond systems trained on existing datasets.
Pages: 11564-11574
Page count: 11