StimuVAR: Spatiotemporal Stimuli-Aware Video Affective Reasoning with Multimodal Large Language Models

Cited: 0
Authors
Guo, Yuxiang [1]
Siddiqui, Faizan [2]
Zhao, Yang [1]
Chellappa, Rama [1]
Lo, Shao-Yuan [2]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Honda Res Inst USA, San Jose, CA 95134 USA
Keywords
Video affective reasoning; Emotion recognition; Emotional stimuli; Multimodal large language models
DOI
10.1007/s11263-025-02495-3
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Predicting and reasoning about how a video would make a human feel is crucial for developing socially intelligent systems. Although Multimodal Large Language Models (MLLMs) have shown impressive video understanding capabilities, they tend to focus on the semantic content of videos, often overlooking emotional stimuli. Hence, most existing MLLMs fall short in estimating viewers' emotional reactions and providing plausible explanations. To address this issue, we propose StimuVAR, a spatiotemporal Stimuli-aware framework for Video Affective Reasoning (VAR) with MLLMs. StimuVAR incorporates a two-level stimuli-aware mechanism: frame-level awareness and token-level awareness. Frame-level awareness samples the video frames containing events most likely to evoke viewers' emotions. Token-level awareness performs tube selection in the token space so that the MLLM concentrates on emotion-triggering spatiotemporal regions. Furthermore, we create VAR instruction data to perform affective training, steering MLLMs' reasoning strengths towards emotional focus and thereby enhancing their affective reasoning ability. To thoroughly assess the effectiveness of VAR, we provide a comprehensive evaluation protocol with extensive metrics. StimuVAR is the first MLLM-based method for viewer-centered VAR. Experiments demonstrate its superiority in understanding viewers' emotional responses to videos and providing coherent and insightful explanations.
Pages: 17