CL-HOI: Cross-level human-object interaction distillation from multimodal large language models

Cited by: 0

Authors:

Gao, Jianjun [1]

Cai, Chen [1]

Wang, Ruoyu [1]

Liu, Wenyang [1]

Yap, Kim-Hui [1]

Garg, Kratika [2]

Han, Boon Siew [2]

Affiliations:

[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore

[2] Nanyang Technol Univ, Schaeffler Hub Adv Res, Singapore, Singapore

Source:

KNOWLEDGE-BASED SYSTEMS | 2025 / Vol. 320

Keywords:

Human-object interaction; Interaction detection; Knowledge distillation; Relation understanding

DOI:

10.1016/j.knosys.2025.113561

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Human-object interaction (HOI) detection often relies on labor-intensive annotations, but multimodal large language models (MLLMs) show potential for recognizing and reasoning about image-level interactions. However, MLLMs are typically computationally heavy and lack instance-level HOI detection capabilities. In this paper, we propose a cross-level HOI distillation (CL-HOI) framework that distills instance-level HOI detection from MLLMs, expanding HOI detection without labor-intensive and expensive manual annotations. Our approach uses CL-HOI as a student model to distill HOIs from a teacher MLLM in two stages: context distillation, where a visual-linguistic translator (VLT) converts visual information into linguistic form, and interaction distillation, where an interaction cognition network (ICN) facilitates interaction reasoning. Contrastive distillation losses transfer image-level context and interactions to the VLT and ICN for instance-level HOI detection. Evaluations on the HICO-DET and V-COCO datasets show that our method outperforms existing weakly supervised approaches, demonstrating its effectiveness in HOI detection without manual annotations.


Pages: 10
