CL-HOI: Cross-level human-object interaction distillation from multimodal large language models

Cited by: 0
Authors
Gao, Jianjun [1]
Cai, Chen [1]
Wang, Ruoyu [1]
Liu, Wenyang [1]
Yap, Kim-Hui [1]
Garg, Kratika [2]
Han, Boon Siew [2]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore
[2] Nanyang Technol Univ, Schaeffler Hub Adv Res, Singapore, Singapore
Keywords
Human-object interaction; Interaction detection; Knowledge distillation; Relation understanding
DOI
10.1016/j.knosys.2025.113561
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection often relies on labor-intensive annotations, but multimodal large language models (MLLMs) show potential for recognizing and reasoning about image-level interactions. However, MLLMs are typically computationally heavy and lack instance-level HOI detection capabilities. In this paper, we propose a cross-level HOI distillation (CL-HOI) framework that distills instance-level HOI detection from MLLMs, expanding HOI detection without labor-intensive and expensive manual annotations. Our approach uses CL-HOI as a student model to distill HOIs from a teacher MLLM in two stages: context distillation, where a visual-linguistic translator (VLT) converts visual information into linguistic form, and interaction distillation, where an interaction cognition network (ICN) facilitates interaction reasoning. Contrastive distillation losses transfer image-level context and interactions to the VLT and ICN for instance-level HOI detection. Evaluations on the HICO-DET and V-COCO datasets show that our method outperforms existing weakly supervised approaches, demonstrating its effectiveness in HOI detection without manual annotations.
Pages: 10