Few-shot human-object interaction video recognition with transformers

Citations: 17
Authors
Li, Qiyue [1]
Xie, Xuemei [1]
Zhang, Jin [1]
Shi, Guangming [1]
Affiliations
[1] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Shaanxi, Peoples R China
Keywords
Few-shot learning; Meta-learning; Human-object interaction recognition; Transformers; Affordances
DOI
10.1016/j.neunet.2023.01.019
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a novel few-shot learning framework that recognizes human-object interaction (HOI) classes from only a few labeled samples. The framework follows a meta-learning paradigm in which human-object interactions are embedded into compact features for similarity calculation. More specifically, the spatial and temporal relationships of HOI in videos are modeled with transformers, which significantly improves performance over the baseline. First, a spatial encoder extracts the spatial context and infers frame-level features of the human and objects in each frame. A temporal encoder then encodes the series of frame-level feature vectors into a video-level feature. Experiments on two datasets, CAD-120 and Something-Else, show accuracy improvements of 7.8% and 15.2%, respectively, on the 1-shot task and 4.7% and 15.7% on the 5-shot task, outperforming state-of-the-art methods. (c) 2023 Published by Elsevier Ltd.
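The similarity step described in the abstract can be illustrated with a small sketch. It assumes the video-level features have already been produced by the spatial and temporal encoders, and it uses class prototypes (the mean of each class's support features) with cosine similarity. Both choices are assumptions made for illustration, since the abstract does not state which similarity measure the authors use. The function names, the class labels "open" and "pour", and the toy feature values are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_query(support, query):
    """Assign a query feature to the most similar class prototype.

    support: dict mapping a class label to a list of video-level features
             (assumed to come from the encoders; not computed here).
    query:   one video-level feature vector.
    Returns (best_label, scores), where scores maps each label to its similarity.
    """
    # Assumption: one prototype per class, the mean of its support features.
    prototypes = {label: mean_vector(feats) for label, feats in support.items()}
    scores = {
        label: cosine_similarity(proto, query)
        for label, proto in prototypes.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy 2-way 2-shot episode with made-up 3-dimensional features.
support = {
    "open": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "pour": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.0]],
}
query = [0.8, 0.2, 0.0]

label, scores = classify_query(support, query)
print(label, scores)
```

In an N-way K-shot episode, `support` would hold K features for each of N classes; the 1-shot and 5-shot tasks mentioned in the abstract correspond to K = 1 and K = 5.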
Pages: 1-9
Number of pages: 9