STFormer: Spatio-temporal former for hand-object interaction recognition from egocentric RGB video

Times Cited: 0
Authors
Liang, Jiao [1 ,2 ]
Wang, Xihan [1 ,2 ]
Yang, Jiayi [1 ,2 ]
Gao, Quanli [1 ,2 ]
Affiliations
[1] Xian Polytech Univ, State Prov Joint Engn & Res Ctr Adv Networking & I, Xian, Peoples R China
[2] Xian Polytech Univ, Sch Comp Sci, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
computer vision; image classification; pose estimation;
DOI
10.1049/ell2.70010
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
In recent years, video-based hand-object interaction has received widespread attention from researchers. However, due to the complexity and occlusion of hand movements, hand-object interaction recognition from RGB videos remains a highly challenging task. Here, an end-to-end spatio-temporal former (STFormer) network for understanding hand behaviour in interactions is proposed. The network consists of three modules: a FlexiViT feature extractor, a hand-object pose estimator, and an interaction action classifier. FlexiViT extracts multi-scale features from each image frame; the hand-object pose estimator predicts 3D hand pose keypoints and an object label for each frame; and the interaction action classifier predicts the interaction action category for the entire video. Experimental results demonstrate that the approach achieves competitive recognition accuracies of 94.96% and 88.84% on two datasets, namely First-Person Hand Action (FPHA) and 2 Hands and Objects (H2O). To attain semantic comprehension of lengthy videos, 3D hand pose keypoints and interaction object labels are predicted for each image frame, and the temporal dependencies between frames are modelled to predict the interaction action category over the complete video.
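The three-module data flow described in the abstract (per-frame FlexiViT features → per-frame pose/object prediction → video-level action classification) can be sketched as follows. This is a minimal illustrative skeleton only: the function names, feature dimension, 21-keypoint hand layout, and voting-based aggregation are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of the STFormer pipeline's data flow.
# All numerics are dummy stand-ins; only the stage structure mirrors the abstract.

def extract_features(frame, dim=8):
    # Stand-in for FlexiViT multi-scale feature extraction:
    # reduce a frame (flat list of pixel values) to a fixed-length vector.
    return [sum(frame[i::dim]) / max(1, len(frame[i::dim])) for i in range(dim)]

def estimate_pose(features, num_keypoints=21):
    # Stand-in for the hand-object pose estimator: per frame, predict
    # 3D hand keypoints and an object label from the feature vector.
    s = sum(features)
    keypoints = [(s, -s, 0.0)] * num_keypoints  # dummy 21 hand keypoints (x, y, z)
    object_label = int(s) % 2                   # dummy object label
    return keypoints, object_label

def classify_interaction(per_frame_outputs):
    # Stand-in for the interaction action classifier: aggregate per-frame
    # predictions across the whole video into one action category.
    votes = [label for _, label in per_frame_outputs]
    return max(set(votes), key=votes.count)

def stformer_pipeline(video):
    # video: list of frames; each frame is a flat list of pixel values.
    per_frame = [estimate_pose(extract_features(f)) for f in video]
    return classify_interaction(per_frame)
```

In the actual model the temporal stage would attend over the frame sequence (e.g. with a transformer encoder) rather than vote, but the staged structure (frame features, per-frame pose and object labels, video-level action) is the point of the sketch.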
Pages: 3