STFormer: Spatio-temporal former for hand-object interaction recognition from egocentric RGB video

Cited: 0
Authors
Liang, Jiao [1 ,2 ]
Wang, Xihan [1 ,2 ]
Yang, Jiayi [1 ,2 ]
Gao, Quanli [1 ,2 ]
Affiliations
[1] Xian Polytech Univ, State Prov Joint Engn & Res Ctr Adv Networking & I, Xian, Peoples R China
[2] Xian Polytech Univ, Sch Comp Sci, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
computer vision; image classification; pose estimation;
DOI
10.1049/ell2.70010
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Code
0808; 0809;
Abstract
In recent years, video-based hand-object interaction has received widespread attention from researchers. However, due to the complexity and occlusion of hand movements, hand-object interaction recognition from RGB videos remains a highly challenging task. Here, an end-to-end spatio-temporal former (STFormer) network for understanding hand behaviour in interactions is proposed. The network consists of three modules: a FlexiViT feature extractor, a hand-object pose estimator, and an interaction action classifier. The FlexiViT extracts multi-scale features from each image frame; the hand-object pose estimator predicts 3D hand pose keypoints and object labels for each frame; and the interaction action classifier predicts the interaction action category for the entire video. Experimental results demonstrate that the approach achieves competitive recognition accuracies of 94.96% and 88.84% on two datasets, namely First-Person Hand Action (FPHA) and 2 Hands and Objects (H2O). To attain semantic understanding of lengthy videos, 3D hand pose keypoints and interaction object labels are predicted for each image frame, and the temporal dependency of the video sequence is exploited to model inter-frame relationships and predict the interaction action category of the complete video.
Pages: 3
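
For illustration only, the following is a minimal sketch of how the three-module pipeline described in the abstract could be organised: a per-frame feature extractor standing in for FlexiViT, per-frame heads for 3D hand keypoints and the interacting-object label, and a temporal model over the frame sequence for video-level action classification. The layer choices, dimensions, class counts, and the use of a transformer encoder for temporal modelling are assumptions made for this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class STFormerSketch(nn.Module):
    def __init__(self, feat_dim=768, num_joints=21, num_objects=8, num_actions=45):
        super().__init__()
        # Stand-in for the FlexiViT backbone: any per-frame extractor mapping an
        # RGB frame to a feat_dim vector would fit here (illustrative only).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Per-frame heads: 3D hand pose keypoints and interacting-object logits.
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)
        self.object_head = nn.Linear(feat_dim, num_objects)
        # Temporal model over the frame sequence (assumed transformer encoder).
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # (b, t, feat_dim)
        poses = self.pose_head(feats).view(b, t, -1, 3)              # per-frame 3D keypoints
        objects = self.object_head(feats)                            # per-frame object logits
        action = self.action_head(self.temporal(feats).mean(dim=1))  # one action label per video
        return poses, objects, action

if __name__ == "__main__":
    model = STFormerSketch()
    poses, objects, action = model(torch.randn(2, 16, 3, 224, 224))
    print(poses.shape, objects.shape, action.shape)  # (2, 16, 21, 3) (2, 16, 8) (2, 45)

The key design point conveyed by the abstract is that pose and object labels are predicted per frame, while the action category is predicted once per video from the temporally aggregated frame features.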