STFormer: Spatio-temporal former for hand-object interaction recognition from egocentric RGB video

Cited: 0
Authors
Liang, Jiao [1 ,2 ]
Wang, Xihan [1 ,2 ]
Yang, Jiayi [1 ,2 ]
Gao, Quanli [1 ,2 ]
Affiliations
[1] Xian Polytech Univ, State Prov Joint Engn & Res Ctr Adv Networking & I, Xian, Peoples R China
[2] Xian Polytech Univ, Sch Comp Sci, Xian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
computer vision; image classification; pose estimation;
DOI
10.1049/ell2.70010
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Code
0808; 0809;
Abstract
In recent years, video-based hand-object interaction has received widespread attention from researchers. However, due to the complexity and occlusion of hand movements, hand-object interaction recognition from RGB videos remains a highly challenging task. Here, an end-to-end spatio-temporal former (STFormer) network for understanding hand behaviour in interactions is proposed. The network consists of three modules: a FlexiViT feature extractor, a hand-object pose estimator, and an interaction action classifier. The FlexiViT extracts multi-scale features from each image frame; the hand-object pose estimator predicts 3D hand pose keypoints and object labels for each frame; and the interaction action classifier predicts the interaction action category for the entire video. Experimental results demonstrate that the approach achieves competitive recognition accuracies of 94.96% and 88.84% on two datasets, namely First-Person Hand Action (FPHA) and 2 Hands and Objects (H2O). To attain semantic understanding of lengthy videos, 3D hand pose keypoints and interaction object labels are predicted for each image frame, and the temporal dependency of the video sequence is exploited to model inter-frame relationships and predict the interaction action category of the complete video.
Pages: 3
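
For illustration only, the following is a minimal sketch of how the three-module pipeline described in the abstract could be organised: a per-frame feature extractor standing in for FlexiViT, per-frame heads for 3D hand keypoints and the interacting-object label, and a temporal model over the frame sequence for video-level action classification. The layer choices, dimensions, class counts, and the use of a transformer encoder for temporal modelling are assumptions made for this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class STFormerSketch(nn.Module):
    def __init__(self, feat_dim=768, num_joints=21, num_objects=8, num_actions=45):
        super().__init__()
        # Stand-in for the FlexiViT backbone: any per-frame extractor mapping an
        # RGB frame to a feat_dim vector would fit here (illustrative only).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Per-frame heads: 3D hand pose keypoints and interacting-object logits.
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)
        self.object_head = nn.Linear(feat_dim, num_objects)
        # Temporal model over the frame sequence (assumed transformer encoder).
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # (b, t, feat_dim)
        poses = self.pose_head(feats).view(b, t, -1, 3)              # per-frame 3D keypoints
        objects = self.object_head(feats)                            # per-frame object logits
        action = self.action_head(self.temporal(feats).mean(dim=1))  # one action label per video
        return poses, objects, action

if __name__ == "__main__":
    model = STFormerSketch()
    poses, objects, action = model(torch.randn(2, 16, 3, 224, 224))
    print(poses.shape, objects.shape, action.shape)  # (2, 16, 21, 3) (2, 16, 8) (2, 45)

The key design point conveyed by the abstract is that pose and object labels are predicted per frame, while the action category is predicted once per video from the temporally aggregated frame features.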