Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

Cited by: 7
Authors
Planamente, Mirco [1, 2]
Bottino, Andrea [1]
Caputo, Barbara [1, 2]
Affiliations
[1] Politecn Torino, Dept Control & Comp Engn, Turin, Italy
[2] Italian Inst Technol, Genoa, Italy
Source
2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR) | 2021
Keywords
Egocentric Vision; Action Recognition; Multi-task Learning; Motion Prediction; Self-supervised Learning;
DOI
10.1109/ICPR48806.2021.9411972
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. Thus, the amount of information about the action itself is limited, making crucial the understanding of the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial to capture the spatio-temporal correlations between the two better. To this end, we propose a single stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.
Pages: 8751-8758
Page count: 8