Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

Cited by: 8
Authors
Zhou, Yujie [1 ]
Qiang, Wenwen [2 ]
Rao, Anyi [3 ]
Lin, Ning [1 ]
Su, Bing [1 ,4 ]
Wang, Jiaqi [5 ]
Affiliations
[1] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Chinese Acad Sci, Inst Software, Beijing, Peoples R China
[3] Stanford Univ, Stanford, CA 94305 USA
[4] Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
[5] Shanghai AI Lab, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China (NSFC)
Keywords
Zero-shot Learning; Human Skeleton Data; Action Recognition;
DOI
10.1145/3581783.3611888
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405
Abstract
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories. The key is to build the connection between the visual and semantic spaces from seen to unseen classes. Previous studies have primarily focused on encoding each sequence into a single feature vector and subsequently mapping the features to an identical anchor point within the embedding space. Their performance is hindered by 1) the neglect of global visual/semantic distribution alignment, which limits their ability to capture the true interdependence between the two spaces, and 2) the loss of temporal information, since the frame-wise features with rich action clues are directly pooled into a single feature vector. We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization. Specifically, 1) we maximize the MI between the visual and semantic spaces for distribution alignment; 2) we leverage temporal information for estimating the MI by encouraging the MI to increase as more frames are observed. Extensive experiments on three large-scale skeleton action datasets confirm the effectiveness of our method.
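MI between visual and semantic features is commonly maximized via a contrastive (InfoNCE-style) lower bound, where matched skeleton/class-description pairs act as positives and other pairs in the batch as negatives. The sketch below is an illustrative assumption about how such an objective can be computed, not the paper's exact loss; all array names are hypothetical.

```python
import numpy as np

def info_nce_loss(visual, semantic, temperature=0.1):
    """InfoNCE lower bound on MI between paired visual/semantic features.

    visual, semantic: (N, D) arrays of L2-normalized embeddings; row i of
    each is a matched (skeleton sequence, class description) pair.
    Minimizing this loss maximizes a lower bound on their mutual information.
    """
    # Cosine-similarity logits between every visual/semantic pair.
    logits = visual @ semantic.T / temperature            # (N, N)
    # Cross-entropy with matched pairs (the diagonal) as positives.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
sem = normalize(rng.normal(size=(8, 16)))                     # semantic anchors
aligned = normalize(sem + 0.05 * rng.normal(size=(8, 16)))    # well-aligned visual features
random_vis = normalize(rng.normal(size=(8, 16)))              # unrelated visual features

# Aligned pairs yield a lower loss (a higher MI estimate) than random pairs.
print(info_nce_loss(aligned, sem) < info_nce_loss(random_vis, sem))  # True
```

In practice the same loss can be evaluated on progressively longer frame prefixes of each sequence, so that the estimated MI is encouraged to grow as more frames are observed, matching the temporal constraint described in the abstract.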
Pages: 5302-5310
Number of pages: 9