A Multi-scale Interaction Motion Network for Action Recognition Based on Capsule Network

Cited by: 0

Authors:

Zheng, Xiangping [1]

Liang, Xun [1]

Wu, Bo [1]

Wang, Jun [2]

Guo, Yuhui [1]

Zhang, Xuan [1]

Mai, Yuefeng [3]

Affiliations:

[1] Renmin Univ China, Infomat Sch, Beijing, Peoples R China

[2] Swinburne Univ Technol, Melbourne, Vic, Australia

[3] Qufu Normal Univ, Shandong, Peoples R China

Source:

PROCEEDINGS OF THE 2023 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Action recognition has recently achieved impressive performance, mainly owing to deep convolutional neural networks and large datasets. Most prior efforts capture motion information through dense optical flow, but optical flow extraction is very time-consuming. Moreover, prior work seeks to improve accuracy but neglects the part-whole relationships between objects in videos, which can be self-defeating and may even degrade performance. To address these challenges, we present a novel collaborative multipath capsule network (CMCN) for action recognition. In particular, we propose a plug-and-play collaborative multipath block containing spatiotemporal, channel, and motion units, which provide complementary information that is crucial for action recognition. We exploit the interaction of these three units and selectively emphasize informative spatiotemporal motion to reduce the computational cost. We then explore a new capsule voting procedure to reduce the computation used in the capsule dynamic routing mechanism. The key insight is that capsules of the same type model the same entity at different positions, so their voting results should be consistent. This strategy reduces the number of learnable parameters updated in the backward pass during training and thus strengthens part-whole relationships in a video. Extensive experiments on multiple real-world action recognition datasets demonstrate that our model significantly outperforms state-of-the-art models.
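
The voting idea in the abstract can be illustrated with a small sketch. The abstract does not specify the mechanism, so the code below assumes one common reading: every capsule of a given type reuses a single transformation matrix at all spatial positions, instead of learning one matrix per type and position. All names (`shared_votes`, `count_parameters`) and all sizes are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of random weights in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def shared_votes(poses, weights_by_type):
    """Compute votes where poses[t][p] is the pose of capsule type t at position p.

    Assumption: one matrix per capsule type is reused at every position.
    """
    return [
        [matvec(weights_by_type[t], pose) for pose in poses[t]]
        for t in range(len(poses))
    ]


def count_parameters(num_types, num_positions, in_dim, out_dim, shared):
    """Count transformation weights with or without sharing across positions."""
    matrices = num_types if shared else num_types * num_positions
    return matrices * in_dim * out_dim


# Hypothetical sizes: 8 capsule types on a 6 x 6 grid, 8-D poses, 16-D votes.
NUM_TYPES, NUM_POSITIONS, IN_DIM, OUT_DIM = 8, 36, 8, 16

rng = random.Random(0)
weights = [make_matrix(OUT_DIM, IN_DIM, rng) for _ in range(NUM_TYPES)]

# Give every position the same pose, so shared weights must yield equal votes.
poses = [[[1.0] * IN_DIM for _ in range(NUM_POSITIONS)] for _ in range(NUM_TYPES)]
votes = shared_votes(poses, weights)

unshared_params = count_parameters(NUM_TYPES, NUM_POSITIONS, IN_DIM, OUT_DIM, shared=False)
shared_params = count_parameters(NUM_TYPES, NUM_POSITIONS, IN_DIM, OUT_DIM, shared=True)
print(unshared_params, shared_params)
```

Under these assumed sizes, sharing divides the number of transformation weights by the number of positions, and identical poses at different positions produce identical votes for a given capsule type.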

Citation

Pages: 505-513

Page count: 9

