Spatiotemporal distilled dense-connectivity network for video action recognition

Cited by: 41
Authors
Hao, Wangli [1 ,3 ]
Zhang, Zhaoxiang [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci CASIA Beijing, Inst Automat, CRIPAC, NLPR, Beijing 100190, Peoples R China
[2] Ctr Excellence Brain Sci & Intelligence Technol C, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci UCAS Beijing, Beijing 100190, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
Two-stream; Action recognition; Dense-connectivity; Knowledge distillation;
DOI
10.1016/j.patcog.2019.03.005
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Two-stream convolutional neural networks show great promise for action recognition tasks. However, most two-stream approaches train the appearance and motion subnetworks independently, which can degrade performance because the two streams never interact. To overcome this limitation, we propose a Spatiotemporal Distilled Dense-Connectivity Network (STDDCN) for video action recognition, which combines knowledge distillation with dense connectivity (adapted from DenseNet). With this architecture, we explore interaction strategies between the appearance and motion streams at different hierarchies. Specifically, block-level dense connections between the appearance and motion pathways enable spatiotemporal interaction at the feature-representation layers. Moreover, knowledge distillation between the two streams (each treated as a student) and their final fusion (treated as the teacher) allows both streams to interact at the high-level layers. This architecture lets STDDCN gradually obtain effective hierarchical spatiotemporal features, and the network can be trained end-to-end. Finally, extensive ablation studies validate the effectiveness and generalization of our model on two benchmark datasets, UCF101 and HMDB51, on which it achieves promising performance. (C) 2019 Elsevier Ltd. All rights reserved.
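The distillation scheme in the abstract (both streams as students, their fusion as teacher) can be illustrated with a minimal numpy sketch. This is an assumption-laden illustration, not the paper's implementation: the fusion operator (here a simple logit average), the temperature `T`, and the use of KL divergence as the distillation loss are common conventions from the knowledge-distillation literature, and the paper's exact choices may differ.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-12):
    # KL(p || q), the usual teacher-to-student discrepancy measure.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def distillation_loss(appearance_logits, motion_logits, T=2.0):
    # Teacher: a fusion of the two stream outputs (average of logits here,
    # a placeholder for whatever fusion the paper actually uses).
    teacher_logits = 0.5 * (np.asarray(appearance_logits, dtype=float)
                            + np.asarray(motion_logits, dtype=float))
    teacher = softmax(teacher_logits, T)
    # Students: each stream is pulled toward the softened teacher distribution.
    loss_app = kl_div(teacher, softmax(appearance_logits, T))
    loss_mot = kl_div(teacher, softmax(motion_logits, T))
    return loss_app + loss_mot
```

When both streams agree, the teacher coincides with each student and the loss vanishes; when they disagree, each stream is nudged toward their shared fused prediction, which is the high-level interaction the abstract describes.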
Pages: 13-24
Page count: 12
Related Papers
50 records in total
  • [1] Sparse Dense Transformer Network for Video Action Recognition
    Qu, Xiaochun
    Zhang, Zheyuan
    Xiao, Wei
    Ran, Jinye
    Wang, Guodong
    Zhang, Zili
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, 2022, 13369 : 43 - 56
  • [2] Dense Dilated Network for Video Action Recognition
    Xu, Baohan
    Ye, Hao
    Zheng, Yingbin
    Wang, Heng
    Luwang, Tianyu
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (10) : 4941 - 4953
  • [3] Spatiotemporal squeeze-and-excitation residual multiplier network for video action recognition
    Luo H.
    Tong K.
Tongxin Xuebao/Journal on Communications, 2019, 40 (10): 189 - 198
  • [4] Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition
    Cai, Yutong
    Lin, Weiyao
    See, John
    Cheng, Ming-Ming
    Liu, Guangcan
    Xiong, Hongkai
    2018 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (IEEE VCIP), 2018,
  • [5] Spatiotemporal Fusion Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    Zhang, Junxuan
    NEURAL PROCESSING LETTERS, 2019, 50 (02) : 1877 - 1890
  • [6] Spatiotemporal Relation Networks for Video Action Recognition
    Liu, Zheng
    Hu, Haifeng
    IEEE ACCESS, 2019, 7 : 14969 - 14976
  • [7] Spatiotemporal Fusion Networks for Video Action Recognition
    Zheng Liu
    Haifeng Hu
    Junxuan Zhang
    Neural Processing Letters, 2019, 50 : 1877 - 1890
  • [8] DC3D: A Video Action Recognition Network Based on Dense Connection
    Mu, Xiaofang
    Liu, Zhenyu
    Liu, Jiaji
    Li, Hao
    Li, Yue
    Li, Yikun
    2022 TENTH INTERNATIONAL CONFERENCE ON ADVANCED CLOUD AND BIG DATA, CBD, 2022, : 133 - 138
  • [9] Spatiotemporal Saliency Representation Learning for Video Action Recognition
    Kong, Yongqiang
    Wang, Yunhong
    Li, Annan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1515 - 1528
  • [10] Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition
    Tu, Zhigang
    Li, Hongyan
    Zhang, Dejun
    Dauwels, Justin
    Li, Baoxin
    Yuan, Junsong
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (06) : 2799 - 2812