DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition

Cited by: 4
Authors
Fan, Jiawei [1 ,2 ]
Zhao, Yu [1 ]
Yu, Xie [1 ,2 ]
Ma, Lihua [1 ]
Liu, Junqi [1 ]
Yi, Fangqiu [1 ]
Li, Boxun [1 ]
Affiliations
[1] MEGVII Technology, Beijing, People's Republic of China
[2] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video Action Recognition; Information Bottleneck Principle
DOI
10.1145/3503161.3548326
CLC Number
TP39 [Computer Applications]
Discipline Code
081203; 0835
Abstract
An optimal representation should contain maximum task-relevant information and minimum task-irrelevant information, as revealed by the Information Bottleneck principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context, yet they still suffer from poor generalization. In this paper, we propose a moderate, optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck principle, to learn an effective and generalizable video representation without sacrificing model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which helps extract sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to both temporal modeling and data augmentation, and it yields general improvements over both model-based and data-based approaches; (2) DTR is effective across 7 different datasets, especially the motion-centric datasets SSv1/SSv2, on which DTR achieves 6%/3.8% absolute gains in top-1 accuracy.
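For context, the Information Bottleneck principle invoked in the abstract is usually formalized as the following trade-off (the standard Tishby-style objective, given here as background rather than as an equation taken from this paper): learn a representation Z of the input X that is maximally informative about the label Y while compressing X,

    % Standard Information Bottleneck objective (background, not from the paper):
    % maximize task-relevant information I(Z;Y) while penalizing
    % task-irrelevant information I(Z;X), traded off by \beta > 0.
    \max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X)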
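Purely as an illustrative sketch of the two ideas the abstract names (hypothetical code, not the authors' DTR implementation; all function names, shapes, and the penalty forms are assumptions), a frame-difference term in the spirit of TR and a dual-view consistency term in the spirit of DR could look like this in PyTorch:

    # Illustrative sketch only -- NOT the paper's DTR implementation.
    # Assumes clips shaped (batch, time, channels, height, width).
    import torch

    def temporal_difference(clip: torch.Tensor) -> torch.Tensor:
        # Frame-to-frame differences: a crude proxy for the motion
        # information that Temporal Regularization (TR) aims to preserve.
        return clip[:, 1:] - clip[:, :-1]

    def dual_view_consistency(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Hypothetical consistency penalty between two augmented views of
        # the same clip, encouraging the model to discard view-specific
        # (task-irrelevant) information, in the spirit of Dual-view
        # Regularization (DR).
        return torch.mean((feat_a - feat_b) ** 2)

    # Toy usage: two noisy "views" of one 8-frame RGB clip.
    clip = torch.randn(2, 8, 3, 32, 32)
    view_a = clip + 0.1 * torch.randn_like(clip)
    view_b = clip
    motion = temporal_difference(clip)           # shape (2, 7, 3, 32, 32)
    reg = dual_view_consistency(view_a, view_b)  # scalar regularizer
    print(motion.shape, reg.item())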
Pages: 3877-3885
Number of pages: 9