DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition

Cited by: 4
Authors
Fan, Jiawei [1 ,2 ]
Zhao, Yu [1 ]
Yu, Xie [1 ,2 ]
Ma, Lihua [1 ]
Liu, Junqi [1 ]
Yi, Fangqiu [1 ]
Li, Boxun [1 ]
Affiliations
[1] MEGVII Technology, Beijing, People's Republic of China
[2] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Keywords
Video Action Recognition; Information Bottleneck Principle
DOI
10.1145/3503161.3548326
CLC Number
TP39 [Computer Applications]
Discipline Code
081203; 0835
Abstract
An optimal representation should contain maximum task-relevant information and minimum task-irrelevant information, as revealed by the Information Bottleneck principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context, yet they still suffer from poor generalization. In this paper, we propose a moderate, optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck principle, to learn an effective and generalizable video representation without sacrificing model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which helps extract sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to both temporal modeling and data augmentation, and it yields general improvements over both model-based and data-based approaches; (2) DTR is effective across 7 different datasets, especially the motion-centric datasets SSv1/SSv2, on which DTR achieves 6%/3.8% absolute gains in top-1 accuracy.
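For context, the Information Bottleneck principle invoked in the abstract is usually formalized as the following trade-off (the standard Tishby-style objective, given here as background rather than as an equation taken from this paper): learn a representation Z of the input X that is maximally informative about the label Y while compressing X,

    % Standard Information Bottleneck objective (background, not from the paper):
    % maximize task-relevant information I(Z;Y) while penalizing
    % task-irrelevant information I(Z;X), traded off by \beta > 0.
    \max_{p(z \mid x)} \; I(Z; Y) - \beta \, I(Z; X)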
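Purely as an illustrative sketch of the two ideas the abstract names (hypothetical code, not the authors' DTR implementation; all function names, shapes, and the penalty forms are assumptions), a frame-difference term in the spirit of TR and a dual-view consistency term in the spirit of DR could look like this in PyTorch:

    # Illustrative sketch only -- NOT the paper's DTR implementation.
    # Assumes clips shaped (batch, time, channels, height, width).
    import torch

    def temporal_difference(clip: torch.Tensor) -> torch.Tensor:
        # Frame-to-frame differences: a crude proxy for the motion
        # information that Temporal Regularization (TR) aims to preserve.
        return clip[:, 1:] - clip[:, :-1]

    def dual_view_consistency(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Hypothetical consistency penalty between two augmented views of
        # the same clip, encouraging the model to discard view-specific
        # (task-irrelevant) information, in the spirit of Dual-view
        # Regularization (DR).
        return torch.mean((feat_a - feat_b) ** 2)

    # Toy usage: two noisy "views" of one 8-frame RGB clip.
    clip = torch.randn(2, 8, 3, 32, 32)
    view_a = clip + 0.1 * torch.randn_like(clip)
    view_b = clip
    motion = temporal_difference(clip)           # shape (2, 7, 3, 32, 32)
    reg = dual_view_consistency(view_a, view_b)  # scalar regularizer
    print(motion.shape, reg.item())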
Pages: 3877-3885
Number of pages: 9