Combination of temporal-channels correlation information and bilinear feature for action recognition

Cited by: 12

Authors:

Cai, Jiahui [1]

Hu, Jianguo [2,3]

Li, Shiren [1]

Lin, Jialing [1]

Wang, Jun [2]

Affiliations:

[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou, Peoples R China

[2] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai, Peoples R China

[3] Dev Res Inst Guangzhou Smart City, Guangzhou, Peoples R China

Source:

IET COMPUTER VISION | 2020 / Vol. 14 / Issue 08

Keywords:

Classification (of information) - Convolution;

DOI:

10.1049/iet-cvi.2020.0023

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this study, the authors focus on improving the spatio-temporal representation ability of three-dimensional (3D) convolutional neural networks (CNNs) in the video domain. They observe two unfavourable issues: (i) convolutional filters are dedicated only to learning local representations along the input channels, and they treat channel-wise features equally, without emphasising the important ones; (ii) the traditional global average pooling layer captures only first-order statistics, ignoring finer detail features that are useful for classification. To mitigate these problems, they propose two modules to boost the performance of 3D CNNs: a temporal-channel correlation (TCC) module and a bilinear pooling module. The TCC module captures inter-channel correlation information over the temporal domain. Moreover, it generates channel-wise dependencies that adaptively re-weight the channel-wise features, so the network can focus on learning important features. The bilinear pooling module captures more complex second-order statistics in deep features and generates a second-order classification vector. More accurate classification results are obtained by combining the first-order and second-order classification vectors. Extensive experiments show that adding the proposed modules to the I3D network consistently improves performance and outperforms state-of-the-art methods. The code and models are available at https://github.com/caijh33/I3D_TCC_Bilinear.
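The difference between first-order and second-order pooling, and the effect of channel re-weighting, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy feature values, and the fixed gate scores are hypothetical, and the signed square root with L2 normalisation is a post-processing step commonly used with bilinear pooling, assumed here rather than taken from the abstract. Plain Python lists are used so the example runs without dependencies.

```python
import math

def global_average_pool(features):
    """First-order statistic: mean of each channel over all positions."""
    return [sum(ch) / len(ch) for ch in features]

def bilinear_pool(features):
    """Second-order statistic: averaged outer product between channels.

    features is a list of C channels, each a list of N values
    (N = flattened time x height x width positions).
    Returns a flattened C*C vector after signed square root and
    L2 normalisation (assumed post-processing, see lead-in).
    """
    n = len(features[0])
    gram = [
        sum(a * b for a, b in zip(ch_i, ch_j)) / n
        for ch_i in features
        for ch_j in features
    ]
    signed_sqrt = [math.copysign(math.sqrt(abs(v)), v) for v in gram]
    norm = math.sqrt(sum(v * v for v in signed_sqrt)) or 1.0
    return [v / norm for v in signed_sqrt]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reweight_channels(features, gates):
    """Scale each channel by a gate in (0, 1), as channel attention would."""
    return [[g * v for v in ch] for g, ch in zip(gates, features)]

# Toy input: 2 channels, 4 positions (hypothetical values).
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

# Hypothetical channel scores; in a trained network these would be
# learned from the features instead of being fixed by hand.
gates = [sigmoid(s) for s in (2.0, -2.0)]

weighted = reweight_channels(features, gates)
first_order = global_average_pool(weighted)   # length C
second_order = bilinear_pool(weighted)        # length C * C
```

Here `first_order` holds one mean per channel, while `second_order` also encodes how pairs of channels co-activate, which is the extra information a second-order classification vector would be built from.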


Pages: 634-641

Page count: 8

