MiniROAD: Minimal RNN Framework for Online Action Detection

Cited by: 11

Authors:

An, Joungbin [1]

Kang, Hyolim [1]

Han, Su Ho [1]

Yang, Ming-Hsuan [1,2,3]

Kim, Seon Joo [1]

Affiliations:

[1] Yonsei Univ, Seoul, South Korea

[2] UC Merced, Merced, CA USA

[3] Google Res, Mountain View, CA USA

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023

DOI:

10.1109/ICCV51070.2023.00949

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Online Action Detection (OAD) is the task of identifying actions in streaming videos without access to future frames. Much effort has been devoted to effectively capturing long-range dependencies, with transformers receiving the spotlight for their ability to capture long-range temporal structures. In contrast, RNNs have received less attention lately, due to their lower performance compared to recent methods that utilize transformers. In this paper, we investigate the underlying reasons for the inferior performance of RNNs compared to transformer-based algorithms. Our findings indicate that the discrepancy between training and inference is the primary hindrance to the effective training of RNNs. To address this, we propose applying non-uniform weights to the loss computed at each time step, which allows the RNN model to learn from the predictions made in an environment that better resembles the inference stage. Extensive experiments on three benchmark datasets, THUMOS, TVSeries, and FineAction, demonstrate that a minimal RNN-based model trained with the proposed methodology performs equally or better than the existing best methods with a significant increase in efficiency. The code is available at https://github.com/jbistanbul/MiniROAD.
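
The idea of weighting the per-time-step loss non-uniformly can be illustrated with a minimal sketch. This record does not state the actual weighting scheme, so the linearly increasing schedule, the normalization by the weight sum, and the function names (`linear_step_weights`, `weighted_sequence_loss`) are assumptions made for illustration, not the authors' implementation. The sketch only shows how later time steps, whose predictions rest on a longer history and so plausibly resemble inference more closely, can be made to count more in the training loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_step_weights(num_steps):
    """Hypothetical schedule: weights grow linearly from 1/T to 1.

    The schedule used in the paper is not given in this record;
    a linear ramp is assumed here purely for illustration.
    """
    return [(t + 1) / num_steps for t in range(num_steps)]


def weighted_sequence_loss(logits_per_step, labels, weights):
    """Weighted mean of per-time-step cross-entropy losses.

    logits_per_step: list of T lists of class scores (one list per step)
    labels:          list of T integer class indices
    weights:         list of T non-negative per-step weights
    """
    if not (len(logits_per_step) == len(labels) == len(weights)):
        raise ValueError("logits, labels and weights must have equal length")
    total = 0.0
    for logits, label, weight in zip(logits_per_step, labels, weights):
        prob_of_label = softmax(logits)[label]
        total += weight * -math.log(prob_of_label)
    # Normalizing by the weight sum is an assumption that keeps the
    # loss scale comparable to an unweighted mean.
    return total / sum(weights)


if __name__ == "__main__":
    confident_right = [5.0, 0.0]  # strongly predicts class 0
    confident_wrong = [0.0, 5.0]  # strongly predicts class 1
    labels = [0, 0, 0]
    weights = linear_step_weights(3)

    early_mistake = [confident_wrong, confident_right, confident_right]
    late_mistake = [confident_right, confident_right, confident_wrong]

    print("early mistake:", weighted_sequence_loss(early_mistake, labels, weights))
    print("late mistake: ", weighted_sequence_loss(late_mistake, labels, weights))
```

Under this assumed schedule, the same mistake is penalized more heavily when it occurs at the last time step than at the first, whereas uniform weights would treat both sequences identically.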

Pages: 10307-10316

Page count: 10

