POLO: Learning Explicit Cross-Modality Fusion for Temporal Action Localization

Cited by: 23
Authors
Wang, Binglu [1 ]
Yang, Le [1 ]
Zhao, Yongqiang [1 ]
Affiliations
[1] Northwestern Polytechnical University, School of Automation, Xi'an 710072, People's Republic of China
Keywords
Videos; location awareness; convolution; training; feature extraction; task analysis; kernel; feature fusion; frame-wise attention; mutual attention; temporal action localization; network
DOI
10.1109/LSP.2021.3061289
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Temporal action localization aims to discover action instances in untrimmed videos, where RGB and optical flow are two widely used feature modalities: RGB chiefly reveals appearance, while flow mainly depicts motion. Given RGB and flow features, previous methods adopt either the early fusion or the late fusion paradigm to mine the complementarity between them. By concatenating raw RGB and flow features, early fusion lets the network achieve complementarity implicitly, but it partly discards the particularity of each modality. Late fusion maintains two independent branches to preserve the particularity of each modality, but it fuses only the localization results, which is insufficient to mine the complementarity. In this work, we propose explicit cross-modality fusion (POLO) to effectively utilize the complementarity between the two modalities and thoroughly explore the particularity of each. POLO performs cross-modality fusion by estimating an attention weight from the RGB modality and applying it to the flow modality, and vice versa. This lets the complementary information of one modality guide the other. Assisted by the attention weights, POLO learns from RGB and flow features independently and explores the particularity of each modality. Extensive experiments on two benchmarks demonstrate the favorable performance of POLO.
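The abstract describes the fusion mechanism only at a high level: a frame-wise attention weight is estimated from one modality and applied to the other, while each branch keeps learning from its own features. Below is a minimal PyTorch sketch of such mutual frame-wise attention. The module name, the sigmoid-gated temporal convolution, the residual combination, and the feature sizes are all assumptions for illustration, not the paper's verified design.

```python
import torch
import torch.nn as nn

class CrossModalityFusion(nn.Module):
    """Mutual frame-wise attention between RGB and flow features.

    Each modality estimates a per-frame attention weight from its OWN
    features; that weight is then applied to the OTHER modality, so the
    complementary cues of one stream guide the other. Hypothetical
    sketch; the paper's exact architecture may differ.
    """

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Temporal 1-D conv maps features to one attention value per frame,
        # squashed into (0, 1) by a sigmoid.
        self.rgb_att = nn.Sequential(
            nn.Conv1d(dim, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )
        self.flow_att = nn.Sequential(
            nn.Conv1d(dim, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor):
        # rgb, flow: (batch, dim, frames)
        w_rgb = self.rgb_att(rgb)    # (batch, 1, frames), estimated from RGB
        w_flow = self.flow_att(flow) # (batch, 1, frames), estimated from flow
        # Cross application: RGB-derived weights modulate flow, and vice
        # versa; the residual term keeps each modality's own particularity.
        flow_fused = flow + flow * w_rgb
        rgb_fused = rgb + rgb * w_flow
        return rgb_fused, flow_fused

# Usage with snippet-level two-stream features (e.g., from an I3D backbone):
fusion = CrossModalityFusion(dim=1024)
rgb_feat = torch.randn(2, 1024, 100)
flow_feat = torch.randn(2, 1024, 100)
rgb_out, flow_out = fusion(rgb_feat, flow_feat)
print(rgb_out.shape, flow_out.shape)  # torch.Size([2, 1024, 100]) each
```

Unlike early fusion (concatenation) or late fusion (merging localization scores), this keeps two separate branches whose intermediate features are explicitly re-weighted by the opposite modality.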
Pages: 503-507 (5 pages)