TRANSTL: SPATIAL-TEMPORAL LOCALIZATION TRANSFORMER FOR MULTI-LABEL VIDEO CLASSIFICATION

Cited by: 4
Authors
Wu, Hongjun [1 ]
Li, Mengzhu [1 ]
Liu, Yongcheng [2 ]
Liu, Hongzhe [1 ]
Xu, Cheng [1 ]
Li, Xuewei [1 ]
Affiliations
[1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Funding
National Natural Science Foundation of China;
Keywords
Multi-label Video Classification; Label Co-occurrence Dependency; Spatial Temporal Label Dependency; Transformer;
DOI
10.1109/ICASSP43922.2022.9747849
CLC (Chinese Library Classification)
O42 [Acoustics];
Discipline code
070206 ; 082403 ;
Abstract
Multi-label video classification (MLVC) is a long-standing and challenging problem in video signal analysis. Real-world videos typically contain many complex action labels, and these actions have inherent dependencies in both the spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for the MLVC task. In addition to leveraging global action-label co-occurrence, we propose a novel plug-and-play Spatial Temporal Label Dependency (STLD) layer in TranSTL. STLD not only dynamically models the label co-occurrence in a video via a self-attention mechanism, but also fully captures spatial-temporal label dependencies via a cross-attention strategy. As a result, TranSTL is able to explicitly and accurately grasp the diverse action labels in both the spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL outperforms state-of-the-art methods on two challenging benchmarks, Charades and MultiTHUMOS.
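The two attention stages the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the shapes, variable names, and plain scaled dot-product attention in NumPy are all assumptions. Label embeddings first attend to each other (modeling co-occurrence), and the result then attends over spatio-temporal video features (modeling spatial-temporal label dependency):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Hypothetical sizes: L action labels, T*S spatio-temporal feature tokens, dim d
L, TS, d = 5, 8, 16
rng = np.random.default_rng(0)
labels = rng.standard_normal((L, d))   # learned label embeddings (assumed)
feats = rng.standard_normal((TS, d))   # backbone video features (assumed)

# Stage 1 -- self-attention among labels: models label co-occurrence
co = attention(labels, labels, labels)   # shape (L, d)

# Stage 2 -- cross-attention from labels to video features:
# captures spatial-temporal label dependencies
std = attention(co, feats, feats)        # shape (L, d)
print(std.shape)  # (5, 16)
```

In this sketch a per-label classifier head (not shown) would map each of the `L` refined embeddings to a label score; how TranSTL combines the two stages inside the STLD layer is more involved than this two-call pipeline.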
Pages: 1965-1969
Number of pages: 5