Enhancing action discrimination via category-specific frame clustering for weakly-supervised temporal action localization

Cited by: 0
Authors
Xia, Huifen [1 ,3 ]
Zhan, Yongzhao [1 ,2 ]
Liu, Honglin [1 ]
Ren, Xiaopeng [1 ]
Affiliations
[1] Jiangsu Univ, Sch Comp Sci & Commun Engn, Zhenjiang 212013, Peoples R China
[2] Jiangsu Engn Res Ctr Big Data Ubiquitous Percept &, Zhenjiang 212013, Peoples R China
[3] Changzhou Vocat Inst Mechatron Technol, Changzhou 213164, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Weakly supervised; Temporal action localization; Single-frame annotation; Category-specific; Action discrimination; DISTILLATION;
DOI
10.1631/FITEE.2300024
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Temporal action localization (TAL) is the task of detecting the start and end timestamps of action instances in an untrimmed video and classifying them. As the number of action categories per video increases, existing weakly-supervised TAL (W-TAL) methods that rely only on video-level labels cannot provide sufficient supervision, so single-frame supervision has attracted the interest of researchers. Existing paradigms model single-frame annotations from the perspective of video snippet sequences; they neglect the action discrimination of the annotated frames and pay insufficient attention to the correlations among annotated frames of the same category. Within a given category, the annotated frames exhibit distinctive appearance characteristics or clear action patterns. Thus, a novel method that enhances action discrimination via category-specific frame clustering for W-TAL is proposed. Specifically, the K-means clustering algorithm is employed to aggregate the annotated discriminative frames of the same category, and the aggregated results are regarded as exemplars that represent the characteristics of the action category. Class activation scores are then obtained by computing the similarities between a frame and the exemplars of the various categories. This category-specific representation modeling provides complementary guidance to the snippet sequence modeling in the mainline. On this basis, a convex combination fusion mechanism for annotated frames and snippet sequences is presented to enhance the consistency of action discrimination, which generates a robust class activation sequence for precise action classification and localization. Owing to the supplementary guidance that this action-discrimination enhancement provides for video snippet sequence modeling, our method outperforms existing single-frame-annotation-based methods. Experiments conducted on three datasets (THUMOS14, GTEA, and BEOID) show that our method achieves high localization performance compared with state-of-the-art methods.
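The abstract outlines three concrete steps: per-category K-means clustering of the annotated frames to obtain exemplars, class activation scoring by measuring the similarity between frames and the exemplars of each category, and a convex combination fusion of the exemplar-based scores with the snippet-sequence branch. The sketch below is a minimal Python reading of those steps, not the authors' implementation: the helper names (build_exemplars, exemplar_cas, fuse_cas), the use of cluster centers as exemplars, cosine similarity, and the fusion weight alpha are all illustrative assumptions.

```python
# Minimal sketch of the category-specific exemplar pipeline described in the
# abstract; names, similarity measure, and weights are assumptions.
import numpy as np
from sklearn.cluster import KMeans


def build_exemplars(labeled_feats, labels, num_classes, k=4, seed=0):
    """Cluster the annotated single-frame features of each category with K-means;
    the cluster centers are kept as that category's exemplars (an assumption)."""
    exemplars = []
    for c in range(num_classes):
        feats_c = labeled_feats[labels == c]     # annotated frames of class c, shape (N_c, D)
        k_c = min(k, len(feats_c))               # assumes every class has at least one annotation
        km = KMeans(n_clusters=k_c, n_init=10, random_state=seed).fit(feats_c)
        exemplars.append(km.cluster_centers_)    # (k_c, D) exemplars for class c
    return exemplars


def exemplar_cas(frame_feats, exemplars):
    """Class activation scores from cosine similarity between every frame and the
    exemplars of each category, taking the best match per category."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    scores = []
    for ex in exemplars:
        e = ex / (np.linalg.norm(ex, axis=1, keepdims=True) + 1e-8)
        scores.append((f @ e.T).max(axis=1))     # (T,) highest exemplar similarity per frame
    return np.stack(scores, axis=1)              # (T, C) exemplar-based class activation scores


def fuse_cas(cas_main, cas_exemplar, alpha=0.6):
    """Convex combination of the snippet-sequence scores and the exemplar scores;
    alpha is an assumed fusion weight, not a value reported in the paper."""
    assert 0.0 <= alpha <= 1.0
    return alpha * cas_main + (1.0 - alpha) * cas_exemplar
```

In this reading, the fused class activation sequence leans on the snippet-sequence branch with weight alpha and on the category-specific exemplar branch with weight 1 - alpha; the paper's actual exemplar construction, similarity measure, and fusion weights may differ.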
Pages: 809-823
Page count: 15