KFC: An Efficient Framework for Semi-Supervised Temporal Action Localization

Cited by: 18
Authors
Ding, Xinpeng [1]
Wang, Nannan [2]
Gao, Xinbo [3]
Li, Jie [1]
Wang, Xiaoyu [4]
Liu, Tongliang [5]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Sch Elect Engn, Xian 710071, Shaanxi, Peoples R China
[2] Xidian Univ, Sch Telecommun Engn, State Key Lab Integrated Serv Networks, Xian 710071, Shaanxi, Peoples R China
[3] Chongqing Univ Posts & Telecommun, Chongqing Key Lab Image Cognit, Chongqing 400065, Peoples R China
[4] Chinese Univ Hong Kong, Sch Sci & Engn, Shenzhen 518172, Peoples R China
[5] Univ Sydney, Fac Engn, Sch Comp Sci, Trustworthy Machine Learning Lab, Sydney, NSW 2006, Australia
Funding
National Natural Science Foundation of China;
Keywords
Perturbation methods; Location awareness; Feature extraction; Training; Annotations; Semisupervised learning; Semantics; Temporal action localization; semi-supervised learning; video understanding;
DOI
10.1109/TIP.2021.3099407
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), which encourages models to be robust to perturbed inputs, have achieved great success in image classification problems. The success of CR depends largely on the perturbations, where instances are perturbed to train a robust model without altering their semantic information. However, the perturbations designed for image or video classification tasks are not suitable for TAL, because videos in TAL are too long for the model to be trained on raw videos in an end-to-end manner. In this paper, we devise a method named K-farthest crossover to construct perturbations based on video features and apply it to TAL. Motivated by the observation that features in the same action instance become more and more similar during the training process, while those in different action instances or backgrounds become more and more divergent, we add perturbations to each feature along the temporal axis and adopt CR to encourage the model to preserve this property. Specifically, for a feature, we first find the top-k dissimilar features and average them to form a perturbation. Then, similar to chromosomal crossover, we recombine a large part of the feature with a small part of the perturbation to form a perturbed feature, which preserves the feature's semantics yet introduces enough discrepancy.
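The two steps described in the abstract (average the top-k most dissimilar features, then swap a small part of that average into the feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how dissimilarity is measured or which part of the feature is exchanged, so cosine similarity, a random choice of channels, and the names `k_farthest_crossover`, `k`, and `swap_ratio` are all assumptions made here for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_farthest_crossover(features, k=2, swap_ratio=0.25, rng=None):
    """Return a perturbed copy of a sequence of per-snippet features.

    features   : list of T feature vectors (lists of floats, same length)
    k          : number of most dissimilar features to average (assumed name)
    swap_ratio : fraction of channels taken from the perturbation (assumed)
    """
    rng = rng or random.Random(0)
    dim = len(features[0])
    n_swap = int(dim * swap_ratio)
    perturbed = []
    for t, feat in enumerate(features):
        # Rank all other time steps from least to most similar.
        # Assumption: dissimilarity is measured by cosine similarity.
        others = [j for j in range(len(features)) if j != t]
        others.sort(key=lambda j: cosine_similarity(feat, features[j]))
        farthest = others[:k]
        if not farthest:
            # A single-feature sequence has nothing to cross over with.
            perturbed.append(list(feat))
            continue
        # Step 1: average the k farthest features to form the perturbation.
        perturbation = [
            sum(features[j][d] for j in farthest) / len(farthest)
            for d in range(dim)
        ]
        # Step 2: crossover. Keep most channels of the feature and take a
        # small, randomly chosen subset from the perturbation (assumption).
        swap_idx = set(rng.sample(range(dim), n_swap))
        perturbed.append(
            [perturbation[d] if d in swap_idx else feat[d] for d in range(dim)]
        )
    return perturbed


# Toy sequence: two snippets from one "action", two from another.
toy = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
]
toy_perturbed = k_farthest_crossover(toy, k=2, swap_ratio=0.25)
```

In a consistency-regularization setup, the model's predictions on `toy` and on `toy_perturbed` would then be encouraged to agree; the form of that loss is not given in the abstract and is left out of the sketch.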
Pages: 6869-6878
Number of pages: 10