Coarse-Fine Nested Network for Weakly Supervised Group Activity Recognition

Citations: 0
Authors
Ge, Xiaojing [1, 2]
Yan, Rui [3]
Shu, Xiangbo [4]
Chen, Keke [4]
Tian, Wei [5]
Xie, Guo-Sen [4]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Anhui Jianzhu Univ, Anhui Prov Key Lab Intelligent Bldg & Bldg Energy, Hefei 230009, Peoples R China
[3] Nanjing Univ, Dept Comp Sci & Technol, Nanjing 210023, Peoples R China
[4] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[5] South Cent Minzu Univ, Sch Comp Sci, Wuhan 430074, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Feature extraction; Visualization; Spatiotemporal phenomena; Transformers; Detectors; Activity recognition; Data mining; Attention; transformer; video understanding; weakly supervised group activity recognition (WSGAR);
DOI
10.1109/TNNLS.2024.3401608
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Weakly supervised group activity recognition (WSGAR) aims to identify the overall behavior of multiple persons without any fine-grained supervision (individual positions or action labels). Traditional methods usually follow a person-to-whole pipeline: detect persons with off-the-shelf detectors, extract person-level features, and aggregate them into group-level features for training the classifier. However, these methods are inflexible because they rely heavily on detector quality. To remove the detector, recent works learn several prototype tokens directly from noisy grid features with learnable weights; this treats all local visual information equally and introduces redundant and ambiguous information to some extent. To address this, we propose a novel coarse-fine nested network (CFNN) that coarsely localizes the key visual patches of an activity and then finely learns the local features as well as the global features. Specifically, we design a nested interactor (NI) to progressively model the spatiotemporal interactions of the learnable global token. Guided by the spatial-interaction cue in NI, we localize several key visual patches via a new coarse-grained spatial localizer (CSL). We then encode these localized visual patches with the help of the global spatiotemporal dependency via a new fine-grained spatiotemporal selector (FSS). Extensive experiments on the Volleyball and NBA datasets demonstrate the effectiveness of the proposed CFNN compared with existing competitive methods.
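The coarse localization idea in the abstract (pick a few key patches using the global token's spatial interaction) can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that patches are ranked by scaled dot-product attention between the global token and each patch feature and that the top k are kept. The function name select_key_patches, the toy vectors, and the value of k are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_patches(global_token, patch_features, k):
    """Rank patches by attention to the global token and keep the top k.

    Assumption: scaled dot-product attention stands in for the
    'spatial interaction cue'; the paper's actual CSL may differ.
    """
    d = len(global_token)
    scores = [
        sum(g * p for g, p in zip(global_token, patch)) / math.sqrt(d)
        for patch in patch_features
    ]
    weights = softmax(scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    # Return the kept indices in spatial order, plus all attention weights.
    return sorted(ranked[:k]), weights


# Toy example: one 2-D global token and four 2-D patch features.
global_token = [1.0, 0.0]
patches = [[0.1, 0.9], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
key_idx, weights = select_key_patches(global_token, patches, k=2)
print(key_idx)
```

In this sketch only the selected patch indices would be passed on to a finer encoding stage, so patches with little attention from the global token are dropped instead of being treated equally.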
Pages: 7103-7115
Page count: 13