MUP: Multi-granularity Unified Perception for Panoramic Activity Recognition

Cited by: 3
Authors
Cao, Meiqi [1]
Yan, Rui [2]
Shu, Xiangbo [1]
Zhang, Jiachao [3]
Wang, Jinpeng [4]
Xie, Guo-Sen [1]
Affiliations
[1] Nanjing Univ Sci & Technol, Nanjing, Peoples R China
[2] Nanjing Univ, Nanjing, Peoples R China
[3] Nanjing Inst Technol, Nanjing, Peoples R China
[4] Natl Univ Singapore, Singapore, Singapore
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
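Funding
National Key Research and Development Program of China; National Natural Science Foundation of China; China Postdoctoral Science Foundation;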
Keywords
Action Recognition; Semantic Aggregation; Hierarchical Learning;
DOI
10.1145/3581783.3612435
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Panoramic activity recognition requires jointly identifying multi-granularity human behaviors, including individual actions, group activities, and global activities, in multi-person videos. Previous methods encode these behaviors hierarchically through multiple stages, which disturbs the inherent co-occurrence across multi-granularity behaviors in the same scene. To this end, we propose a novel Multi-granularity Unified Perception (MUP) framework that perceives behaviors of different granularities in a unified manner, exploring the co-occurrence motion pattern via the same parameters in an end-to-end fashion. Specifically, the proposed framework stacks three Unified Motion Encoding (UME) blocks for modeling multiple granularity behaviors with shared parameters. Each UME block mines intra-relevant and cross-relevant semantics synchronously from input feature sequences via Intra-granularity Motion Embedding (IME) and Cross-granularity Motion Prototyping (CMP). In particular, IME models the interactions among visual features within each granularity based on the attention mechanism, while CMP aggregates features across different granularities (e.g., person to group) via several learnable prototypes. Extensive experiments demonstrate that MUP outperforms the state-of-the-art methods on JRDB-PAR and has satisfactory interpretability.
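The data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ime_step`, `cmp_step`, `ume_block`), the feature dimension, the number of prototypes, the residual connection, the absence of learned projection matrices, and the way the three blocks are chained are all assumptions made for illustration. The sketch only shows attention within one granularity, followed by prototype-based pooling into a coarser granularity, with the same prototypes reused at every level.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def ime_step(features):
    """Intra-granularity step (assumed form): self-attention plus a residual."""
    mixed = attend(features, features, features)
    return [[f + m for f, m in zip(feat, mix)] for feat, mix in zip(features, mixed)]

def cmp_step(features, prototypes):
    """Cross-granularity step (assumed form): each prototype acts as a query
    and pools the finer-grained features into one coarser token."""
    return attend(prototypes, features, features)

def ume_block(features, prototypes):
    """One block: refine within a granularity, then aggregate to the next one."""
    refined = ime_step(features)
    coarser = cmp_step(refined, prototypes)
    return refined, coarser

random.seed(0)
DIM, NUM_PERSONS, NUM_PROTOTYPES = 8, 6, 3  # hypothetical sizes

# Randomly initialised stand-ins for learnable prototypes and person features.
prototypes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PROTOTYPES)]
persons = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PERSONS)]

# Three blocks applied with the same prototypes, i.e. shared parameters.
person_feats, group_in = ume_block(persons, prototypes)
group_feats, global_in = ume_block(group_in, prototypes)
global_feats, _ = ume_block(global_in, prototypes)

print(len(person_feats), len(group_feats), len(global_feats))
```

In this sketch the number of prototypes fixes how many coarser tokens each level produces; a trained model would learn the prototypes and add projection layers, normalization, and classification heads, none of which are shown here.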
Pages: 7666-7675
Number of pages: 10