MUP: Multi-granularity Unified Perception for Panoramic Activity Recognition

Cited by: 3

Authors:

Cao, Meiqi [1]

Yan, Rui [2]

Shu, Xiangbo [1]

Zhang, Jiachao [3]

Wang, Jinpeng [4]

Xie, Guo-Sen [1]

Affiliations:

[1] Nanjing Univ Sci & Technol, Nanjing, Peoples R China

[2] Nanjing Univ, Nanjing, Peoples R China

[3] Nanjing Inst Technol, Nanjing, Peoples R China

[4] Natl Univ Singapore, Singapore, Singapore

Source:

Proceedings of the 31st ACM International Conference on Multimedia, MM 2023 | 2023

Funding:

National Key R&D Program of China; National Natural Science Foundation of China; China Postdoctoral Science Foundation;

Keywords:

Action Recognition; Semantic Aggregation; Hierarchical Learning;

DOI:

10.1145/3581783.3612435

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Panoramic activity recognition requires jointly identifying multi-granularity human behaviors, including individual actions, group activities, and global activities, in multi-person videos. Previous methods encode these behaviors hierarchically through multiple stages, which disturbs the inherent co-occurrence across multi-granularity behaviors in the same scene. To this end, we propose a novel Multi-granularity Unified Perception (MUP) framework that perceives behaviors of different granularities universally, exploring their co-occurring motion patterns via the same parameters in an end-to-end fashion. Specifically, the proposed framework stacks three Unified Motion Encoding (UME) blocks that model multiple granularities of behavior with shared parameters. Each UME block mines intra-relevant and cross-relevant semantics synchronously from input feature sequences via Intra-granularity Motion Embedding (IME) and Cross-granularity Motion Prototyping (CMP). In particular, IME models the interactions among visual features within each granularity using the attention mechanism, while CMP aggregates features across different granularities (e.g., person to group) via several learnable prototypes. Extensive experiments demonstrate that MUP outperforms the state-of-the-art methods on JRDB-PAR and has satisfactory interpretability.
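The abstract gives only a high-level description of the UME block, so the following NumPy sketch is an illustration under stated assumptions, not the authors' implementation. It assumes IME is a single-head self-attention with a residual connection, CMP is a soft assignment of input features to a fixed number of prototypes, and the three blocks are chained person to group to global with one shared parameter set. All function names, dimensions, the random (untrained) weights, and the final mean pooling are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def intra_granularity_embedding(feats, w_q, w_k, w_v):
    # Assumed form of IME: single-head self-attention over the
    # features of one granularity, with a residual connection.
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(feats.shape[-1]))
    return feats + attn @ v

def cross_granularity_prototyping(feats, prototypes):
    # Assumed form of CMP: each prototype softly attends to the
    # finer-granularity features and pools them into one coarser feature.
    assign = softmax(prototypes @ feats.T / np.sqrt(feats.shape[-1]))
    return assign @ feats

def ume_block(feats, params):
    # One hypothetical UME block: refine within the granularity,
    # then aggregate into the next (coarser) granularity.
    refined = intra_granularity_embedding(
        feats, params["w_q"], params["w_k"], params["w_v"])
    coarser = cross_granularity_prototyping(refined, params["prototypes"])
    return refined, coarser

rng = np.random.default_rng(0)
D, N, K = 16, 6, 3  # feature dim, persons, prototypes (all illustrative)

# A single parameter set reused by all three blocks (shared parameters).
params = {
    "w_q": rng.normal(scale=0.1, size=(D, D)),
    "w_k": rng.normal(scale=0.1, size=(D, D)),
    "w_v": rng.normal(scale=0.1, size=(D, D)),
    "prototypes": rng.normal(size=(K, D)),
}

person_feats = rng.normal(size=(N, D))
person_out, group_in = ume_block(person_feats, params)   # individual level
group_out, global_in = ume_block(group_in, params)       # group level
global_out, _ = ume_block(global_in, params)             # global level
scene_feat = global_out.mean(axis=0)  # assumed pooling to one scene vector
```

In this sketch the same `params` dictionary is passed to every block, which is one simple way to read "shared parameters"; the actual model may share weights differently and would learn them end to end.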


Pages: 7666-7675

Page count: 10
