Action Quality Assessment with Temporal Parsing Transformer

Cited by: 42
Authors
Bai, Yang [1 ]
Zhou, Desen [2 ]
Zhang, Songyang [3 ]
Wang, Jian [2 ]
Ding, Errui [2 ]
Guan, Yu
Long, Yang [1 ]
Wang, Jingdong [2 ]
Affiliations
[1] Univ Durham, Dept Comp Sci, Durham, England
[2] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2022, PT IV | 2022 / Vol. 13664
Keywords
Action quality assessment; Temporal parsing transformer; Temporal patterns; Contrastive regression;
DOI
10.1007/978-3-031-19772-7_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we use a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we apply state-of-the-art contrastive regression to the part representations. Since existing AQA datasets provide neither temporal part-level labels nor partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
Pages: 422-438
Page count: 17
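The decoding scheme the abstract describes (a fixed set of learnable part queries that cross-attend to frame features, with a ranking loss pushing the queries' attention responses into temporal order) can be sketched roughly as follows. This is a minimal illustrative assumption, not the paper's actual implementation: the class name `TemporalPartDecoder`, the `ranking_loss` helper, and all dimensions are hypothetical, and the real model additionally uses a sparsity loss and contrastive regression that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPartDecoder(nn.Module):
    """Sketch: learnable part queries cross-attend to frame features,
    producing a fixed number of part-level representations."""
    def __init__(self, num_parts=4, dim=64):
        super().__init__()
        # One learnable query per atomic temporal pattern (part).
        self.queries = nn.Parameter(torch.randn(num_parts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, dim) frame-level features from a video backbone.
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        # parts: (B, num_parts, dim); attn_w: (B, num_parts, T)
        parts, attn_w = self.attn(q, frames, frames)
        return parts, attn_w

def ranking_loss(attn_w):
    """Hinge loss encouraging query k's attention to focus earlier in
    time than query k+1's, so parts come out temporally ordered."""
    T = attn_w.size(-1)
    pos = torch.arange(T, dtype=attn_w.dtype)
    # Attention center of mass along time for each query: (B, num_parts).
    centers = (attn_w * pos).sum(-1)
    # Penalize any pair of consecutive queries that is out of order.
    return F.relu(centers[:, :-1] - centers[:, 1:]).mean()
```

In this sketch the loss is self-supervised: it needs no part-level labels, which matches the abstract's motivation that existing AQA datasets provide no temporal partitions.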
References
35 records in total
[1] Alayrac, Jean-Baptiste; Sivic, Josef; Laptev, Ivan; Lacoste-Julien, Simon. Joint Discovery of Object States and Manipulation Actions. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2146-2155.
[2] Bertasius, Gedas; Park, Hyun Soo; Yu, Stella X.; Shi, Jianbo. Am I a Baller? Basketball Performance Assessment from First-Person Videos. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2196-2204.
[3] Carion, Nicolas; Massa, Francisco; Synnaeve, Gabriel; Usunier, Nicolas; Kirillov, Alexander; Zagoruyko, Sergey. End-to-End Object Detection with Transformers. Computer Vision - ECCV 2020, Pt I, 2020, 12346: 213-229.
[4] Carreira, Joao; Zisserman, Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733.
[5] Doughty, Hazel; Mayol-Cuevas, Walterio; Damen, Dima. The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 7854-7863.
[6] Doughty, Hazel; Damen, Dima; Mayol-Cuevas, Walterio. Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 6057-6066.
[7] Tran, Du; Bourdev, Lubomir; Fergus, Rob; Torresani, Lorenzo; Paluri, Manohar. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 4489-4497.
[8] Gao, Yan. POWER ELECT MOTION C, 2006, V1, P1.
[9] Gordon, A.S. P AI, 1995, V2.
[10] Jug, M. LECT NOTES COMPUT SC, 2003, V2626, P534.