Action Quality Assessment with Temporal Parsing Transformer

Cited by: 42
Authors
Bai, Yang [1 ]
Zhou, Desen [2 ]
Zhang, Songyang [3 ]
Wang, Jian [2 ]
Ding, Errui [2 ]
Guan, Yu
Long, Yang [1 ]
Wang, Jingdong [2 ]
Affiliations
[1] Univ Durham, Dept Comp Sci, Durham, England
[2] Baidu Inc, Dept Comp Vis Technol VIS, Beijing, Peoples R China
[3] Shanghai AI Lab, Shanghai, Peoples R China
Source
COMPUTER VISION - ECCV 2022, PT IV | 2022 / Vol. 13664
Keywords
Action quality assessment; Temporal parsing transformer; Temporal patterns; Contrastive regression;
DOI
10.1007/978-3-031-19772-7_25
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we use a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we apply state-of-the-art contrastive regression to the part representations. Since existing AQA datasets provide neither temporal part-level labels nor partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that ensures the learnable queries satisfy the temporal order in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
Pages: 422-438
Page count: 17
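The decoding scheme the abstract describes (a fixed set of learnable part queries that cross-attend to frame features, with a ranking loss pushing the queries' attention responses into temporal order) can be sketched roughly as follows. This is a minimal illustrative assumption, not the paper's actual implementation: the class name `TemporalPartDecoder`, the `ranking_loss` helper, and all dimensions are hypothetical, and the real model additionally uses a sparsity loss and contrastive regression that are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalPartDecoder(nn.Module):
    """Sketch: learnable part queries cross-attend to frame features,
    producing a fixed number of part-level representations."""
    def __init__(self, num_parts=4, dim=64):
        super().__init__()
        # One learnable query per atomic temporal pattern (part).
        self.queries = nn.Parameter(torch.randn(num_parts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, dim) frame-level features from a video backbone.
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        # parts: (B, num_parts, dim); attn_w: (B, num_parts, T)
        parts, attn_w = self.attn(q, frames, frames)
        return parts, attn_w

def ranking_loss(attn_w):
    """Hinge loss encouraging query k's attention to focus earlier in
    time than query k+1's, so parts come out temporally ordered."""
    T = attn_w.size(-1)
    pos = torch.arange(T, dtype=attn_w.dtype)
    # Attention center of mass along time for each query: (B, num_parts).
    centers = (attn_w * pos).sum(-1)
    # Penalize any pair of consecutive queries that is out of order.
    return F.relu(centers[:, :-1] - centers[:, 1:]).mean()
```

In this sketch the loss is self-supervised: it needs no part-level labels, which matches the abstract's motivation that existing AQA datasets provide no temporal partitions.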
References
35 records in total
[1] Alayrac, Jean-Baptiste; Sivic, Josef; Laptev, Ivan; Lacoste-Julien, Simon. Joint Discovery of Object States and Manipulation Actions. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2146-2155.
[2] Bertasius, Gedas; Park, Hyun Soo; Yu, Stella X.; Shi, Jianbo. Am I a Baller? Basketball Performance Assessment from First-Person Videos. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2196-2204.
[3] Carion, Nicolas; Massa, Francisco; Synnaeve, Gabriel; Usunier, Nicolas; Kirillov, Alexander; Zagoruyko, Sergey. End-to-End Object Detection with Transformers. Computer Vision - ECCV 2020, Pt I, 2020, 12346: 213-229.
[4] Carreira, Joao; Zisserman, Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733.
[5] Doughty, Hazel; Mayol-Cuevas, Walterio; Damen, Dima. The Pros and Cons: Rank-aware Temporal Attention for Skill Determination in Long Videos. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 7854-7863.
[6] Doughty, Hazel; Damen, Dima; Mayol-Cuevas, Walterio. Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 6057-6066.
[7] Tran, Du; Bourdev, Lubomir; Fergus, Rob; Torresani, Lorenzo; Paluri, Manohar. Learning Spatiotemporal Features with 3D Convolutional Networks. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 4489-4497.
[8] Gao, Yan. POWER ELECT MOTION C, 2006, V1, P1.
[9] Gordon, A.S. P AI, 1995, V2.
[10] Jug, M. LECT NOTES COMPUT SC, 2003, V2626, P534.