Procedure-Aware Action Quality Assessment: Datasets and Performance Evaluation

Cited by: 0
Authors
Xu, Jinglin [1, 2]
Rao, Yongming [2 ]
Zhou, Jie [2 ]
Lu, Jiwen [2 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Intelligence Sci & Technol, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Automat, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Action quality assessment; Fine-grained sports video dataset; Action procedure; Visual interpretability; Generalization ability;
DOI
10.1007/s11263-024-02146-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we investigate the problem of procedure-aware action quality assessment, which analyzes the action quality by delving into the semantic and spatial-temporal relationships among various composed steps of the action. Most existing action quality assessment methods regress on deep features of entire videos to learn diverse scores, which ignore the relationships among different fine-grained steps in actions and result in limitations in visual interpretability and generalization ability. To address these issues, we construct a fine-grained competitive sports video dataset called FineDiving with detailed semantic and temporal annotations, which helps understand the internal structures of each action. We also propose a new approach (i.e., spatial-temporal segmentation attention, STSA) that introduces procedure segmentation to parse an action into consecutive steps, learns powerful representations from these steps by constructing spatial motion attention and procedure-aware cross-attention, and designs a fine-grained contrastive regression to achieve an interpretable scoring mechanism. In addition, we build a benchmark on the FineDiving dataset to evaluate the performance of representative action quality assessment methods. Then, we expand FineDiving to FineDiving+ and construct three new benchmarks to investigate the transferable abilities between different diving competitions, between synchronized and individual dives, and between springboard and platform dives to demonstrate the generalization abilities of our STSA in unknown scenarios, scoring rules, action types, and difficulty degrees. Extensive experiments demonstrate that our approach, designed for procedure-aware action quality assessment, achieves substantial improvements. Our dataset and code are available at https://github.com/xujinglin/FineDiving.
Pages: 6069-6090
Number of pages: 22