Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment

Cited by: 0
Authors
Xu, Huangbiao [1 ,2 ]
Ke, Xiao [1 ,2 ]
Li, Yuezhou [1 ,2 ]
Xu, Rui [1 ,2 ]
Wu, Huanqi [1 ,2 ]
Lin, Xiaofeng [1 ,2 ]
Guo, Wenzhong [1 ,2 ]
Affiliations
[1] Fuzhou Univ, Coll Comp & Data Sci, Fuzhou 350108, Peoples R China
[2] Minist Educ, Engn Res Ctr Big Data Intelligence, Beijing, Peoples R China
Source
COMPUTER VISION - ECCV 2024, PT XLII | 2025 / Vol. 15100
Funding
National Natural Science Foundation of China
Keywords
Action quality assessment; Vision-language pre-training; Semantic-aware learning; VIDEO; MODELS; NETWORK; SKILLS;
DOI
10.1007/978-3-031-72946-1_24
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action quality assessment (AQA) is a challenging vision task that requires discerning and quantifying subtle differences between actions of the same class. While recent research has made strides in creating fine-grained annotations for more precise analysis, existing methods focus primarily on coarse action segmentation, which limits the identification of discriminative action frames. To address this issue, we propose a Vision-Language Action Knowledge Learning approach for action quality assessment, along with a multi-grained alignment framework that captures different levels of action knowledge. In our framework, prior knowledge such as specialized terminology is embedded into video-level, stage-level, and frame-level representations via CLIP. We further propose a new semantic-aware collaborative attention module that prevents confusing interactions and preserves textual knowledge in cross-modal and cross-semantic spaces. Specifically, we leverage the powerful cross-modal knowledge of CLIP to embed textual semantics into image features, which then guide action spatio-temporal representations. Our approach can be plugged into existing AQA methods, with or without frame-wise annotations. Extensive experiments and ablation studies show that our approach achieves state-of-the-art performance on four public short- and long-term AQA benchmarks: FineDiving, MTL-AQA, JIGSAWS, and Fis-V.
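The abstract's core idea of embedding textual semantics into per-frame visual features can be illustrated with a minimal cross-attention sketch. This is an assumption-laden toy, not the paper's implementation: the function name `semantic_cross_attention`, the residual fusion, and the use of plain NumPy in place of CLIP encoders are all illustrative choices; in the real system the frame and text features would come from CLIP's image and text encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_cross_attention(frame_feats, text_feats):
    """Toy sketch: frames attend over text tokens so that textual
    semantics are injected into each frame representation.

    frame_feats: (T, d) per-frame visual features (e.g. from a CLIP image encoder)
    text_feats:  (K, d) text embeddings of action terminology (e.g. from a CLIP text encoder)
    Returns:     (T, d) semantically enriched frame features.
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product attention: each frame weighs the K text tokens.
    attn = softmax(frame_feats @ text_feats.T / np.sqrt(d), axis=-1)
    # Residual fusion keeps the original visual signal intact.
    return frame_feats + attn @ text_feats
```

With zero text features the residual path leaves the visual features unchanged, which makes the fusion easy to sanity-check in isolation.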
Pages: 423-440
Page count: 18