Vision-Language Action Knowledge Learning for Semantic-Aware Action Quality Assessment

Cited by: 0
Authors
Xu, Huangbiao [1 ,2 ]
Ke, Xiao [1 ,2 ]
Li, Yuezhou [1 ,2 ]
Xu, Rui [1 ,2 ]
Wu, Huanqi [1 ,2 ]
Lin, Xiaofeng [1 ,2 ]
Guo, Wenzhong [1 ,2 ]
Affiliations
[1] Fuzhou Univ, Coll Comp & Data Sci, Fuzhou 350108, Peoples R China
[2] Minist Educ, Engn Res Ctr Big Data Intelligence, Beijing, Peoples R China
Source
COMPUTER VISION - ECCV 2024, PT XLII | 2025 / Vol. 15100
Funding
National Natural Science Foundation of China
Keywords
Action quality assessment; Vision-language pre-training; Semantic-aware learning; VIDEO; MODELS; NETWORK; SKILLS;
DOI
10.1007/978-3-031-72946-1_24
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action quality assessment (AQA) is a challenging vision task that requires discerning and quantifying subtle differences between actions of the same class. While recent research has made strides in creating fine-grained annotations for more precise analysis, existing methods focus primarily on coarse action segmentation, which limits the identification of discriminative action frames. To address this issue, we propose a Vision-Language Action Knowledge Learning approach for action quality assessment, along with a multi-grained alignment framework that captures different levels of action knowledge. In our framework, prior knowledge such as specialized terminology is embedded into video-level, stage-level, and frame-level representations via CLIP. We further propose a new semantic-aware collaborative attention module that prevents confusing interactions and preserves textual knowledge in cross-modal and cross-semantic spaces. Specifically, we leverage the powerful cross-modal knowledge of CLIP to embed textual semantics into image features, which then guide action spatio-temporal representations. Our approach can be plugged into existing AQA methods, with or without frame-wise annotations. Extensive experiments and ablation studies show that our approach achieves state-of-the-art performance on four public short- and long-term AQA benchmarks: FineDiving, MTL-AQA, JIGSAWS, and Fis-V.
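The abstract's core idea of embedding textual semantics into per-frame visual features can be illustrated with a minimal cross-attention sketch. This is an assumption-laden toy, not the paper's implementation: the function name `semantic_cross_attention`, the residual fusion, and the use of plain NumPy in place of CLIP encoders are all illustrative choices; in the real system the frame and text features would come from CLIP's image and text encoders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_cross_attention(frame_feats, text_feats):
    """Toy sketch: frames attend over text tokens so that textual
    semantics are injected into each frame representation.

    frame_feats: (T, d) per-frame visual features (e.g. from a CLIP image encoder)
    text_feats:  (K, d) text embeddings of action terminology (e.g. from a CLIP text encoder)
    Returns:     (T, d) semantically enriched frame features.
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product attention: each frame weighs the K text tokens.
    attn = softmax(frame_feats @ text_feats.T / np.sqrt(d), axis=-1)
    # Residual fusion keeps the original visual signal intact.
    return frame_feats + attn @ text_feats
```

With zero text features the residual path leaves the visual features unchanged, which makes the fusion easy to sanity-check in isolation.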
Pages: 423-440
Page count: 18