Weakly-Supervised Temporal Action Alignment Driven by Unbalanced Spectral Fused Gromov-Wasserstein Distance

Cited by: 4
Authors
Luo, Dixin [1 ]
Wang, Yutong [1 ]
Yue, Angxiao [1 ]
Xu, Hongteng [2 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp Sci & Technol, Beijing, Peoples R China
[2] Renmin Univ China, Gaoling Sch Artif Intelligence, Beijing Key Lab Big Data Management & Anal Method, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
Temporal action alignment; weakly-supervised learning; computational optimal transport; autoencoding; contrastive learning;
DOI
10.1145/3503161.3548067
Chinese Library Classification (CLC)
TP39 [Computer Applications];
Discipline Classification Codes
081203; 0835;
Abstract
Temporal action alignment aims at segmenting videos into clips and tagging each clip with a textual description, which is an important task in video semantic analysis. Most existing methods, however, rely on supervised learning to train their alignment models, so their applicability is limited by the common scarcity of labeled videos. To mitigate this issue, we propose a weakly-supervised temporal action alignment method based on a novel computational optimal transport technique called the unbalanced spectral fused Gromov-Wasserstein (US-FGW) distance. Instead of using videos with known clips and corresponding textual tags, our method only requires each training video to be associated with a set of (unsorted) texts and does not need the fine-grained correspondence between frames and texts. Given such weakly-supervised video-text pairs, our method jointly trains the representation models of the video frames and the texts in a probabilistic or deterministic autoencoding architecture and penalizes the US-FGW distance between the distribution of visual latent codes and that of textual latent codes. We compute the US-FGW distance efficiently by leveraging the Bregman ADMM algorithm. Furthermore, we generalize the classic contrastive learning framework and reformulate it based on the proposed US-FGW distance, which provides a new viewpoint of contrastive learning for our problem. Experimental results show that our method and its variants outperform state-of-the-art weakly-supervised temporal action alignment methods, and their results are even comparable to those of supervised learning methods on some evaluation measures. The code is available at https://github.com/hhhh1138/Temporal-Action-Alignment-USFGW.
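To make the core idea concrete, below is a minimal NumPy sketch of penalizing a fused Gromov-Wasserstein (FGW) discrepancy between a set of visual latent codes and a set of textual latent codes. It is an illustration only: it implements the standard balanced, entropically regularized FGW with a Sinkhorn inner solver, not the paper's unbalanced spectral variant (US-FGW) or its Bregman ADMM solver, and all function and variable names here are hypothetical.

```python
import numpy as np

def sinkhorn(cost, p, q, eps=0.05, n_iters=200):
    """Entropic OT coupling with marginals p and q (standard Sinkhorn iterations)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u + 1e-16)
        u = p / (K @ v + 1e-16)
    return u[:, None] * K * v[None, :]

def fgw_loss(X, Y, alpha=0.5, eps=0.05, n_outer=20):
    """Balanced entropic FGW between visual codes X (n, d) and textual codes Y (m, d).

    Hypothetical sketch; the paper's US-FGW is unbalanced, fuses structure
    information spectrally, and is solved with Bregman ADMM instead.
    """
    n, m = len(X), len(Y)
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    # Wasserstein term: cross-modal squared-Euclidean cost, rescaled for stability.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    C /= C.max() + 1e-16
    # Gromov term: intra-modal structure (pairwise distance) matrices.
    Cx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Cy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    Cx /= Cx.max() + 1e-16
    Cy /= Cy.max() + 1e-16

    def gw_linearization(T):
        # [L(Cx, Cy) (x) T]_{ij} = sum_{kl} (Cx_ik - Cy_jl)^2 T_kl for the squared loss.
        return ((Cx ** 2) @ p)[:, None] + ((Cy ** 2) @ q)[None, :] - 2.0 * Cx @ T @ Cy.T

    T = np.outer(p, q)                                # initial coupling
    for _ in range(n_outer):
        # Heuristic alternating scheme: linearize the Gromov term, re-solve entropic OT.
        cost = (1.0 - alpha) * C + alpha * gw_linearization(T)
        T = sinkhorn(cost, p, q, eps)
    # FGW objective value at the final coupling.
    return (1.0 - alpha) * (C * T).sum() + alpha * (gw_linearization(T) * T).sum()

# Toy usage: random "frame" and "text" latent codes of the same dimension.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 16)), rng.normal(size=(8, 16))
print(fgw_loss(X, Y, alpha=0.5))
```

In the weakly-supervised setting described in the abstract, a differentiable loss of this kind (in the paper, the US-FGW distance computed via Bregman ADMM) would be added to the autoencoding objective so that the distribution of frame codes is pushed toward the distribution of text codes without any frame-level annotation.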
Pages: 12