Semantic2Graph: graph-based multi-modal feature fusion for action segmentation in videos

Cited by: 0
Authors
Junbin Zhang
Pei-Hsuan Tsai
Meng-Hsun Tsai
Affiliations
[1] National Cheng Kung University, Department of Computer Science and Information Engineering
[2] National Cheng Kung University, Institute of Manufacturing Information and Systems
[3] National Yang Ming Chiao Tung University, Department of Computer Science
Source
Applied Intelligence | 2024 / Volume 54
Keywords
Video action segmentation; Graph neural networks; Computer vision; Semantic features; Multi-modal fusion
DOI
Not available
Abstract
Video action segmentation has been widely applied in many fields. Most previous studies employed video-based vision models for this purpose. However, these models often rely on large receptive fields, LSTMs, or Transformers to capture long-term dependencies within videos, leading to significant computational resource requirements. Graph-based models have been proposed to address this challenge, but previous graph-based models are less accurate. Hence, this study introduces a graph-structured approach, named Semantic2Graph, to model long-term dependencies in videos, thereby reducing computational costs and improving accuracy. We construct a frame-level graph structure of the video. Temporal edges model the temporal relations and action order within videos. Additionally, we design positive and negative semantic edges, with corresponding edge weights, to capture both long-term and short-term semantic relationships among video actions. Node attributes comprise a rich set of multi-modal features extracted from video content, graph structures, and label text, covering visual, structural, and semantic cues. To fuse this multi-modal information effectively, we employ a graph neural network (GNN) to classify the action label of each node. Experimental results demonstrate that Semantic2Graph outperforms state-of-the-art methods, particularly on the benchmark datasets GTEA and 50Salads. Multiple ablation experiments further validate the effectiveness of semantic features in enhancing model performance. Notably, the semantic edges in Semantic2Graph capture long-term dependencies at low cost, affirming its utility in addressing the computational resource constraints of video-based vision models.
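The abstract outlines a pipeline of frame-level graph construction (temporal edges plus weighted positive/negative semantic edges), multi-modal node features, and GNN-based per-frame classification. The sketch below is only a minimal illustration of that idea, not the authors' implementation: the edge-construction rule, the use of given labels to form semantic edges, the feature dimensions, the names build_frame_graph and FrameGCN, and the plain two-layer GCN are all assumptions made for this example.

```python
# Minimal sketch (assumptions only, not the paper's code): frames are graph nodes,
# temporal edges link neighbouring frames, and weighted "semantic" edges connect
# distant frames; a small GCN fuses node features and predicts a per-frame action.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_frame_graph(num_frames, labels, pos_w=1.0, neg_w=0.5):
    """Weighted, symmetrically normalised adjacency matrix for one video."""
    A = torch.zeros(num_frames, num_frames)
    for t in range(num_frames - 1):            # temporal edges between consecutive frames
        A[t, t + 1] = A[t + 1, t] = 1.0
    # Sparse long-range "semantic" edges; here they are derived from given labels purely
    # for illustration (in the paper, semantic cues come from label text / training signals).
    for i in range(0, num_frames, 10):
        for j in range(i + 10, num_frames, 10):
            w = pos_w if labels[i] == labels[j] else neg_w
            A[i, j] = A[j, i] = w
    A = A + torch.eye(num_frames)               # self-loops
    d = A.sum(dim=1).rsqrt()                    # D^{-1/2} A D^{-1/2} normalisation
    return d.unsqueeze(1) * A * d.unsqueeze(0)


class FrameGCN(nn.Module):
    """Two-layer GCN that propagates fused multi-modal node features and
    outputs per-frame action logits."""
    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, A_hat):
        x = F.relu(A_hat @ self.fc1(x))         # propagate + transform
        return A_hat @ self.fc2(x)              # per-node (per-frame) logits

if __name__ == "__main__":
    T, D, C = 200, 2048 + 64, 10                # frames, fused feature dim, action classes (assumed)
    labels = torch.randint(0, C, (T,))
    x = torch.randn(T, D)                       # stand-in for visual + structural + semantic features
    A_hat = build_frame_graph(T, labels)
    logits = FrameGCN(D, 256, C)(x, A_hat)
    print(logits.shape)                         # torch.Size([200, 10])
```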
Pages: 2084–2099
Page count: 15