InteractNet: Social Interaction Recognition for Semantic-rich Videos

Cited: 0
Authors
Lyu, Yuanjie [1 ]
Qin, Penggang [1 ]
Xu, Tong [1 ]
Zhu, Chen [1 ,2 ]
Chen, Enhong [1 ]
Affiliations
[1] University of Science and Technology of China, Hefei, Anhui, China
[2] BOSS Zhipin, Beijing, China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal analysis; video-and-language understanding; graph convolutional network
DOI
10.1145/3663668
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The overwhelming surge of online video platforms has raised an urgent need for social interaction recognition techniques. Compared with simple short-term actions, long-term social interactions in semantic-rich videos reflect more complicated semantics, such as character relationships or emotions, and can therefore better support downstream applications, e.g., story summarization and fine-grained clip retrieval. However, because social interactions last longer, overlap heavily with one another, and involve multiple characters, dynamic scenes, and multi-modal cues, traditional solutions for short-term action recognition are likely to fail in this task. To address these challenges, in this article, we propose a hierarchical graph-based system, named InteractNet, to recognize social interactions from a multi-modal perspective. Specifically, our approach first generates a semantic graph for each sampled frame by integrating multi-modal cues, and then learns node representations as short-term interaction patterns via an adapted GCN module. Along this line, global interaction representations are accumulated through a sub-clip identification module, which effectively filters out irrelevant information and resolves temporal overlaps between interactions. Finally, the associations among simultaneous interactions are captured by constructing a global-level character-pair graph to predict the final social interactions. Comprehensive experiments on publicly available datasets demonstrate the effectiveness of our approach compared with state-of-the-art baseline methods.
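To make the described pipeline concrete, the following is a minimal, illustrative PyTorch sketch of the kind of processing the abstract outlines: per-frame semantic graphs pass through a simple graph-convolution layer, node features are pooled over a sub-clip, and a character pair is scored against interaction labels. The class names, dimensions, mean pooling, and pair-concatenation scoring are assumptions for illustration only, not the authors' actual InteractNet modules.

# Illustrative sketch only (not the authors' code): per-frame semantic graphs
# -> GCN node features -> temporal pooling over a sub-clip -> character-pair logits.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(D^-1 (A + I) H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, feats):
        # adj: (N, N) frame-level semantic graph; feats: (N, in_dim) multi-modal node cues
        adj_hat = adj + torch.eye(adj.size(0))                 # add self-loops
        deg = adj_hat.sum(dim=1, keepdim=True).clamp(min=1.0)  # node degrees for normalization
        return torch.relu(self.linear(adj_hat @ feats / deg))

class InteractionSketch(nn.Module):
    """Pools per-frame GCN outputs over a sub-clip, then scores one character pair."""
    def __init__(self, node_dim=64, hidden=128, num_interactions=8):
        super().__init__()
        self.gcn = SimpleGCNLayer(node_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, num_interactions)

    def forward(self, frame_graphs, pair):
        # frame_graphs: list of (adj, feats) for sampled frames in one sub-clip
        # pair: (i, j) node indices of the two characters of interest
        per_frame = [self.gcn(adj, feats) for adj, feats in frame_graphs]
        clip = torch.stack(per_frame).mean(dim=0)              # simple temporal average pooling
        pair_repr = torch.cat([clip[pair[0]], clip[pair[1]]], dim=-1)
        return self.classifier(pair_repr)                      # interaction logits for this pair

# Toy usage: 3 sampled frames, 5 nodes (characters/objects), random cues.
graphs = [(torch.randint(0, 2, (5, 5)).float(), torch.randn(5, 64)) for _ in range(3)]
logits = InteractionSketch()(graphs, pair=(0, 1))
print(logits.shape)  # torch.Size([8])

In the actual system, the fixed mean pooling would correspond to the learned sub-clip identification module, and the simple pair scoring to the global-level character-pair graph described in the abstract.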
Pages: 21