Distributed Semantic Communications for Multimodal Audio-Visual Parsing Tasks

Cited by: 0
Authors
Wang, Penghong [1 ,2 ]
Li, Jiahui [3 ]
Liu, Chen [1 ,2 ]
Fan, Xiaopeng [1 ,2 ]
Ma, Mengyao [3 ]
Wang, Yaowei [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci, Harbin 150001, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Huawei, Wireless Technol Lab, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distributed semantic communication; deep joint source-channel coding; audio-visual parsing; auxiliary information feedback; CHANNEL; COMPRESSION;
DOI
10.1109/TGCN.2024.3374700
CLC number
TN [Electronic technology, communication technology];
Discipline code
0809;
Abstract
Semantic communication has advanced significantly in single-modal, single-task scenarios, but its progress remains limited in multimodal, multi-task transmission settings. To address this issue, this paper investigates a distributed semantic communication system for the audio-visual parsing (AVP) task. The system acquires audio-visual information from distributed terminals and performs multi-task analysis on the far-end server, covering event categorization and boundary recording. We propose a distributed deep joint source-channel coding scheme with auxiliary information feedback to implement this system, aiming to improve parsing performance and reduce bandwidth consumption during communication. Specifically, the server first receives the audio feature from the audio terminal and then feeds the semantic information extracted from that feature back to the visual terminal. The visual terminal interactively processes the received semantic information together with its own visual information before encoding and transmission. The audio and visual semantic information received at the far-end server is then processed and parsed. Experimental results demonstrate a significant reduction in transmission bandwidth and notable performance improvements across various evaluation metrics for the distributed AVP task compared with current state-of-the-art methods.
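The four-step feedback protocol described in the abstract (audio terminal → server, server → visual terminal, visual terminal → server, server-side parsing) can be sketched as follows. This is a minimal illustrative simulation, not the paper's trained networks: the encoders are placeholder averaging functions, the dimensions are arbitrary, and the channel is modeled as simple AWGN at a fixed SNR.

```python
import numpy as np

rng = np.random.default_rng(0)

def awgn(x, snr_db):
    # Additive white Gaussian noise channel at the given SNR (in dB).
    power = np.mean(x ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), x.shape)

def audio_encoder(audio):
    # Placeholder semantic encoder: pool the raw clip to a 32-dim feature.
    return audio.reshape(-1, 4).mean(axis=1)

def server_extract_semantics(audio_feat):
    # Server-side semantic summary fed back to the visual terminal.
    return audio_feat[:8]

def visual_encoder(frames, audio_semantics):
    # Interactive encoding: visual feature conditioned on audio feedback.
    vis_feat = frames.reshape(-1, 4).mean(axis=1)
    return np.concatenate([vis_feat[:24], audio_semantics])

def server_parse(audio_feat, visual_feat):
    # Toy multi-task head: an event "category" and a boundary score.
    category = int(np.argmax(np.abs(audio_feat[:4] + visual_feat[:4])))
    boundary = float(np.tanh(visual_feat.mean()))
    return category, boundary

audio = rng.normal(size=128)   # raw audio clip (synthetic stand-in)
frames = rng.normal(size=128)  # raw visual frames (synthetic stand-in)

# 1) Audio terminal transmits its feature to the server.
a_feat = awgn(audio_encoder(audio), snr_db=10)
# 2) Server sends extracted semantics back to the visual terminal.
feedback = awgn(server_extract_semantics(a_feat), snr_db=10)
# 3) Visual terminal encodes visual info jointly with the feedback.
v_feat = awgn(visual_encoder(frames, feedback), snr_db=10)
# 4) Server parses both received semantic streams.
category, boundary = server_parse(a_feat, v_feat)
print(category, boundary)
```

The bandwidth saving in the actual scheme comes from transmitting compact learned semantics rather than raw signals, and from letting the feedback remove audio-visual redundancy before the visual terminal encodes; here that is only mimicked by the small feature dimensions.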
Pages: 1707-1716
Page count: 10