Distributed Semantic Communications for Multimodal Audio-Visual Parsing Tasks

Cited by: 0
Authors
Wang, Penghong [1, 2]
Li, Jiahui [3]
Liu, Chen [1, 2]
Fan, Xiaopeng [1, 2]
Ma, Mengyao [3]
Wang, Yaowei [2]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci, Harbin 150001, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Huawei, Wireless Technol Lab, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Distributed semantic communication; deep joint source-channel coding; audio-visual parsing; auxiliary information feedback; CHANNEL; COMPRESSION;
DOI
10.1109/TGCN.2024.3374700
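Chinese Library Classification number
TN [Electronic technology, communication technology];
Discipline classification code
0809;
Abstract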
Semantic communication has advanced significantly in single-modal, single-task scenarios, but its progress has been limited in multimodal and multi-task transmission contexts. To address this issue, this paper investigates a distributed semantic communication system for the audio-visual parsing (AVP) task. The system acquires audio-visual information from distributed terminals and conducts multi-task analysis on the far-end server, which involves event categorization and boundary recording. We propose a distributed deep joint source-channel coding scheme with auxiliary information feedback to implement this system, aiming to enhance parsing performance and reduce bandwidth consumption during communication. Specifically, the server first receives the audio feature from the audio terminal and then sends the semantic information extracted from that feature back to the visual terminal. The visual terminal interactively processes the received semantic information together with its visual information before encoding and transmitting the result. The far-end server then processes and parses the received audio and visual semantic information. The experimental results demonstrate a significant reduction in transmission bandwidth consumption and notable performance improvements across various evaluation metrics for the distributed AVP task compared to current state-of-the-art methods.
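The message flow described in the abstract (audio uplink, auxiliary feedback to the visual terminal, visual uplink, joint parsing on the server) can be illustrated with a toy NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the random linear maps stand in for learned encoders and decoders, the sigmoid gate stands in for the "interactive processing" step, the channel is modeled as real-valued AWGN, and the names `awgn`, `run_round`, and all dimensions are hypothetical.

```python
import numpy as np


def awgn(x, snr_db, rng):
    """Add white Gaussian noise so the signal-to-noise ratio equals snr_db."""
    power = np.mean(x ** 2)
    noise_power = power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)


def run_round(audio_feat, visual_feat, weights, snr_db, rng):
    """One round: audio uplink, feedback downlink, visual uplink, parsing."""
    # 1) Audio terminal encodes its feature and sends it to the server.
    audio_rx = awgn(audio_feat @ weights["audio_enc"], snr_db, rng)
    # 2) Server extracts auxiliary semantic information and feeds it back.
    feedback = np.tanh(audio_rx @ weights["feedback"])
    feedback_rx = awgn(feedback, snr_db, rng)
    # 3) Visual terminal combines the feedback with its own feature
    #    (sigmoid gating is an illustrative choice), then encodes and sends.
    gate = 1.0 / (1.0 + np.exp(-feedback_rx))
    visual_rx = awgn((visual_feat * gate) @ weights["visual_enc"], snr_db, rng)
    # 4) Server parses both received streams into per-segment event scores.
    joint = np.concatenate([audio_rx, visual_rx], axis=-1)
    return joint @ weights["parser"]


rng = np.random.default_rng(0)
# Hypothetical sizes: segments, audio dim, visual dim, channel symbols, events.
T, Da, Dv, K, C = 10, 128, 512, 16, 25
weights = {
    "audio_enc": rng.normal(size=(Da, K)) / np.sqrt(Da),
    "feedback": rng.normal(size=(K, Dv)) / np.sqrt(K),
    "visual_enc": rng.normal(size=(Dv, K)) / np.sqrt(Dv),
    "parser": rng.normal(size=(2 * K, C)) / np.sqrt(2 * K),
}
scores = run_round(
    rng.normal(size=(T, Da)), rng.normal(size=(T, Dv)), weights, 10.0, rng
)
print(scores.shape)  # (10, 25)
```

In this sketch each terminal sends only K symbols per segment rather than its full feature vector, which is the sense in which a joint source-channel code trades bandwidth for task accuracy. The actual scheme in the paper uses trained networks, which are not reproduced here.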
Pages: 1707-1716
Number of pages: 10