Distributed Semantic Communications for Multimodal Audio-Visual Parsing Tasks

Citations: 0
Authors
Wang, Penghong [1 ,2 ]
Li, Jiahui [3 ]
Liu, Chen [1 ,2 ]
Fan, Xiaopeng [1 ,2 ]
Ma, Mengyao [3 ]
Wang, Yaowei [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci, Harbin 150001, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Huawei, Wireless Technol Lab, Shenzhen 518129, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Distributed semantic communication; deep joint source-channel coding; audio-visual parsing; auxiliary information feedback; channel; compression
DOI
10.1109/TGCN.2024.3374700
CLC number
TN [Electronic technology; communication technology]
Discipline code
0809
Abstract
Semantic communication has advanced significantly in single-modal, single-task scenarios, but its progress remains limited in multimodal, multi-task transmission contexts. To address this issue, this paper investigates a distributed semantic communication system for the audio-visual parsing (AVP) task. The system acquires audio-visual information from distributed terminals and conducts multi-task analysis on the far-end server, which involves event categorization and boundary recording. We propose a distributed deep joint source-channel coding scheme with auxiliary information feedback to implement this system, aiming to enhance parsing performance and reduce bandwidth consumption during communication. Specifically, the server first receives the audio feature from the audio terminal and then sends the semantic information extracted from that feature back to the visual terminal. The visual terminal interactively processes the received semantic information together with its local visual information before encoding and transmission. The audio and visual semantic information received at the far-end server is then processed and parsed. Experimental results demonstrate a significant reduction in transmission bandwidth consumption and notable performance improvements across various evaluation metrics for the distributed AVP task compared with current state-of-the-art methods.
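The message flow described in the abstract (audio-feature uplink, semantic feedback to the visual terminal, fused visual uplink, server-side multi-task parsing) can be sketched as a toy pipeline. This is a minimal illustration only: the function names, the mean-pooling "encoders", and the additive "fusion" below are placeholder assumptions, not the paper's learned deep joint source-channel codecs.

```python
# Toy sketch of the distributed AVP pipeline with auxiliary information feedback.
# All codecs here are placeholder arithmetic on plain lists; the real system
# uses trained deep joint source-channel coding networks over noisy channels.

def audio_encode(audio):
    # Audio terminal: compress raw samples into a short feature (mean pooling).
    return [sum(audio) / len(audio)]

def extract_semantic_feedback(audio_feat):
    # Server: derive compact semantic side information for the visual terminal.
    return [f * 0.5 for f in audio_feat]

def visual_encode(visual, feedback):
    # Visual terminal: fuse the fed-back audio semantics with local visual data
    # before encoding, so redundant cross-modal information is not retransmitted.
    fused = [v + feedback[0] for v in visual]
    return [sum(fused) / len(fused)]

def server_parse(audio_feat, visual_feat):
    # Server: toy stand-in for multi-task analysis (event category + boundary).
    score = audio_feat[0] + visual_feat[0]
    return {"event": "speech" if score > 1.0 else "silence", "score": score}

audio = [0.2, 0.6, 0.4]   # stand-in for an audio clip
visual = [0.8, 1.0, 0.6]  # stand-in for video features

a_feat = audio_encode(audio)            # uplink 1: audio terminal -> server
fb = extract_semantic_feedback(a_feat)  # feedback: server -> visual terminal
v_feat = visual_encode(visual, fb)      # uplink 2: fused visual -> server
result = server_parse(a_feat, v_feat)
print(result)
```

The design point the sketch mirrors is that the feedback link lets the visual terminal condition its transmission on audio semantics, which is how the scheme trades a small downlink cost for reduced uplink bandwidth.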
Pages: 1707-1716
Page count: 10