Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

Cited by: 0
Authors
Liu, Nayu [1 ,2 ]
Sun, Xian [1 ,2 ]
Yu, Hongfeng [1 ]
Zhang, Wenkai [1 ]
Xu, Guangluan [1 ]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Key Lab Network Informat Syst Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack fine-grained multimodal interactions among the multisource inputs. Moreover, unlike other multimodal tasks, this task involves longer multimodal sequences with more redundancy and noise. To address these two issues, we propose a multistage fusion network with a fusion forget gate module, which builds upon the multiencoder-decoder approach by modeling fine-grained interactions between the multisource modalities through a multistep fusion schema and by controlling the flow of redundant information between long multimodal sequences via a forgetting module. Experimental results on the How2 dataset show that our proposed model achieves new state-of-the-art performance. Comprehensive analysis empirically verifies the effectiveness of our fusion schema and forgetting module on multiple encoder-decoder architectures. In particular, when using high-noise ASR transcripts (WER > 30%), our model still achieves performance close to that of the ground-truth transcript model, which reduces manual annotation cost.
Pages: 1834-1845
Page count: 12