Improving Multimodal Movie Scene Segmentation Using Mixture of Acoustic Experts

Cited by: 0

Authors:

Lin, Meng-Han [1]

Li, Jeng-Lin [1]

Lee, Chi-Chun [1]

Affiliation:

[1] National Tsing Hua University, Department of Electrical Engineering, Hsinchu, Taiwan

Source:

2022 30th European Signal Processing Conference (EUSIPCO 2022) | 2022

Keywords:

Movie; Scene Segmentation; Mixture of Experts; Multimodal Attention; Audio

DOI:

Not available

Chinese Library Classification:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Scenes are the most basic semantic units of a movie, and segmenting them is an important pre-processing step for many multimedia computing technologies. Previous scene segmentation studies introduced constraints and alignment mechanisms to cluster low-level frames and shots based on visual features and temporal properties. More recent work has extended these approaches with multimodal semantic representations, but the acoustic representations are extracted blindly by a universal pretrained model. Such methods tend to ignore the semantic meaning of the audio and the complex interaction between audio and visual representations. In this work, we introduce a mixture-of-audio-experts (MOAE) framework that integrates acoustic experts and multimodal experts for scene segmentation. The acoustic experts are trained to model different acoustic semantics, including speakers, environmental sounds, and other events. MOAE optimizes the weights among the various multimodal experts and achieves a state-of-the-art F1-score of 61.89% for scene segmentation. We visualize the expert weights in our framework to illustrate the complementary properties of the diverse experts, which lead to improved segmentation results.
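
The expert weighting mentioned in the abstract can be pictured with a generic softmax-gated mixture of experts. The sketch below is only an illustration of that general technique, not the authors' MOAE implementation: the function names, the three example experts, and all numeric values are hypothetical, and in a trained model the gate scores would come from a learned gating network rather than being set by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_probs, gate_logits):
    """Combine per-expert scene-boundary probabilities with gate weights.

    expert_probs: one boundary probability per expert, each in [0, 1].
    gate_logits:  one unnormalized gate score per expert (hypothetical
                  input; assumed to come from a gating network).
    Returns (mixed_probability, gate_weights).
    """
    if len(expert_probs) != len(gate_logits):
        raise ValueError("need one gate logit per expert")
    weights = softmax(gate_logits)
    mixed = sum(w * p for w, p in zip(weights, expert_probs))
    return mixed, weights

# Hypothetical experts: speaker, environmental sound, audio event.
probs = [0.9, 0.2, 0.4]
logits = [2.0, 0.0, 0.0]  # the gate favors the speaker expert here
mixed, weights = mix_experts(probs, logits)
is_boundary = mixed >= 0.5  # 0.5 is an assumed decision threshold
```

Because the gate weights are non-negative and sum to one, the mixed score always lies between the lowest and highest expert score, and inspecting the weights shows which expert dominated a given decision.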

Pages: 6-10

Page count: 5
