Sound of Story: Multi-modal Storytelling with Audio

Cited by: 0

Authors:

Bae, Jaeyeon [1]

Jeong, Seokhoon [2]

Kong, Seokun [1]

Han, Namgi [3]

Lee, Jae-Yon [3]

Kim, Hyounghun [1,2]

Kim, Taehwan [1,2]

Affiliations:

[1] Ulsan Natl Inst Sci & Technol, Artificial Intelligence Grad Sch, Ulsan, South Korea

[2] Ulsan Natl Inst Sci & Technol, Dept Comp Sci & Engn, Ulsan, South Korea

[3] Ulsan Natl Inst Sci & Technol, Sch Liberal Arts, Ulsan, South Korea

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023) | 2023

Funding:

National Research Foundation, Singapore;

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Storytelling is multi-modal in the real world. When one tells a story, one may use all of the visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend story understanding and telling areas by establishing a new component called background sound which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called Sound of Story (SoS), which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound and the dataset, we propose retrieval tasks between modalities, and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at: https://github.com/Sosdatasets/SoS_Dataset.

Citation

Pages: 13467-13479

Page count: 13
