TIVA-KG: A Multimodal Knowledge Graph with Text, Image, Video and Audio

Times cited: 7

Authors:

Wang, Xin [1]

Meng, Benyuan [2]

Chen, Hong [2]

Meng, Yuan [2]

Lv, Ke [3]

Zhu, Wenwu [1]

Affiliations:

[1] Tsinghua Univ, Dept Comp Sci & Technol, BNRist, Beijing, Peoples R China

[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China

[3] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

Knowledge Graph; Multimodal; Text; Video; Image; Audio

DOI:

10.1145/3581783.3612266

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Knowledge graphs serve as a powerful tool for boosting model performance in applications spanning computer vision, natural language processing, multimedia data mining and more. Human knowledge acquisition is inherently multimodal, involving text, image, video and audio. However, no existing multimodal knowledge graph covers all four modalities simultaneously, which severely limits their expressive power and, in turn, the performance gains they can bring to downstream tasks. In this paper, we propose TIVA-KG, a multimodal Knowledge Graph covering Text, Image, Video and Audio, which can benefit various downstream tasks. TIVA-KG has two significant advantages over existing knowledge graphs: i) coverage of up to four modalities (text, image, video and audio), and ii) the capability of triplet grounding, which grounds multimodal relations to triples instead of entities. We further design a Quadruple Embedding Baseline (QEB) model to validate the necessity and efficacy of considering four modalities in a KG. We conduct extensive experiments evaluating TIVA-KG with various knowledge graph representation approaches on the link prediction task, demonstrating the benefits and necessity of introducing multiple modalities and triplet grounding. TIVA-KG is expected to promote further research in the community on mining multimodal knowledge graphs and on related downstream tasks. TIVA-KG is now available at our website: http://mn.cs.tsinghua.edu.cn/tivakg.
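To make the difference between entity-level and triplet-level grounding concrete, here is a minimal Python sketch. It assumes a toy in-memory store in which each media item is tagged with one of the four modalities; the class names `Modality`, `Triple` and `MultimodalKG`, the method names, and any relation names or file paths used with them are hypothetical illustrations, not TIVA-KG's actual schema, file format or API.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    # The four modalities named in the abstract.
    TEXT = "text"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"


@dataclass(frozen=True)
class Triple:
    # A (head, relation, tail) fact; frozen so it can be used as a dict key.
    head: str
    relation: str
    tail: str


@dataclass
class MultimodalKG:
    """Hypothetical toy store contrasting entity-level and triplet-level grounding."""

    triples: set[Triple] = field(default_factory=set)
    # Entity-level grounding: media describe a single entity in isolation.
    entity_media: dict[str, list[tuple[Modality, str]]] = field(
        default_factory=lambda: defaultdict(list)
    )
    # Triplet-level grounding: media describe the whole (head, relation, tail) fact.
    triple_media: dict[Triple, list[tuple[Modality, str]]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_triple(self, triple: Triple, media=()) -> None:
        """Store a fact and attach media that depict the relation itself."""
        self.triples.add(triple)
        self.triple_media[triple].extend(media)

    def ground_entity(self, entity: str, media) -> None:
        """Attach media to a single entity (the entity-level alternative)."""
        self.entity_media[entity].extend(media)

    def media_for(self, triple: Triple, modality: Modality | None = None) -> list[str]:
        """Return URIs grounded to this exact fact, optionally filtered by modality."""
        items = self.triple_media.get(triple, [])
        return [uri for m, uri in items if modality is None or m == modality]
```

For example, after `kg.add_triple(Triple("dog", "makes_sound", "bark"), [(Modality.AUDIO, "audio/bark.wav")])`, calling `kg.media_for(...)` on that triple returns only the barking audio, whereas an image attached to the entity `"dog"` alone could not distinguish this fact from any other fact about dogs. Keying media by the full triple is the design choice that lets each relation carry its own evidence.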


Pages: 2391-2399

Page count: 9

References: 41
