MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Cited by: 2

Authors:

Wang, Jianrong [1]

Huo, Yuchen [2]

Liu, Li [3]

Xu, Tianyi [1]

Li, Qi [4]

Li, Sen [1]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China

[3] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China

[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

Audio-Visual Speech Recognition; Mandarin Audio-Visual Corpus; Azure Kinect; Depth Information; SPEECH; RECOGNITION; TECHNOLOGY;

DOI:

10.21437/Interspeech.2023-823

Chinese Library Classification (CLC):

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Audio-visual speech recognition (AVSR) has gained increasing attention from researchers as an important part of human-computer interaction. However, the available Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure that the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material was developed to produce well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment that can be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD.


Pages: 2113-2117

Page count: 5
