MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Cited by: 2

Authors:

Wang, Jianrong ^{[1]}

Huo, Yuchen ^{[2]}

Liu, Li ^{[3]}

Xu, Tianyi ^{[1]}

Li, Qi ^{[4]}

Li, Sen ^{[1]}

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China

[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China

[3] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China

[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China

Source:

INTERSPEECH 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Audio-Visual Speech Recognition; Mandarin Audio-Visual Corpus; Azure Kinect; Depth Information; SPEECH; RECOGNITION; TECHNOLOGY;

DOI:

10.21437/Interspeech.2023-823

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

Audio-visual speech recognition (AVSR) has attracted increasing attention from researchers as an important part of human-computer interaction. However, the available Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material has been developed to create well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment, which can be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD.


Pages: 2113-2117

Number of pages: 5
