MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information

Cited by: 2
Authors
Wang, Jianrong [1]
Huo, Yuchen [2]
Liu, Li [3]
Xu, Tianyi [1]
Li, Qi [4]
Li, Sen [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China
[3] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China
[4] Tianjin Univ, Sch Elect & Informat Engn, Tianjin, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
Audio-Visual Speech Recognition; Mandarin Audio-Visual Corpus; Azure Kinect; Depth Information; SPEECH; RECOGNITION; TECHNOLOGY;
DOI
10.21437/Interspeech.2023-823
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Audio-visual speech recognition (AVSR) is gaining increasing attention from researchers as an important part of human-computer interaction. However, the available Mandarin audio-visual datasets are limited and lack depth information. To address this issue, this work establishes MAVD, a new large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by 64 native Chinese speakers. To ensure the dataset covers diverse real-world scenarios, a pipeline for cleaning and filtering the raw text material was developed to create well-balanced reading material. In particular, Microsoft's latest data acquisition device, Azure Kinect, is used to capture depth information in addition to the traditional audio signals and RGB images during data acquisition. We also provide a baseline experiment that can be used to evaluate the effectiveness of the dataset. The dataset and code will be released at https://github.com/SpringHuo/MAVD.
Pages: 2113-2117
Number of pages: 5