Cantonese sentence dataset for lip-reading

Cited by: 0
Authors
Xiao, Yewei [1, 2]
Liu, Xuanming [1, 2]
Teng, Lianwei [1, 2]
Zhu, Aosu [1, 2]
Tian, Picheng [1, 2]
Huang, Jian [1, 2]
Affiliations
[1] Xiangtan Univ, Sch Informat Engn, Xiangtan 411105, Peoples R China
[2] Xiangtan Univ, Key Lab Intelligent Comp & Informat Proc, Minist Educ, Xiangtan, Peoples R China
Keywords
computer vision; image processing; image recognition; neural nets; pattern recognition; VASCULAR CONTRIBUTIONS; COGNITIVE IMPAIRMENT; TUMOR SEGMENTATION; ISCHEMIC-STROKE; CIRCLE; WILLIS; DEMENTIA; MODEL;
DOI
10.1049/ipr2.13123
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Lip-reading deciphers speech by observing lip movements, without relying on audio data. Rapid advances in deep learning have significantly improved lip-reading for both English and Chinese; however, research on dialects such as Cantonese remains scarce. Most Chinese lip-reading datasets focus on Mandarin, and only a few address Cantonese. To bridge this gap, a sentence-level Cantonese lip-reading dataset, Cantonese Lip-reading Sentences, is introduced, comprising over 500 unique speakers and more than 30,000 samples. To reflect real-world scenarios, no restrictions are imposed on gender, age, posture, lighting conditions, or speech rate. The pipeline used to collect and construct the dataset is described in detail, and a new visual frontend, the 3D-visual attention net, is introduced. This frontend combines the advantages of convolution and self-attention to extract fine-grained lip-region features. These features are then fed into a conformer backend for temporal sequence modelling, achieving comparable performance on the Chinese Mandarin Lip Reading (CMLR) dataset, Lip Reading Sentences 2 (LRS2), Lip Reading Sentences 3 (LRS3), and Cantonese Lip-reading Sentences datasets. Benchmark tests on Cantonese Lip-reading Sentences demonstrate the challenges it poses, providing a new research foundation for dialect lip-reading and fostering the advancement of Cantonese lip-reading tasks.
Pages: 2645-2664
Number of pages: 20