Visual Speech Recognition in Natural Scenes Based on Spatial Transformer Networks

Times cited: 0

Authors:

Yu, Jin [1]

Wang, Shilin [1]

Affiliations:

[1] Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai, People's Republic of China

Source:

2020 IEEE 14TH INTERNATIONAL CONFERENCE ON ANTI-COUNTERFEITING, SECURITY, AND IDENTIFICATION (ASID) | 2020

Funding:

National Natural Science Foundation of China;

Keywords:

visual speech recognition; natural scenes; spatial transformer networks; liveness detection; EXTRACTION;

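DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract: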

In this paper, we improve the performance of visual speech recognition in natural scenes using spatial transformer networks. Visual speech recognition can be applied to authentication systems as a form of liveness detection, preventing replay attacks and ensuring security. Identity authentication based on visual speech recognition may be conducted anywhere on portable electronic devices. However, natural scenes contain many variations, including diverse speaker poses, different distances from the camera, and occasional quivering of the lips, which make recognition considerably harder and degrade the performance of the authentication system. To address these challenges, we introduce spatial transformer networks (STN), which help handle such variations, especially in complex natural scenes. Considering the characteristics of lip features, we propose a new transformation network that fuses temporal and spatial information to generate the transformation parameters. The network can be inserted into existing visual speech recognition approaches and trained end to end. By taking temporal dependencies into account, it performs a better transformation to normalize the lip image sequences, reducing the difficulty of visual speech recognition in natural scenes and thereby strengthening the security of the identity authentication system. Experimental results demonstrate that our approach achieves a lower word error rate, particularly in natural scenes.
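
The normalization step mentioned in the abstract can be illustrated with a minimal sketch of the sampling half of a spatial transformer: an affine grid generator followed by bilinear sampling, applied to each frame of a lip image sequence. This is not the authors' network. The part that fuses temporal and spatial information to predict the parameters is omitted, and the 2x3 affine matrices are supplied by hand. The function names (`affine_grid`, `bilinear_sample`, `transform_sequence`), the normalized [-1, 1] coordinate convention, and the use of one matrix per frame are all assumptions made for illustration.

```python
import math


def affine_grid(theta, height, width):
    """Map every output pixel to a source location in normalized [-1, 1] space.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]] (hypothetical input;
    in an STN it would be predicted by a localization network).
    """
    grid = []
    for i in range(height):
        y = (-1.0 + 2.0 * i / (height - 1)) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = (-1.0 + 2.0 * j / (width - 1)) if width > 1 else 0.0
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a 2D image (list of rows) at the grid locations.

    Source locations outside the image contribute zero (zero padding).
    """
    height, width = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates back to pixel coordinates.
            px = (xs + 1.0) * (width - 1) / 2.0
            py = (ys + 1.0) * (height - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            wx, wy = px - x0, py - y0
            value = 0.0
            # Blend the four neighbouring pixels by their bilinear weights.
            for yy, w_row in ((y0, 1.0 - wy), (y0 + 1, wy)):
                for xx, w_col in ((x0, 1.0 - wx), (x0 + 1, wx)):
                    if 0 <= yy < height and 0 <= xx < width:
                        value += w_row * w_col * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out


def transform_sequence(frames, thetas):
    """Warp each frame with its own affine matrix (one theta per frame)."""
    warped = []
    for frame, theta in zip(frames, thetas):
        grid = affine_grid(theta, len(frame), len(frame[0]))
        warped.append(bilinear_sample(frame, grid))
    return warped
```

With the identity matrix `[[1, 0, 0], [0, 1, 0]]` each frame is returned unchanged, while a matrix such as `[[-1, 0, 0], [0, 1, 0]]` mirrors the frame horizontally. In a trained system the matrices would instead be chosen to bring differently posed or scaled lip regions into a common frame before recognition.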


Pages: 1 / 5

Page count: 5

Number of references: 27

