Lip Feature Disentanglement for Visual Speaker Authentication in Natural Scenes

Cited by: 1
Authors
He, Yi [1]
Yang, Lei [1]
Wang, Shilin [1]
Liew, Alan Wee-Chung [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Elect Informat & Elect Engn, Shanghai 200240, Peoples R China
[2] Griffith Univ, Sch Informat & Commun Technol, Gold Coast, Qld 4222, Australia
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Lips; Authentication; Deepfakes; Visualization; Data mining; Shape; Lip biometrics; disentangled representation learning; metric learning; DeepFake spoofs;
DOI
10.1109/TCSVT.2024.3405640
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Recent studies have shown that lip shape and movement can serve as an effective biometric for speaker authentication. By adopting a random prompt-text scheme, lip-based authentication systems can also achieve good liveness detection performance in laboratory scenarios. However, with the increasingly widespread use of mobile applications, authentication systems face additional practical difficulties, such as complex backgrounds and limited user samples, which degrade the performance of current methods. To address these problems, a new deep neural network, the Triple-feature Disentanglement Network for Visual Speaker Authentication (TDVSA-Net), is proposed in this paper to extract discriminative and disentangled lip features for visual speaker authentication in the random prompt-text scenario. Three decoupled lip features are extracted by TDVSA-Net and fed into corresponding modules to authenticate both the prompt text and the speaker's identity: a content feature that infers the speech content, a physiological lip feature that describes the static lip shape and appearance, and a behavioral lip feature that captures the speaker's unique lip-movement patterns during utterance. Experimental results demonstrate that, compared with several state-of-the-art visual speaker authentication methods, the proposed TDVSA-Net extracts more discriminative and robust lip features, boosting both content recognition and identity authentication performance against human impostors and DeepFake attacks.
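The abstract describes TDVSA-Net only at a high level. As a rough illustration of the triple-feature idea, the minimal PyTorch sketch below uses a shared frame encoder feeding three branches (content, physiological, behavioral), with one head recognizing the spoken content for prompt verification and another producing an identity embedding for metric-learning-based authentication. All module names, dimensions, and head designs here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a triple-feature lip encoder. Assumed input: a batch
# of grayscale lip-crop clips of shape B x T x 1 x 64 x 64. Everything
# below is an illustrative assumption, NOT the published TDVSA-Net.
import torch
import torch.nn as nn

class TripleFeatureSketch(nn.Module):
    def __init__(self, feat_dim=128, num_phonemes=40):
        super().__init__()
        # Shared per-frame CNN encoder.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Three branches intended to disentangle the shared representation.
        self.content_rnn = nn.GRU(64, feat_dim, batch_first=True)   # speech content
        self.physio_fc = nn.Linear(64, feat_dim)                    # static shape/appearance
        self.behavior_rnn = nn.GRU(64, feat_dim, batch_first=True)  # movement dynamics
        # Heads: content recognition (prompt check) and identity embedding.
        self.content_head = nn.Linear(feat_dim, num_phonemes)
        self.identity_head = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, clips):
        b, t = clips.shape[:2]
        frames = self.frame_cnn(clips.flatten(0, 1)).view(b, t, -1)  # B x T x 64
        content, _ = self.content_rnn(frames)                        # per-frame content
        physio = self.physio_fc(frames.mean(dim=1))                  # time-averaged statics
        behavior, _ = self.behavior_rnn(frames)
        behavior = behavior[:, -1]                                   # summarized dynamics
        logits = self.content_head(content)                          # B x T x num_phonemes
        identity = self.identity_head(torch.cat([physio, behavior], dim=-1))
        return logits, nn.functional.normalize(identity, dim=-1)

# Usage: verify the prompt text from `logits` (e.g. a CTC decode over
# phonemes) and compare `identity` against an enrolled template by
# cosine similarity; accept only if both checks pass.
clips = torch.randn(2, 16, 1, 64, 64)
logits, identity = TripleFeatureSketch()(clips)
```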
Pages: 9898-9909
Page count: 12