One Model to Rule Them All: A Universal Transformer for Biometric Matching

Cited by: 1
Authors
Abdrakhmanova, Madina [1 ]
Yermekova, Assel [1 ]
Barko, Yuliya [1 ]
Ryspayev, Vladislav [1 ]
Jumadildayev, Medet [1 ]
Varol, Huseyin Atakan [1 ]
Affiliation
[1] Nazarbayev Univ, Inst Smart Syst & Artificial Intelligence, Astana 010000, Kazakhstan
Keywords
Transformers; Vectors; Feature extraction; Visualization; Speech recognition; Biological system modeling; Task analysis; Biometrics (access control); Biometric matching; cross-modal matching; face verification; face-audio association; metric learning; multimodal verification; speaker verification; transformer; PROTOTYPICAL NETWORKS; DEEP; SPEECH;
DOI
10.1109/ACCESS.2024.3426602
Chinese Library Classification (CLC): TP [Automation Technology; Computer Technology]
Subject Classification Code: 0812
Abstract
This study introduces the first single-branch network designed to tackle a spectrum of biometric matching scenarios, including unimodal, multimodal, cross-modal, and missing-modality situations. Our method adapts the prototypical network loss to concurrently train on audio, visual, and thermal data within a unified multimodal framework. By converting all three data types into image format, we employ the Vision Transformer (ViT) architecture with shared model parameters, enabling the encoder to map input modalities into a unified vector space. The multimodal prototypical network loss ensures that vector representations of the same speaker are proximate regardless of their original modalities. Evaluation on the SpeakingFaces and VoxCeleb datasets covers a wide range of scenarios and demonstrates the effectiveness of our approach. The trimodal model achieves an Equal Error Rate (EER) of 0.27% on the SpeakingFaces test split, surpassing all previously reported results. Moreover, trained only once, it performs comparably to unimodal and bimodal counterparts, including unimodal audio, visual, and thermal, as well as audio-visual, audio-thermal, and visual-thermal configurations. In cross-modal evaluation on the VoxCeleb1 test set (audio versus visual), our approach yields an EER of 24.1%, again outperforming state-of-the-art models. This underscores the effectiveness of our unified model in addressing diverse scenarios for biometric verification.
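As an illustrative aid only (this is not the authors' code): the sketch below shows, under stated assumptions, how a shared encoder and a prototypical-network loss of the kind described in the abstract could be set up in PyTorch. The names SharedEncoder and multimodal_prototypical_loss are hypothetical, and the small convolutional encoder is only a stand-in for the shared-parameter ViT; the key idea is that prototypes are per-speaker means of support embeddings (pooled across modalities), and queries are scored by distance to every prototype.

```python
# Minimal sketch of a multimodal prototypical loss (assumed, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedEncoder(nn.Module):
    """Stand-in for the shared-parameter ViT: any module mapping an
    image-formatted input (audio spectrogram, visual, or thermal frame)
    to a fixed-size embedding works for this sketch."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings


def multimodal_prototypical_loss(support: torch.Tensor,
                                 query: torch.Tensor) -> torch.Tensor:
    """support: (n_speakers, n_support, d) embeddings, pooled over modalities.
    query:   (n_speakers, n_query, d) embeddings, possibly from other modalities.
    Prototypes are per-speaker means of the support embeddings; each query is
    classified by negative squared Euclidean distance to every prototype."""
    n_spk, n_query, d = query.shape
    prototypes = support.mean(dim=1)                    # (n_spk, d)
    q = query.reshape(-1, d)                            # (n_spk * n_query, d)
    dists = torch.cdist(q, prototypes) ** 2             # (n_spk * n_query, n_spk)
    labels = torch.arange(n_spk).repeat_interleave(n_query)
    return F.cross_entropy(-dists, labels)


if __name__ == "__main__":
    enc = SharedEncoder()
    # Toy episode: 4 speakers, 2 support samples (e.g., two modalities) and 1 query each.
    support_imgs = torch.randn(4, 2, 3, 64, 64)
    query_imgs = torch.randn(4, 1, 3, 64, 64)
    support = enc(support_imgs.flatten(0, 1)).view(4, 2, -1)
    query = enc(query_imgs.flatten(0, 1)).view(4, 1, -1)
    loss = multimodal_prototypical_loss(support, query)
    loss.backward()
    print(f"episode loss: {loss.item():.4f}")
```

Because the same encoder embeds every modality and the loss pulls same-speaker embeddings toward a common prototype, this kind of objective supports unimodal, multimodal, cross-modal, and missing-modality matching with a single model, which is the behavior the abstract reports.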
Pages: 96729-96739 (11 pages)