Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

Cited by: 1
Authors
Truong, Duc-Tuan [1 ]
Tao, Ruijie [2 ]
Nguyen, Tuan [3 ]
Luong, Hieu-Thi [1 ]
Lee, Kong Aik [4 ]
Chng, Eng Siong [1 ]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Natl Univ Singapore, Singapore, Singapore
[3] ASTAR, Inst Infocomm Res I2R, Singapore, Singapore
[4] Hong Kong Polytech Univ, Hong Kong, Peoples R China
Source
INTERSPEECH 2024 | 2024
Funding
National Research Foundation, Singapore;
Keywords
synthetic speech detection; attention learning; ASVspoof challenges;
D O I
10.21437/Interspeech.2024-659
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Recent synthetic speech detectors leveraging the Transformer model achieve superior performance compared to their convolutional neural network counterparts. This improvement can be attributed to the powerful modeling ability of multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationships among input tokens. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 dataset show that, with only 0.03M additional parameters, the TCM module outperforms the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the largest improvement in detecting synthetic speech.
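The abstract does not spell out the internals of the TCM module; the following is only a minimal, hypothetical sketch of the general idea of combining a temporal self-attention branch with a channel re-weighting branch in one block. The class name, branch design, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's exact TCM module): augment standard
# multi-head self-attention over time with a parallel channel-gating branch,
# so the block models both temporal and feature-channel dependencies.
import torch
import torch.nn as nn


class TemporalChannelAttention(nn.Module):
    """Toy temporal-channel attention block (illustrative only)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Temporal branch: standard MHSA over the sequence axis.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Channel branch: squeeze-and-excitation style gating over feature channels.
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(),
            nn.Linear(dim // 4, dim),
            nn.Sigmoid(),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        t_out, _ = self.temporal_attn(x, x, x)        # temporal dependencies
        c_weights = self.channel_gate(x.mean(dim=1))  # (batch, dim) channel weights
        c_out = x * c_weights.unsqueeze(1)            # re-weight feature channels
        return self.norm(x + t_out + c_out)           # fuse both branches residually


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)   # e.g. 100 frames of 256-dim front-end features
    block = TemporalChannelAttention(dim=256)
    print(block(feats).shape)          # torch.Size([2, 100, 256])
```

The design choice illustrated here is that temporal attention alone ignores which frequency channels carry spoofing artifacts; adding even a lightweight channel branch keeps the parameter overhead small, consistent with the abstract's claim of only 0.03M extra parameters.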
Pages: 537-541
Page count: 5