SEF-Net: Speaker Embedding Free Target Speaker Extraction Network

Cited by: 1
Authors
Zeng, Bang [1 ,2 ]
Suo, Hongbin [3 ]
Wan, Yulong [3 ]
Li, Ming [1 ,2 ]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Wuhan, Peoples R China
[2] Duke Kunshan Univ, Data Sci Res Ctr, Kunshan, Peoples R China
[3] OPPO, Data&AI Engn Syst, Beijing, Peoples R China
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
Target speaker extraction; speaker embedding free; dual-path; conformer; separation;
D O I
10.21437/Interspeech.2023-1749
CLC number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
Most target speaker extraction methods use the target speaker embedding as reference information. However, the speaker embedding extracted by a speaker recognition module may not be optimal for the target speaker extraction task. In this paper, we propose the Speaker Embedding Free target speaker extraction Network (SEF-Net), a novel target speaker extraction model that does not rely on a speaker embedding. SEF-Net uses cross multi-head attention in the transformer decoder to implicitly utilize the speaker information in the reference speech's conformer encoding outputs. Experimental results show that our proposed model achieves performance comparable to other target speaker extraction models. SEF-Net provides a feasible new way to perform target speaker extraction without using a speaker embedding extractor or a speaker recognition loss function.
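The core mechanism the abstract describes is cross multi-head attention, where queries come from the mixture encoding while keys and values come from the reference speech's encoder outputs, so speaker identity is consumed frame-by-frame rather than as a single embedding vector. The following is a minimal NumPy sketch of that attention pattern only; it is illustrative, not SEF-Net's actual implementation, and all function and variable names (and the identity projections in place of learned W_q/W_k/W_v matrices) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mixture_feats, reference_feats, num_heads=4):
    """Cross multi-head attention: queries from the mixture encoding,
    keys/values from the reference-speech encoding, so speaker
    information is used without an explicit speaker embedding.
    Identity projections stand in for learned W_q, W_k, W_v (hypothetical
    simplification for this sketch)."""
    t_mix, d = mixture_feats.shape
    t_ref, _ = reference_feats.shape
    assert d % num_heads == 0
    dh = d // num_heads
    # split the feature dimension into heads: (heads, time, dh)
    q = mixture_feats.reshape(t_mix, num_heads, dh).transpose(1, 0, 2)
    k = reference_feats.reshape(t_ref, num_heads, dh).transpose(1, 0, 2)
    v = k
    # scaled dot-product scores over reference frames: (heads, t_mix, t_ref)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ v                      # (heads, t_mix, dh)
    return out.transpose(1, 0, 2).reshape(t_mix, d)

# toy shapes: 50 mixture frames, 30 reference frames, 64-dim features
rng = np.random.default_rng(0)
mix = rng.standard_normal((50, 64))
ref = rng.standard_normal((30, 64))
fused = cross_attention(mix, ref)
print(fused.shape)  # (50, 64)
```

Note that the output keeps the mixture's time axis: each mixture frame attends over all reference frames, which is how the decoder can condition on the target speaker without collapsing the reference into one vector.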
Pages: 3452-3456
Number of pages: 5
Cited references
33 items
[1]   Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation [J].
Chen, Jingjing ;
Mao, Qirong ;
Liu, Dong .
INTERSPEECH 2020, 2020, :2642-2646
[2]  
Chen Zhuo, 2017, Proc IEEE Int Conf Acoust Speech Signal Process, V2017, P246, DOI 10.1109/ICASSP.2017.7952155
[4]  
Delcroix M, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5554, DOI 10.1109/ICASSP.2018.8462661
[5]  
Elminshawi M., 2022, NEW INSIGHTS TARGET
[6]   SpEx+: A Complete Time Domain Speaker Extraction Network [J].
Ge, Meng ;
Xu, Chenglin ;
Wang, Longbiao ;
Chng, Eng Siong ;
Dang, Jianwu ;
Li, Haizhou .
INTERSPEECH 2020, 2020, :1406-1410
[7]   Conformer: Convolution-augmented Transformer for Speech Recognition [J].
Gulati, Anmol ;
Qin, James ;
Chiu, Chung-Cheng ;
Parmar, Niki ;
Zhang, Yu ;
Yu, Jiahui ;
Han, Wei ;
Wang, Shibo ;
Zhang, Zhengdong ;
Wu, Yonghui ;
Pang, Ruoming .
INTERSPEECH 2020, 2020, :5036-5040
[8]   WASE: LEARNING WHEN TO ATTEND FOR SPEAKER EXTRACTION IN COCKTAIL PARTY ENVIRONMENTS [J].
Hao, Yunzhe ;
Xu, Jiaming ;
Zhang, Peng ;
Xu, Bo .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6104-6108
[9]  
Hershey JR, 2016, INT CONF ACOUST SPEE, P31, DOI 10.1109/ICASSP.2016.7471631
[10]   Independent component analysis: algorithms and applications [J].
Hyvärinen, A ;
Oja, E .
NEURAL NETWORKS, 2000, 13 (4-5) :411-430