Self-supervised Learning Representation based Accent Recognition with Persistent Accent Memory

Cited by: 1
Authors
Li, Rui [1 ,6 ]
Xie, Zhiwei [1 ,6 ]
Xu, Haihua [2 ]
Peng, Yizhou [3 ]
Liu, Hexin [4 ]
Huang, Hao [1 ,5 ]
Chng, Eng Siong [4 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Bytedance, Beijing, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[5] Xinjiang Key Lab Multilingual Informat Technol, Urumqi, Peoples R China
[6] AISG NTU NUS Joint Speech Lab, Singapore, Singapore
Source
INTERSPEECH 2023, 2023
Funding
National Key R&D Program of China
Keywords
WavLM; Self-supervised learning; representation; accent recognition; persistent accent memory; Conformer;
DOI
10.21437/Interspeech.2023-1702
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206 ; 082403 ;
Abstract
Accent recognition (AR) is challenging due to the scarcity of training data and because accents are entangled with speaker and regional characteristics. This paper improves AR performance from two perspectives. First, to alleviate the data-insufficiency problem, we employ self-supervised learning representations (SSLRs) extracted from a pre-trained model to build the AR models; with the help of SSLRs, the models gain significant performance improvements over traditional acoustic features. Second, we propose a persistent accent memory (PAM) that serves as contextual knowledge to bias the AR models. Accent embeddings extracted from all training data by the encoder of the AR models are clustered to form an accent codebook, i.e., the PAM. In addition, we propose diverse attention mechanisms to investigate the optimal utilization of PAM, and we observe that the best performance is obtained by selecting the most relevant accent embeddings.
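The abstract's two PAM steps (cluster accent embeddings into a codebook, then attend over it and keep only the most relevant codes) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names `build_pam` and `bias_with_pam` are invented, plain k-means and dot-product attention are assumptions, and the encoder embeddings are stand-in random vectors.

```python
import numpy as np

def build_pam(embeddings, n_codes=8, n_iters=20, seed=0):
    """Cluster accent embeddings into a codebook (the 'persistent accent
    memory') with a minimal k-means; each codebook row summarizes one
    accent cluster found in the training data."""
    rng = np.random.default_rng(seed)
    # initialize codes from randomly chosen embeddings (fancy indexing copies)
    codebook = embeddings[rng.choice(len(embeddings), n_codes, replace=False)]
    for _ in range(n_iters):
        # assign every embedding to its nearest code
        dists = np.linalg.norm(embeddings[:, None] - codebook[None], axis=-1)
        assign = dists.argmin(axis=1)
        # move each code to the mean of its assigned members
        for k in range(n_codes):
            members = embeddings[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def bias_with_pam(query, codebook, top_k=1):
    """Attend over the codebook with dot-product relevance and build a
    context vector from only the top-k most relevant accent codes
    (hard selection of the most relevant embeddings)."""
    scores = codebook @ query                   # relevance of each code
    idx = np.argsort(scores)[::-1][:top_k]      # keep the top-k codes
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                                # softmax over selected codes
    return w @ codebook[idx]                    # biasing context vector
```

The context vector returned by `bias_with_pam` would then be fused with the encoder output (e.g., by concatenation or addition) before classification; restricting attention to the top-k codes mirrors the paper's observation that selecting only the most relevant accent embeddings works best.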
Pages: 1968-1972
Page count: 5