Self-supervised Learning Representation based Accent Recognition with Persistent Accent Memory

Cited by: 1
Authors
Li, Rui [1 ,6 ]
Xie, Zhiwei [1 ,6 ]
Xu, Haihua [2 ]
Peng, Yizhou [3 ]
Liu, Hexin [4 ]
Huang, Hao [1 ,5 ]
Chng, Eng Siong [4 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi, Peoples R China
[2] Bytedance, Beijing, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore, Singapore
[5] Xinjiang Key Lab Multilingual Informat Technol, Urumqi, Peoples R China
[6] AISG NTU NUS Joint Speech Lab, Singapore, Singapore
Source
INTERSPEECH 2023 | 2023
Funding
National Key R&D Program of China;
Keywords
WavLM; Self-supervised learning; representation; accent recognition; persistent accent memory; Conformer;
DOI
10.21437/Interspeech.2023-1702
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Accent recognition (AR) is challenging due to the scarcity of training data and because accents are entangled with speaker and regional characteristics. This paper aims to improve AR performance from two perspectives. First, to alleviate the data-insufficiency problem, we employ self-supervised learning representations (SSLRs) extracted from a pre-trained model to build the AR models. With the help of SSLRs, the AR models achieve significant performance improvements over traditional acoustic features. Second, we propose a persistent accent memory (PAM) that serves as contextual knowledge to bias the AR models. Accent embeddings extracted from all training data by the encoder of the AR models are clustered to form an accent codebook, i.e., the PAM. In addition, we propose diverse attention mechanisms to investigate the optimal utilization of PAM. We observe that the best performance is obtained by selecting only the most relevant accent embeddings.
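The abstract's PAM pipeline, i.e., clustering training-set accent embeddings into a codebook and then attending over that codebook with a top-k relevance selection, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of plain k-means, dot-product attention, and all dimensions and hyperparameters are assumptions for clarity.

```python
import numpy as np


def build_pam_codebook(embeddings, n_clusters=8, n_iters=50, seed=0):
    """Cluster accent embeddings into a fixed codebook (PAM-style memory).

    `embeddings`: (N, D) array of per-utterance accent embeddings, standing in
    for those the paper extracts with the AR encoder over all training data.
    Plain k-means here; the paper does not specify the clustering algorithm.
    """
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each embedding to its nearest center.
        dists = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = embeddings[assign == k].mean(axis=0)
    return centers  # (n_clusters, D) accent codebook


def attend_to_pam(query, codebook, top_k=None):
    """Dot-product attention of one utterance embedding over the codebook.

    With `top_k` set, only the most relevant entries survive the softmax
    (illustrating the selection variant the abstract reports working best).
    Returns a contextual accent vector that could bias the AR model.
    """
    scores = codebook @ query  # (n_clusters,)
    if top_k is not None:
        masked = np.full_like(scores, -np.inf)
        keep = np.argsort(scores)[-top_k:]
        masked[keep] = scores[keep]
        scores = masked
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ codebook  # (D,)
```

Under the top-k mask, the softmax weights of the discarded entries are exactly zero, so the context vector is a convex combination of only the selected codebook rows.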
Pages: 1968-1972
Page count: 5