PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Cited by: 5
Authors
Liu, Hexin [1 ]
Perera, Leibny Paola Garcia [2 ]
Khong, Andy W. H. [1 ]
Styles, Suzy J. [3 ]
Khudanpur, Sanjeev [2 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore
[2] Johns Hopkins Univ, CLSP & HLT COE, Baltimore, MD 21218 USA
[3] Nanyang Technol Univ, Sch Social Sci, Psychol, Singapore, Singapore
Source
INTERSPEECH 2022, 2022
Funding
National Research Foundation of Singapore; U.S. National Science Foundation
Keywords
Language identification; acoustic phonetics; phonotactics; self-supervised learning; phoneme segmentation; recognition
DOI
10.21437/Interspeech.2022-354
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
We propose a novel model that hierarchically incorporates phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and an LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of "phonotactic" embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization achieves the highest LID performance among all models, with over 40% relative improvement in average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices indicate that our proposed method outperforms the CNN-Trans model on languages of the same cluster in the NIST LRE 2017 data. A comparison between predicted phoneme boundaries and the corresponding audio spectrograms illustrates how the model leverages phoneme information for LID.
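
To make the described architecture concrete, the following is a minimal PyTorch sketch of a PHO-LID-style model, not the authors' implementation: the class name PhoLidSketch, the layer sizes, the fixed-length segment pooling (seg_len), and the two head designs are illustrative assumptions. In the paper, the "phonotactic" embeddings are derived using the self-supervised phoneme segmentation signal rather than the fixed windows used here.

import torch
import torch.nn as nn

class PhoLidSketch(nn.Module):
    """Illustrative PHO-LID-style model: a shared CNN feeds both a
    phoneme-boundary head and a transformer-based LID classifier."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_langs=10):
        super().__init__()
        # Shared CNN module: encodes phonemic and language-identity cues.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Auxiliary head producing framewise phoneme-boundary scores
        # (trained with a self-supervised segmentation objective).
        self.boundary_head = nn.Linear(hidden_dim, 1)
        # Transformer encoder over segment-level "phonotactic" embeddings.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.lid_head = nn.Linear(hidden_dim, num_langs)

    def forward(self, x, seg_len=20):
        # x: (batch, time, feat_dim) acoustic features, e.g. log-mel filterbanks.
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)      # (batch, time, hidden)
        boundary_logits = self.boundary_head(h).squeeze(-1)  # (batch, time)
        # Pool fixed-length segments into an embedding sequence; the paper
        # instead forms segments from the phoneme information (simplified here).
        b, t, d = h.shape
        t_trim = (t // seg_len) * seg_len  # assumes t >= seg_len
        segs = h[:, :t_trim].reshape(b, -1, seg_len, d).mean(dim=2)
        z = self.transformer(segs)                  # (batch, num_segs, hidden)
        lid_logits = self.lid_head(z.mean(dim=1))   # utterance-level LID logits
        return lid_logits, boundary_logits

# Example: two 3-second utterances of 80-dim features at 100 frames per second.
feats = torch.randn(2, 300, 80)
lid_logits, boundary_logits = PhoLidSketch()(feats)

For the multi-task optimization the abstract refers to, the usual recipe is a weighted sum of the LID cross-entropy and the segmentation loss, e.g. loss = ce(lid_logits, labels) + lam * seg_loss(boundary_logits); the weighting scheme and the exact self-supervised segmentation loss are assumptions here, not taken from the paper.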
Pages: 2233-2237 (5 pages)