PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Cited by: 5
Authors
Liu, Hexin [1 ]
Perera, Leibny Paola Garcia [2 ]
Khong, Andy W. H. [1 ]
Styles, Suzy J. [3 ]
Khudanpur, Sanjeev [2 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore
[2] Johns Hopkins Univ, CLSP & HLT COE, Baltimore, MD 21218 USA
[3] Nanyang Technol Univ, Sch Social Sci, Psychol, Singapore, Singapore
Source
INTERSPEECH 2022, 2022
Funding
National Research Foundation of Singapore; U.S. National Science Foundation
Keywords
Language identification; acoustic phonetics; phonotactics; self-supervised learning; phoneme segmentation; recognition
DOI
10.21437/Interspeech.2022-354
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
We propose a novel model that hierarchically incorporates phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and an LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of "phonotactic" embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization achieves the highest LID performance among all models, with over 40% relative improvement in average cost on AP17-OLR data compared to a CNN-Trans model optimized only for LID. The visualized confusion matrices indicate that our proposed method outperforms the CNN-Trans model on languages of the same cluster in the NIST LRE 2017 data. A comparison between predicted phoneme boundaries and the corresponding audio spectrograms illustrates how the model leverages phoneme information for LID.
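
To make the described architecture concrete, the following is a minimal PyTorch sketch of a PHO-LID-style model, not the authors' implementation: the class name PhoLidSketch, the layer sizes, the fixed-length segment pooling (seg_len), and the two head designs are illustrative assumptions. In the paper, the "phonotactic" embeddings are derived using the self-supervised phoneme segmentation signal rather than the fixed windows used here.

import torch
import torch.nn as nn

class PhoLidSketch(nn.Module):
    """Illustrative PHO-LID-style model: a shared CNN feeds both a
    phoneme-boundary head and a transformer-based LID classifier."""
    def __init__(self, feat_dim=80, hidden_dim=256, num_langs=10):
        super().__init__()
        # Shared CNN module: encodes phonemic and language-identity cues.
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Auxiliary head producing framewise phoneme-boundary scores
        # (trained with a self-supervised segmentation objective).
        self.boundary_head = nn.Linear(hidden_dim, 1)
        # Transformer encoder over segment-level "phonotactic" embeddings.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.lid_head = nn.Linear(hidden_dim, num_langs)

    def forward(self, x, seg_len=20):
        # x: (batch, time, feat_dim) acoustic features, e.g. log-mel filterbanks.
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)      # (batch, time, hidden)
        boundary_logits = self.boundary_head(h).squeeze(-1)  # (batch, time)
        # Pool fixed-length segments into an embedding sequence; the paper
        # instead forms segments from the phoneme information (simplified here).
        b, t, d = h.shape
        t_trim = (t // seg_len) * seg_len  # assumes t >= seg_len
        segs = h[:, :t_trim].reshape(b, -1, seg_len, d).mean(dim=2)
        z = self.transformer(segs)                  # (batch, num_segs, hidden)
        lid_logits = self.lid_head(z.mean(dim=1))   # utterance-level LID logits
        return lid_logits, boundary_logits

# Example: two 3-second utterances of 80-dim features at 100 frames per second.
feats = torch.randn(2, 300, 80)
lid_logits, boundary_logits = PhoLidSketch()(feats)

For the multi-task optimization the abstract refers to, the usual recipe is a weighted sum of the LID cross-entropy and the segmentation loss, e.g. loss = ce(lid_logits, labels) + lam * seg_loss(boundary_logits); the weighting scheme and the exact self-supervised segmentation loss are assumptions here, not taken from the paper.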
Pages: 2233-2237 (5 pages)