Spoken Language Identification Using Rhythmic Categorization: Syllable-Timed and Stress-Timed

Cited by: 0
Authors
Dey, Spandan [1]
Saha, Goutam [1]
Affiliations
[1] Indian Institute of Technology Kharagpur, Department of E&ECE, Kharagpur, West Bengal, India
Source
2024 INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS, SPCOM 2024 | 2024
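Keywords
Spoken language identification; syllable-timed language; stress-timed language; wav2vec 2.0; VoxLingua107
DOI
10.1109/SPCOM60851.2024.10631595
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract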
Recent spoken language identification (LID) systems have relied on acoustic features or self-supervised representations. These features are also useful for other speech classification tasks, such as speaker and emotion recognition, so they are not specific to LID and may not capture all of the information relevant to distinguishing between languages. Motivated by this, our study seeks to enhance LID system performance by incorporating language-specific features. Specifically, we investigate the rhythmic categorization of world languages into syllable-timed and stress-timed classes. We choose seven syllable-timed and seven stress-timed languages across different language families from the VoxLingua107 corpus and extract acoustic correlates of speech rhythm that can distinguish between the two classes. To first develop a pre-stage language rhythm classifier, we use mel-frequency cepstral coefficient (MFCC) features and representations extracted from wav2vec 2.0 XLSR with the ECAPA-TDNN architecture, together with post-pooling embedding fusion using the proposed rhythm metrics. The embeddings derived from this classifier are then used during the training of the 14-class language classifier. Our results demonstrate that effective integration of linguistic information yields substantial LID performance improvement, reducing the equal error rate (EER) to 0.89%.
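The abstract reports performance as an equal error rate (EER). As a rough illustration of how that metric is commonly computed from detection scores, the sketch below sweeps a threshold and finds the point where the false acceptance and false rejection rates meet. This is not the authors' evaluation code; the function name `equal_error_rate` and the toy scores are assumptions made for this example.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Hypothetical helper for illustration; not taken from the paper.
    Returns (eer, threshold), with eer as a fraction in [0, 1].
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, nontarget]))

    # False rejection: a target trial scores below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance: a non-target trial scores at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to equal.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (assumed): higher means "more likely the claimed language".
toy_target = [0.9, 0.8, 0.7, 0.6]
toy_nontarget = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(toy_target, toy_nontarget)
print(f"EER = {100 * eer:.2f}% at threshold {threshold:.2f}")
```

On real evaluation sets the two error curves rarely cross exactly at an observed score, so toolkits often interpolate between neighbouring thresholds; the averaging used here is a simple stand-in for that step.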
Pages: 5