Performance evaluation of tokenizers in large language models for the Assamese language

Cited by: 0
Authors
Sagar Tamang [1]
Dibya Jyoti Bora [1]
Affiliations
[1] The Assam Kaziranga University, Jorhat
Keywords
Assamese; GPT; LLM; SUTRA; Tokenizer; Tokens
DOI
10.1007/s41870-025-02454-8
Abstract
This study evaluates the performance of tokenizers in Large Language Models (LLMs) for the Assamese language, a low-resource language from Northeast India. Tokenization plays a pivotal role in LLMs, influencing their efficiency and adaptability. The research compares tokenizers using Normalized Sequence Length (NSL) and token count. The findings reveal that the SUTRA tokenizer outperforms others with the lowest NSL and token count, highlighting its suitability for Assamese. The study identifies significant gaps in Assamese NLP, emphasizing the need for robust tokenizer evaluations to advance research in low-resource languages. © Bharati Vidyapeeth's Institute of Computer Applications and Management 2025.
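The abstract names NSL and token count but does not define them. A minimal sketch follows, assuming the definition of NSL commonly used in tokenizer evaluations: the ratio of a tokenizer's sequence length to a baseline tokenizer's sequence length, averaged over samples. The two toy tokenizers and the sample sentences are hypothetical stand-ins, not the tokenizers or data evaluated in the study.

```python
from typing import Callable, List, Sequence

# A tokenizer is modelled here as any function from text to a list of tokens.
Tokenizer = Callable[[str], List[str]]


def token_count(tokenize: Tokenizer, texts: Sequence[str]) -> int:
    """Total number of tokens the tokenizer produces over all texts."""
    return sum(len(tokenize(text)) for text in texts)


def normalized_sequence_length(
    tokenize: Tokenizer, baseline: Tokenizer, texts: Sequence[str]
) -> float:
    """Mean per-sample ratio of tokenizer length to baseline length.

    Assumed definition (not taken from the abstract): values below 1.0
    mean the tokenizer yields shorter sequences than the baseline.
    """
    ratios = []
    for text in texts:
        baseline_length = len(baseline(text))
        if baseline_length == 0:
            continue  # skip samples the baseline cannot tokenize
        ratios.append(len(tokenize(text)) / baseline_length)
    if not ratios:
        raise ValueError("no samples with a non-empty baseline tokenization")
    return sum(ratios) / len(ratios)


# Hypothetical toy tokenizers used only for illustration.
def whitespace_tokenize(text: str) -> List[str]:
    return text.split()


def character_tokenize(text: str) -> List[str]:
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    samples = ["অসমীয়া ভাষা", "মই ভাত খাওঁ"]  # illustrative Assamese text
    print(token_count(whitespace_tokenize, samples))
    print(normalized_sequence_length(whitespace_tokenize, character_tokenize, samples))
```

Under this assumed definition, a lower NSL means the tokenizer represents the same text in fewer tokens than the baseline, which is the sense in which the abstract reports the lowest NSL as the best result.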
Pages: 2329-2332
Page count: 3
Related papers
26 in total
[1]  
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser L., Polosukhin I., Attention is all you need, Advances in Neural Information Processing Systems, 30, pp. 5998-6008, (2017)
[2]  
Islam S., Elmekki H., Elsebai A., Bentahar J., Drawel N., Rjoub G., Pedrycz W., A comprehensive survey on applications of transformers for deep learning tasks, Expert Syst Appl, (2023)
[3]  
Toraman C., Yilmaz E.H., Sahinuc F., Ozcelik O., Impact of tokenization on language models: An analysis for Turkish, ACM Trans Asian Low-Resour Lang Inf Process, (2023)
[4]  
Schuster M., Nakajima K., Japanese and Korean voice search, In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149-5152, (2012)
[5]  
Hayase J., Liu A., Choi Y., Oh S., Smith N.A., Data Mixture Inference: What Do BPE Tokenizers Reveal about Their Training Data?, (2024)
[6]  
Sennrich R., Haddow B., Birch A., Neural machine translation of rare words with subword units, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1, pp. 1715-1725, (2016)
[7]  
Verma K.K., Singh B.M., Dixit A., A review of supervised and unsupervised machine learning techniques for suspicious behavior recognition in intelligent surveillance system, Int J Inf Technol, 14, 1, pp. 397-410, (2022)
[8]  
Bojamma A.M., Shastry C., A study on the machine learning techniques for automated plant species identification: current trends and challenges, Int J Inf Technol, 13, 3, pp. 989-995, (2021)
[9]  
Pathak D., Nandi S., Sarmah P., AsPOS: Assamese part of speech tagger using deep learning approach, 2022 IEEE/ACS 19th International Conference on Computer Systems and Applications (AICCSA), IEEE, Dec, (2022)
[10]  
Kumar R., Bora M.J., Part-of-speech annotation of English-Assamese code-mixed texts: Two approaches, In: Proceedings of the First International Workshop on Language Cognition and Computational Models, pp. 94-103, (2018)