Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing

Times cited: 0
Authors
Minh Van Nguyen [1]
Viet Lai [1]
Amir Pouran Ben Veyseh [1]
Thien Huu Nguyen [1]
Affiliations
[1] Univ Oregon, Dept Comp & Informat Sci, Eugene, OR 97403 USA
Source
EACL 2021: THE 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: PROCEEDINGS OF THE SYSTEM DEMONSTRATIONS | 2021
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We introduce Trankit, a light-weight Transformer-based Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 pretrained pipelines for 56 languages. Built on a state-of-the-art pretrained language model, Trankit significantly outperforms prior multilingual NLP pipelines over sentence segmentation, part-of-speech tagging, morphological feature tagging, and dependency parsing while maintaining competitive performance for tokenization, multi-word token expansion, and lemmatization over 90 Universal Dependencies treebanks. Despite the use of a large pretrained transformer, our toolkit is still efficient in memory usage and speed. This is achieved by our novel plug-and-play mechanism with Adapters where a multilingual pretrained transformer is shared across pipelines for different languages. Our toolkit along with pretrained models and code are publicly available at: https://github.com/nlp-uoregon/trankit. A demo website for our toolkit is also available at: http://nlp.uoregon.edu/trankit. Finally, we create a demo video for Trankit at: https://youtu.be/q0KGP3zGjGc.
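The plug-and-play idea described in the abstract (one large multilingual encoder shared by all languages, with a small set of language-specific adapter weights swapped in per language) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class names `SharedEncoder`, `Adapter`, and `PlugAndPlayPipeline`, the toy character-code "embedding", and the constant adapter weights are hypothetical stand-ins, not Trankit's actual API or model.

```python
from typing import Dict, List, Optional

Vector = List[float]


class SharedEncoder:
    """Hypothetical stand-in for the large multilingual pretrained transformer."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim

    def encode(self, tokens: List[str]) -> List[Vector]:
        # Toy deterministic features from character codes (not a real model).
        return [
            [((sum(map(ord, tok)) + i) % 10) / 10.0 for i in range(self.dim)]
            for tok in tokens
        ]


class Adapter:
    """Small bottleneck layer with a residual connection: h + up(relu(down(h)))."""

    def __init__(self, down: List[Vector], up: List[Vector]) -> None:
        self.down = down  # shape: bottleneck x dim
        self.up = up      # shape: dim x bottleneck

    def __call__(self, h: Vector) -> Vector:
        z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in self.down]
        delta = [sum(w * x for w, x in zip(row, z)) for row in self.up]
        return [x + d for x, d in zip(h, delta)]


class PlugAndPlayPipeline:
    """Holds ONE encoder in memory and many lightweight per-language adapters."""

    def __init__(self, encoder: SharedEncoder) -> None:
        self.encoder = encoder
        self.adapters: Dict[str, Adapter] = {}
        self.active: Optional[str] = None

    def add_language(self, lang: str, adapter: Adapter) -> None:
        self.adapters[lang] = adapter

    def set_active(self, lang: str) -> None:
        if lang not in self.adapters:
            raise KeyError(f"no adapter registered for {lang!r}")
        self.active = lang  # only the small adapter changes; the encoder stays

    def __call__(self, tokens: List[str]) -> List[Vector]:
        if self.active is None:
            raise RuntimeError("call set_active() before running the pipeline")
        adapter = self.adapters[self.active]
        return [adapter(h) for h in self.encoder.encode(tokens)]


def constant_adapter(dim: int, bottleneck: int, value: float) -> Adapter:
    """Build an adapter whose weights all equal `value` (illustrative only)."""
    down = [[value] * dim for _ in range(bottleneck)]
    up = [[value] * bottleneck for _ in range(dim)]
    return Adapter(down, up)


pipeline = PlugAndPlayPipeline(SharedEncoder(dim=4))
pipeline.add_language("english", constant_adapter(4, 2, 0.0))
pipeline.add_language("french", constant_adapter(4, 2, 0.5))

pipeline.set_active("english")
english_out = pipeline(["hello"])
pipeline.set_active("french")
french_out = pipeline(["hello"])
```

In this sketch, switching languages replaces only the adapter lookup, so memory grows with the number of small adapters rather than with the number of full encoders. That is the efficiency argument the abstract makes, shown here under the stated simplifications.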
Pages: 80-90
Number of pages: 11