MULTIFIN: A Dataset for Multilingual Financial NLP

Authors
Jorgensen, Rasmus Kaer [1, 2]
Brandt, Oliver
Hartmann, Mareike [4, 5]
Dai, Xiang [3 ]
Igel, Christian [1 ]
Elliott, Desmond [1 ]
Affiliations
[1] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark
[2] PricewaterhouseCoopers PwC, London, England
[3] CSIRO, Data61, Canberra, Australia
[4] Saarland Univ, Dept Language Sci & Technol, Saarbrucken, Germany
[5] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
TEXT;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. Developing and evaluating such models requires multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both 'label by native-speaker' and 'translate-then-label' approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
Pages: 894-909
Page count: 16