MULTIFIN: A Dataset for Multilingual Financial NLP

Authors
Jorgensen, Rasmus Kaer [1, 2]
Brandt, Oliver
Hartmann, Mareike [4, 5]
Dai, Xiang [3 ]
Igel, Christian [1 ]
Elliott, Desmond [1 ]
Affiliations
[1] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark
[2] PricewaterhouseCoopers PwC, London, England
[3] CSIRO, Data61, Canberra, Australia
[4] Saarland Univ, Dept Language Sci & Technol, Saarbrucken, Germany
[5] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
TEXT
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. Developing and evaluating such models requires multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multiclass. We develop our annotation schema based on a real-world application and annotate our dataset using both 'label by native-speaker' and 'translate-then-label' approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
Pages: 894-909
Page count: 16
Related papers
50 in total
  • [31] REDFM: a Filtered and Multilingual Relation Extraction Dataset
    Cabot, Pere-Lluis Huguet
    Tedeschi, Simone
    Ngomo, Axel-Cyrille Ngonga
    Navigli, Roberto
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4326 - 4343
  • [32] VoxEL: A Benchmark Dataset for Multilingual Entity Linking
    Rosales-Mendez, Henry
    Hogan, Aidan
    Poblete, Barbara
    SEMANTIC WEB - ISWC 2018, PT II, 2018, 11137 : 170 - 186
  • [33] Sentiment Analysis of Multilingual Tweets Based on Natural Language Processing (NLP)
    Bera, Abhijit
    Ghose, Mrinal Kanti
    Pal, Dibyendu Kumar
    INTERNATIONAL JOURNAL OF SYSTEM DYNAMICS APPLICATIONS, 2021, 10 (04)
  • [34] Editor's Note: Special Issue on Educational NLP for a Multilingual World
    Alexandron, Giora
    Klebanov, Beata Beigman
    Komachi, Mamoru
    Zesch, Torsten
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2024, : 1293 - 1293
  • [35] NLPOP: a Dataset for Popularity Prediction of Promoted NLP Research on Twitter
    Obadiec, Leo
    Tutek, Martin
    Snajder, Jan
    PROCEEDINGS OF THE 12TH WORKSHOP ON COMPUTATIONAL APPROACHES TO SUBJECTIVITY, SENTIMENT & SOCIAL MEDIA ANALYSIS, 2022, : 286 - 292
  • [36] MultiHumES: Multilingual Humanitarian Response Dataset for Extractive Summarization
    Yela-Bello, Jenny Paola
    Oglethorpe, Ewan
    Rekabsaz, Navid
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1713 - 1717
  • [37] MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
    Brugger, Tobias
    Sturmer, Matthias
    Niklaus, Joel
    PROCEEDINGS OF THE 19TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND LAW, ICAIL 2023, 2023, : 42 - 51
  • [38] MultiSubs: A Large-scale Multimodal and Multilingual Dataset
    Wang, Josiah
    Figueiredo, Josiel
    Specia, Lucia
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6776 - 6785
  • [39] MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset
    Hennig, Leonhard
    Thomas, Philippe
    Moeller, Sebastian
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 3785 - 3801
  • [40] Multilingual character recognition dataset for Moroccan official documents
    Benaissa, Ali
    Bahri, Abdelkhalak
    El Allaoui, Ahmad
    DATA IN BRIEF, 2024, 52