MULTIFIN: A Dataset for Multilingual Financial NLP

Cited by: 0
Authors
Jorgensen, Rasmus Kaer [1 ,2 ]
Brandt, Oliver
Hartmann, Mareike [4 ,5 ]
Dai, Xiang [3 ]
Igel, Christian [1 ]
Elliott, Desmond [1 ]
Affiliations
[1] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark
[2] PricewaterhouseCoopers PwC, London, England
[3] CSIRO, Data61, Canberra, Australia
[4] Saarland Univ, Dept Language Sci & Technol, Saarbrucken, Germany
[5] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
TEXT;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure providing two classification tasks: multi-label and multiclass. We develop our annotation schema based on a real-world application and annotate our dataset using both 'label by native-speaker' and 'translate-then-label' approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
Pages: 894-909
Number of pages: 16