MULTIFIN: A Dataset for Multilingual Financial NLP

Authors
Jorgensen, Rasmus Kaer [1, 2]
Brandt, Oliver
Hartmann, Mareike [4, 5]
Dai, Xiang [3 ]
Igel, Christian [1 ]
Elliott, Desmond [1 ]
Affiliations
[1] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark
[2] PricewaterhouseCoopers PwC, London, England
[3] CSIRO, Data61, Canberra, Australia
[4] Saarland Univ, Dept Language Sci & Technol, Saarbrucken, Germany
[5] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany
Source
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023
Keywords
TEXT;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multiclass. We develop our annotation schema based on a real-world application and annotate our dataset using both 'label by native-speaker' and 'translate-then-label' approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
Pages: 894-909
Number of pages: 16