MULTIFIN: A Dataset for Multilingual Financial NLP

Cited by: 0

Authors:

Jorgensen, Rasmus Kaer [1, 2]

Brandt, Oliver

Hartmann, Mareike [4, 5]

Dai, Xiang [3]

Igel, Christian [1]

Elliott, Desmond [1]

Affiliations:

[1] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark

[2] PricewaterhouseCoopers PwC, London, England

[3] CSIRO, Data61, Canberra, Australia

[4] Saarland Univ, Dept Language Sci & Technol, Saarbrucken, Germany

[5] German Res Ctr Artificial Intelligence DFKI, Kaiserslautern, Germany

Source:

17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023 | 2023

Keywords:

TEXT;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MULTIFIN, a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset has a hierarchical label structure that provides two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both 'label by native-speaker' and 'translate-then-label' approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, shows that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.


Pages: 894-909

Page count: 16
