Cross-Lingual Classification of Political Texts Using Multilingual Sentence Embeddings

Cited by: 10

Authors:

Licht, Hauke [1]

Affiliations:

[1] Univ Cologne, Cologne Ctr Comparat Polit, Inst Polit Sci & European Affairs, Cologne, Germany

Source:

POLITICAL ANALYSIS | 2023 / Vol. 31 / Issue 03

Keywords:

multilingual embedding; multilingual text analysis; supervised machine learning; SENTIMENT ANALYSIS; WORDS; TRANSLATION; POSITIONS;

DOI:

10.1017/pan.2022.29

Chinese Library Classification (CLC) number:

D0 [Political science, political theory];

Discipline classification code:

0302 ; 030201 ;

Abstract:

Established approaches to analyzing multilingual text corpora require either a duplication of analysts' efforts or high-quality machine translation (MT). In this paper, I argue that multilingual sentence embedding (MSE) is an attractive alternative approach to language-independent text representation. To support this argument, I evaluate MSE for cross-lingual supervised text classification. Specifically, I assess how reliably MSE-based classifiers detect manifesto sentences' topics and positions compared to classifiers trained using bag-of-words representations of machine-translated texts, and how this depends on the amount of training data. These analyses show that when training data are relatively scarce (e.g., 20K or fewer labeled sentences), MSE-based classifiers can be more reliable and are at least no less reliable than their MT-based counterparts. Furthermore, I examine how reliably MSE-based classifiers label sentences written in languages not in the training data, focusing on the task of discriminating sentences that discuss the issue of immigration from those that do not. This analysis shows that compared to the within-language classification benchmark, such "cross-lingual transfer" tends to result in fewer reliability losses when relying on the MSE instead of the MT approach. This study thus presents an important addition to the cross-lingual text analysis toolkit.
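
The "cross-lingual transfer" setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `embed` function is a hypothetical toy stand-in for a pretrained multilingual sentence encoder (it maps a few hand-picked words from three languages onto shared concept axes), the nearest-centroid classifier is a simple substitute for whatever supervised model is actually used, and the example sentences are invented for illustration. The point is only that a classifier trained on sentences in some languages can label sentences in an unseen language when all of them share one vector space.

```python
from math import sqrt

# Toy stand-in for a pretrained multilingual encoder (an assumption, not the
# paper's model): words from three languages map onto shared concept axes.
CONCEPTS = {
    "immigration": 0, "einwanderung": 0, "inmigración": 0,
    "border": 0, "grenze": 0, "frontera": 0,
    "tax": 1, "steuer": 1, "impuestos": 1,
    "budget": 1, "haushalt": 1, "presupuesto": 1,
}

def embed(sentence):
    """Return a 2-d concept-count vector (hypothetical encoder)."""
    vec = [0.0, 0.0]
    for token in sentence.lower().split():
        axis = CONCEPTS.get(token.strip(".,"))
        if axis is not None:
            vec[axis] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def fit_centroids(sentences, labels):
    """Average the embeddings of each class (nearest-centroid classifier)."""
    sums, counts = {}, {}
    for sentence, label in zip(sentences, labels):
        vec = embed(sentence)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(centroids, sentence):
    """Assign the label whose centroid is most similar to the sentence."""
    vec = embed(sentence)
    return max(centroids, key=lambda lab: cosine(centroids[lab], vec))

# Training data: English and German only (invented example sentences).
train_sentences = [
    "We will reform immigration and border policy.",
    "Die Einwanderung und die Grenze brauchen Regeln.",
    "We will cut tax and balance the budget.",
    "Die Steuer und der Haushalt werden reformiert.",
]
train_labels = ["immigration", "immigration", "other", "other"]
centroids = fit_centroids(train_sentences, train_labels)

# Cross-lingual transfer: Spanish never appeared in the training data.
test_sentences = [
    "Controlaremos la inmigración en la frontera.",
    "Bajaremos los impuestos del presupuesto.",
]
predictions = [predict(centroids, s) for s in test_sentences]
print(predictions)
```

In a real application, `embed` would be replaced by a pretrained multilingual sentence encoder and the centroid rule by a properly validated classifier; the train-on-some-languages, test-on-another structure is the part this sketch is meant to show.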


Pages: 366-379

Page count: 14

