Cross-Lingual Classification of Political Texts Using Multilingual Sentence Embeddings

Cited by: 10
Authors
Licht, Hauke [1]
Affiliations
[1] Univ Cologne, Cologne Ctr Comparat Polit, Inst Polit Sci & European Affairs, Cologne, Germany
Keywords
multilingual embedding; multilingual text analysis; supervised machine learning; SENTIMENT ANALYSIS; WORDS; TRANSLATION; POSITIONS
DOI
10.1017/pan.2022.29
Chinese Library Classification (CLC) number
D0 [Political science, political theory]
Discipline classification code
0302; 030201
Abstract
Established approaches to analyzing multilingual text corpora require either a duplication of analysts' efforts or high-quality machine translation (MT). In this paper, I argue that multilingual sentence embedding (MSE) is an attractive alternative approach to language-independent text representation. To support this argument, I evaluate MSE for cross-lingual supervised text classification. Specifically, I assess how reliably MSE-based classifiers detect manifesto sentences' topics and positions compared to classifiers trained using bag-of-words representations of machine-translated texts, and how this depends on the amount of training data. These analyses show that when training data are relatively scarce (e.g., 20K or fewer labeled sentences), MSE-based classifiers can be more reliable and are at least no less reliable than their MT-based counterparts. Furthermore, I examine how reliably MSE-based classifiers label sentences written in languages not in the training data, focusing on the task of discriminating sentences that discuss the issue of immigration from those that do not. This analysis shows that compared to the within-language classification benchmark, such "cross-lingual transfer" tends to result in fewer reliability losses when relying on the MSE instead of the MT approach. This study thus presents an important addition to the cross-lingual text analysis toolkit.
Pages: 366-379
Number of pages: 14