Reproducible Extraction of Cross-lingual Topics (rectr)

Cited by: 20
Authors
Chan, Chung-Hong [1]
Zeng, Jing [2]
Wessler, Hartmut [3]
Jungblut, Marc [4]
Welbers, Kasper [5]
Bajjalieh, Joseph W. [6]
van Atteveldt, Wouter [5]
Althaus, Scott L. [6]
Affiliations
[1] Univ Mannheim, Mannheimer Zentrum Europa Sozialforsch, D-68131 Mannheim, Germany
[2] Univ Zurich, Dept Commun & Media Res, Zurich, Switzerland
[3] Univ Mannheim, Inst Media & Commun Studies, Mannheim, Germany
[4] LMU Munchen, Dept Media & Commun, Munich, Germany
[5] Vrije Univ Amsterdam, Dept Commun Sci, Amsterdam, Netherlands
[6] Univ Illinois, Cline Ctr Adv Social Res, Urbana, IL USA
Funding
National Endowment for the Humanities (USA)
Keywords
SENTIMENT ANALYSIS; TEXT; TRANSLATION;
DOI
10.1080/19312458.2020.1812555
Chinese Library Classification number
G2 [Information and Knowledge Dissemination]
Subject classification code
05; 0503
Abstract
With global media content databases and online content now available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with those extracted using methods from the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied to tracking news topics across time and languages.
Pages: 285-305
Page count: 21