Disambiguating and Specifying Social Actors in Big Data: Using Wikipedia as a Data Source for Demographic Information

Cited by: 5
Authors
Poschmann, Philipp [1]
Goldenstein, Jan [1]
Affiliations
[1] Friedrich Schiller Univ, Sch Econ & Business Adm, Jena, Germany
Keywords
social actor; text analysis; big data; natural language processing; Wikipedia; media; organizations; discourse; completeness; knowledge; accuracy; grammar; science; DBpedia
DOI
10.1177/0049124119882481
Chinese Library Classification (CLC) number
O1 [Mathematics]; C [Social Sciences, General]
Discipline classification codes
03; 0303; 0701; 070101
Abstract
Despite the recent and ongoing progress in using text-mining tools to automatically analyze large text corpora, there remains significant potential to facilitate the study of social action in social science research. In this context, particularly the disambiguation (who is referred to in a text?) and specification (which demographic characteristics are present?) of social actors, currently a manual job, remains a challenge. This article demonstrates a reliable and accurate software architecture for social scientists who are interested in automatically detecting, disambiguating, and demographically specifying social actors (i.e., persons and organizations) in large text collections. The backbone of our software architecture is the online encyclopedia Wikipedia as a currently unexploited data source of a large amount of accurately prepared information. We illustrate how our software architecture detects and disambiguates social actors in large text corpora and retrieves their respective demographic information. Overall, we evaluate the reliability and accuracy of our software architecture across seven different social settings and facilitate an intuitive sense of the comprehensive applicability of our software architecture. We end by not only highlighting the benefits of our software architecture for social science research but also pointing to the limitations of using Wikipedia as a data source.
Pages: 887-925
Page count: 39