Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It

Cited by: 259
Authors
Denny, Matthew J. [1]
Spirling, Arthur [2]
Affiliations
[1] Penn State Univ, 203 Pond Lab, University Pk, PA 16802 USA
[2] NYU, Off 405,19 West 4th St, New York, NY 10012 USA
Funding
US National Science Foundation;
Keywords
statistical analysis of texts; unsupervised learning; descriptive statistics; MODEL; POSITIONS; SELECTION; SUPPORT; WORDS;
DOI
10.1017/pan.2017.44
Chinese Library Classification (CLC) number
D0 [Political science, political theory];
Discipline classification code
0302 ; 030201 ;
Abstract
Despite the popularity of unsupervised techniques for political science text-as-data research, the importance and implications of preprocessing decisions in this domain have received scant systematic attention. Yet, as we show, such decisions have profound effects on the results of real models for real data. We argue that substantive theory is typically too vague to be of use for feature selection, and that the supervised literature is not necessarily a helpful source of advice. To aid researchers working in unsupervised settings, we introduce a statistical procedure and software that examines the sensitivity of findings under alternate preprocessing regimes. This approach complements a researcher's substantive understanding of a problem by providing a characterization of the variability changes in preprocessing choices may induce when analyzing a particular dataset. In making scholars aware of the degree to which their results are likely to be sensitive to their preprocessing decisions, it aids replication efforts.
Pages: 168-189
Page count: 22