A universal information theoretic approach to the identification of stopwords

Cited by: 39
Authors
Gerlach, Martin [1]
Shi, Hanyu [1]
Amaral, Luis A. Nunes [1, 2, 3, 4]
Affiliations
[1] Northwestern Univ, Dept Chem & Biol Engn, Evanston, IL 60208 USA
[2] Northwestern Univ, Northwestern Inst Complex Syst, Evanston, IL 60208 USA
[3] Northwestern Univ, Dept Phys & Astron, Evanston, IL 60208 USA
[4] Northwestern Univ, Dept Med, Evanston, IL 60208 USA
Keywords
TEXT
DOI
10.1038/s42256-019-0112-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. As a result of the difficulty in rigorously defining stopwords, there have been few systematic studies on the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate on whether to keep or remove stopwords. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language such as in the statistical analysis of transcriptomics, audio or image corpora.

To better extract meaning from natural language, some less informative words can be removed before a model is trained, which is usually done by using manually curated lists of stopwords. A new information theoretic approach can identify uninformative words automatically and more accurately.
Pages: 606-612
Number of pages: 7