Iterative Hard Thresholding for Keyword Extraction from Large Text Corpora

被引:0
作者
Yadlowsky, Steve [1 ]
Nakkarin, Preetum [1 ]
Wang, Jingyan [1 ]
Sharma, Rishi [1 ]
El Ghaoui, Laurent [1 ]
机构
[1] Univ Calif Berkeley, Elect Engn & Comp Sci, Berkeley, CA 94720 USA
来源
2014 13TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA) | 2014年
关键词
D O I
10.1109/ICMLA.2014.101
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
To better understand and analyze text corpora, such as the news, it is often useful to extract keywords that are meaningfully associated with a given topic. A corpus of documents labeled by their topic can be used to approach this as a learning problem. We consider this problem through the lens of statistical text analysis, using bag-of-words frequencies as features for a sparse linear model. We demonstrate, through numerical experiments, that iterative hard thresholding (IHT) is a practical and effective algorithm for keyword-extraction from large text corpora. In fact, our implementation of IHT can quickly analyze more than 800,000 documents, returning keywords comparable to algorithms solving a Lasso problem-formulation, with significantly less computation time. Further, we generalize the analysis of the IHT algorithm to show that it is stable for rank deficient matrices, as those arising from our bag-of-words model often are.
引用
收藏
页码:588 / 593
页数:6
相关论文
共 28 条
[1]  
[Anonymous], 2008, Introduction to information retrieval
[2]  
Berger H, 2004, LECT NOTES ARTIF INT, V3339, P998
[3]   Accelerated iterative hard thresholding [J].
Blumensath, Thomas .
SIGNAL PROCESSING, 2012, 92 (03) :752-756
[4]   Iterative hard thresholding for compressed sensing [J].
Blumensath, Thomas ;
Davies, Mike E. .
APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS, 2009, 27 (03) :265-274
[5]   Iterative Thresholding for Sparse Approximations [J].
Blumensath, Thomas ;
Davies, Mike E. .
JOURNAL OF FOURIER ANALYSIS AND APPLICATIONS, 2008, 14 (5-6) :629-654
[6]  
Byrne C., 2006, ITERATIVE ALGORITHMS
[7]   Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization [J].
Donoho, DL ;
Elad, M .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2003, 100 (05) :2197-2202
[8]   Least angle regression - Rejoinder [J].
Efron, B ;
Hastie, T ;
Johnstone, I ;
Tibshirani, R .
ANNALS OF STATISTICS, 2004, 32 (02) :494-499
[9]   Understanding Large Text Corpora via Sparse Machine Learning [J].
El Ghaoui, Laurent ;
Vu Pham ;
Li, Guan-Cheng ;
Viet-An Duong ;
Srivastava, Ashok ;
Bhaduri, Kanishka .
STATISTICAL ANALYSIS AND DATA MINING, 2013, 6 (03) :221-242
[10]  
El Ghaoui Laurent., 2011, C INTELLIGENT DATA U, P159