Convex Non-Negative Matrix Factorization in the Wild

被引:21
作者
Thurau, Christian [1 ]
Kersting, Kristian [1 ]
Bauckhage, Christian [1 ]
机构
[1] Fraunhofer IAIS, Schloss Birlinghoven, Sankt Augustin, Germany
来源
2009 9TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING | 2009年
关键词
data mining; matrix decomposition; data handling; non negative matrix factorization; archetypal analysis; social network analysis;
D O I
10.1109/ICDM.2009.55
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Non-negative matrix factorization (NMF) has recently received a lot of attention in data mining, information retrieval, and computer vision. It factorizes a non-negative input matrix V into two non-negative matrix factors V = WH such that W describes "clusters" of the datasets. Analyzing genotypes, social networks, or images, it can be beneficial to ensure V to contain meaningful "cluster centroids", i.e., to restrict W to be convex combinations of data points. But how can we run this convex NMF in the wild, i.e., given millions of data points? Triggered by the simple observation that each data point is a convex combination of vertices of the data convex hull, we propose to restrict W further to be vertices of the convex hull. The benefits of this convex-hull NMF approach are twofold. First, the expected size of the convex hull of, for example, n random Gaussian points in the plane is Omega(root log n), i.e., the candidate set typically grows much slower than the data set. Second, distance preserving low-dimensional embeddings allow one to compute candidate vertices efficiently. Our extensive experimental evaluation shows that convex-hull NMF compares favorably to convex NMF for large data sets both in terms of speed and reconstruction quality. Moreover, we show that our method can easily be applied to large-scale, real-world data sets, in our case consisting of 1.6 million images respectively 150 million votes on World of Warcraft (R) guilds.
引用
收藏
页码:523 / 532
页数:10
相关论文
共 22 条
[1]  
AITCHISON J, 1982, J ROY STAT SOC B, V44, P139
[2]  
[Anonymous], PROC
[3]  
[Anonymous], 2008, P IEEE C COMP VIS PA
[4]  
[Anonymous], P IEEE INT C DAT MIN
[5]  
Cai D, 2008, P IEEE INT C DAT MIN
[6]   ARCHETYPAL ANALYSIS [J].
CUTLER, A ;
BREIMAN, L .
TECHNOMETRICS, 1994, 36 (04) :338-347
[7]  
DEBERG M, 2000, COMPUTATIONAL GEOMET, pCH6
[8]  
DING CH, IEEE T PATTERN ANAL, V99
[9]   Neighborliness of randomly projected simplices in high dimensions [J].
Donoho, DL ;
Tanner, J .
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2005, 102 (27) :9452-9457
[10]  
Faloutsos C., 1995, P ACM SIGMOD