Convex non-negative matrix factorization for massive datasets

被引:0
作者
Christian Thurau
Kristian Kersting
Mirwaes Wahabzada
Christian Bauckhage
机构
[1] Fraunhofer IAIS,
[2] Schloss Birlinghoven,undefined
来源
Knowledge and Information Systems | 2011年 / 29卷
关键词
Matrix factorization; Low-rank approximation; Data mining; Information retrieval; Large-scale data analysis;
D O I
暂无
中图分类号
学科分类号
摘要
Non-negative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It is used to factorize a non-negative data matrix into two non-negative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first resulting factor are interpreted as “cluster centroids” of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convex-NMF suffer from high computational costs and therefore hardly apply to large-scale data analysis problems. This paper presents a new framework for convex-NMF that allows for an efficient factorization of data matrices of millions of data points. Triggered by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convex-hull NMF are twofold. First, for a growing number of data points the expected size of the convex hull, i.e. the number of its vertices, grows much slower than the dataset. Second, distance preserving low-dimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convex-hull NMF compares favorably to convex-NMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to large-scale, real-world datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft ®guilds, respectively.
引用
收藏
页码:457 / 478
页数:21
相关论文
共 34 条
[1]  
Aggarwal C(2009)On classification and segmentation of massive audio data streams Knowl Inf Syst 20 137-156
[2]  
Aitchison J(1982)The statistical analysis of compositional data J R Stat Soc B 44 139-177
[3]  
Chen Y(2008)Non-negative matrix factorization for semi-supervised data clustering Knowl Inf Syst 17 355-379
[4]  
Rege M(1994)Archetypal analysis Technometrics 36 338-347
[5]  
Dong M(2009)Convex and semi-nonnegative matrix factorizations IEEE Trans Pattern Anal Mach Intell 32 45-55
[6]  
Hua J(2006), Fast Monte Carlo algorithms III: computing a compressed approixmate matrix decomposition SIAM J Comput 36 184-206
[7]  
Cutler A(2009)The unreasonable effectiveness of data IEEE Intell Syst 24 8-12
[8]  
Breiman L(2004)Non-negative matrix factorization with sparseness constraints J Mach Learn 5 1457-1469
[9]  
Ding C(1999)Limit theorems for the convex hull of random points in higher dimensions Trans Am Math Soc 351 4337-4363
[10]  
Li T(2008)Non-negative matrix factorization: ill-posedness and a geometric algorithm Pattern Recogn 42 918-928