Convex non-negative matrix factorization for massive datasets

被引:34
作者
Thurau, Christian [1 ]
Kersting, Kristian
Wahabzada, Mirwaes
Bauckhage, Christian [1 ]
机构
[1] Fraunhofer IAIS, Visual & Social Media Grp, D-53754 St Augustin, Germany
关键词
Matrix factorization; Low-rank approximation; Data mining; Information retrieval; Large-scale data analysis; SCENE; HULL;
D O I
10.1007/s10115-010-0352-6
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Non-negative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It is used to factorize a non-negative data matrix into two non-negative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first resulting factor are interpreted as "cluster centroids" of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convex-NMF suffer from high computational costs and therefore hardly apply to large-scale data analysis problems. This paper presents a new framework for convex-NMF that allows for an efficient factorization of data matrices of millions of data points. Triggered by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convex-hull NMF are twofold. First, for a growing number of data points the expected size of the convex hull, i.e. the number of its vertices, grows much slower than the dataset. Second, distance preserving low-dimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convex-hull NMF compares favorably to convex-NMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to large-scale, real-world datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft (A (R))guilds, respectively.
引用
收藏
页码:457 / 478
页数:22
相关论文
共 31 条
[1]   On classification and segmentation of massive audio data streams [J].
Aggarwal, Charu C. .
KNOWLEDGE AND INFORMATION SYSTEMS, 2009, 20 (02) :137-156
[2]  
AITCHISON J, 1982, J ROY STAT SOC B, V44, P139
[3]  
[Anonymous], PROC
[4]  
Cai D, 2008, P IEEE INT C DAT MIN
[5]   Non-negative matrix factorization for semi-supervised data clustering [J].
Chen, Yanhua ;
Rege, Manjeet ;
Dong, Ming ;
Hua, Jing .
KNOWLEDGE AND INFORMATION SYSTEMS, 2008, 17 (03) :355-379
[6]   ARCHETYPAL ANALYSIS [J].
CUTLER, A ;
BREIMAN, L .
TECHNOMETRICS, 1994, 36 (04) :338-347
[7]   Convex and Semi-Nonnegative Matrix Factorizations [J].
Ding, Chris ;
Li, Tao ;
Jordan, Michael I. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2010, 32 (01) :45-55
[8]  
Donoho D., 2004, Advances in neural information processing systems, P16
[9]   Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition [J].
Drineas, Petros ;
Kannan, Ravi ;
Mahoney, Michael W. .
SIAM JOURNAL ON COMPUTING, 2006, 36 (01) :184-206
[10]  
FALOUTSOS C, 1995, P ACM SIGMOD C