DC-NMF: nonnegative matrix factorization based on divide-and-conquer for fast clustering and topic modeling

被引:0
作者
Rundong Du
Da Kuang
Barry Drake
Haesun Park
机构
[1] Georgia Institute of Technology,School of Mathematics
[2] University of California,Department of Mathematics
[3] Georgia Institute of Technology,Georgia Tech Research Institute
[4] Georgia Institute of Technology,School of Computational Science and Engineering
来源
Journal of Global Optimization | 2017年 / 68卷
关键词
Constrained low rank approximation; Nonnegative matrix factorization; Divide and conquer; Clustering; Topic modeling; Text analysis; Scalable algorithms;
D O I
暂无
中图分类号
学科分类号
摘要
The importance of unsupervised clustering and topic modeling is well recognized with ever-increasing volumes of text data available from numerous sources. Nonnegative matrix factorization (NMF) has proven to be a successful method for cluster and topic discovery in unlabeled data sets. In this paper, we propose a fast algorithm for computing NMF using a divide-and-conquer strategy, called DC-NMF. Given an input matrix where the columns represent data items, we build a binary tree structure of the data items using a recently-proposed efficient algorithm for computing rank-2 NMF, and then gather information from the tree to initialize the rank-k NMF, which needs only a few iterations to reach a desired solution. We also investigate various criteria for selecting the node to split when growing the tree. We demonstrate the scalability of our algorithm for computing general rank-k NMF as well as its effectiveness in clustering and topic modeling for large-scale text data sets, by comparing it to other frequently utilized state-of-the-art algorithms. The value of the proposed approach lies in the highly efficient and accurate method for initializing rank-k NMF and the scalability achieved from the divide-and-conquer approach of the algorithm and properties of rank-2 NMF. In summary, we present efficient tools for analyzing large-scale data sets, and techniques that can be generalized to many other data analytics problem domains along with an open-source software library called SmallK.
引用
收藏
页码:777 / 798
页数:21
相关论文
共 60 条
  • [1] Blei DM(2003)Latent dirichlet allocation J. Mach. Learn. Res. 3 993-1022
  • [2] Ng AY(2011)Graph regularized nonnegative matrix factorization for data representation IEEE Trans. Pattern Anal. Mach. Intell. 33 1548-1560
  • [3] Jordan MI(2008)Low-dimensional polytope approximation and its applications to nonnegative matrix factorization SIAM J. Sci. Comput. 30 1131-1155
  • [4] Cai D(2009)Fast local algorithms for large scale nonnegative matrix and tensor factorizations IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E92A 708-721
  • [5] He X(1993)Nonnegative ranks, decompositions, and factorizations of nonnegative matrices Linear Algebra Appl 190 149-168
  • [6] Han J(2012)Accelerated multiplicative updates and hierarchical als algorithms for nonnegative matrix factorization Neural Comput. 24 1085-1105
  • [7] Huang TS(2015)Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization IEEE Trans. Geosci. Remote Sens. 53 2066-2078
  • [8] Chu MT(2007)Euclidean embedding of co-occurrence data J. Mach. Learn. Res. 8 2265-2295
  • [9] Lin MM(2000)On the convergence of the block nonlinear Gauss–Seidel method under convex constraints Oper. Res. Lett. 26 127-136
  • [10] Cichocki A(2013)Network-based stratification of tumor mutations Nat. Methods 10 1108-1115