A novel minorization-maximization framework for simultaneous feature selection and clustering of high-dimensional count data

Cited: 1
Authors
Zamzami, Nuha [1 ]
Bouguila, Nizar [2 ]
Affiliations
[1] Univ Jeddah, Coll Comp Sci & Engn, Dept Comp Sci & Artificial Intelligence, Jeddah, Saudi Arabia
[2] Concordia Univ, Concordia Inst Informat Syst Engn CIISE, Montreal, PQ, Canada
Keywords
Feature saliency; Feature selection; Model selection; Unsupervised learning; Count data; Mixture models; Generalized Dirichlet multinomial; Maximum likelihood; Minorization-maximization; UNSUPERVISED FEATURE-SELECTION; DISCRIMINANT-ANALYSIS; MAXIMUM-LIKELIHOOD; MODEL SELECTION; ALGORITHM; CLASSIFICATION; MIXTURES;
DOI
10.1007/s10044-022-01094-z
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Count data are commonly exploited in machine learning and computer vision applications; however, they often suffer from the well-known curse of dimensionality, which dramatically degrades the performance of clustering algorithms. Feature selection is a major technique for handling large numbers of features, most of which are often redundant and noisy. In this paper, we propose a probabilistic approach for count data based on the concept of feature saliency in the context of mixture-based clustering using the generalized Dirichlet multinomial distribution. The saliency of irrelevant features is driven toward zero by minimizing the message length, which amounts to performing feature and model selection simultaneously. Using a range of challenging applications including text and image clustering, we demonstrate that the proposed approach is effective in identifying both the optimal number of clusters and the most important features, thereby enhancing clustering performance significantly.
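The saliency mechanism the abstract describes can be illustrated with a simplified sketch. The code below is not the paper's method: it implements a generic feature-saliency mixture for count data with per-feature Poisson branches (in the spirit of Law, Figueiredo, and Jain's feature-saliency EM), rather than the generalized Dirichlet multinomial model with MML-based minorization-maximization updates. All names (`saliency_em`, `rho`, `theta`, `lam`) are illustrative; each feature carries a saliency `rho[j]` weighting a cluster-specific model against a common "irrelevant" model, and EM drives the saliency of uninformative features down.

```python
import numpy as np
from scipy.special import gammaln

def log_poisson(x, lam):
    """Elementwise log of the Poisson pmf."""
    return x * np.log(lam) - lam - gammaln(x + 1)

def saliency_em(X, K, n_iter=100, seed=0):
    """EM for a K-component mixture with per-feature saliencies rho.

    Relevant features follow component-specific Poisson rates theta[k, j];
    irrelevant features follow one common rate lam[j] shared by all clusters.
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    theta = X.mean(axis=0) * rng.uniform(0.5, 1.5, size=(K, D)) + 1e-3
    lam = X.mean(axis=0) + 1e-3        # common "irrelevant-feature" model
    rho = np.full(D, 0.5)              # feature saliencies, kept in (0, 1)
    for _ in range(n_iter):
        # E-step: log-likelihood of each feature under the relevant (la)
        # and irrelevant (lb) branch, weighted by the current saliencies.
        la = np.log(rho) + log_poisson(X[:, None, :], theta[None])   # (N,K,D)
        lb = np.log1p(-rho) + log_poisson(X, lam)                    # (N,D)
        lmix = np.logaddexp(la, lb[:, None, :])
        lr = np.log(pi + 1e-12) + lmix.sum(axis=2)                   # (N,K)
        lr -= lr.max(axis=1, keepdims=True)
        r = np.exp(lr)
        r /= r.sum(axis=1, keepdims=True)          # cluster responsibilities
        u = r[:, :, None] * np.exp(la - lmix)      # P(cluster k, feature relevant)
        # M-step: closed-form updates for weights, rates, and saliencies.
        pi = r.mean(axis=0)
        theta = np.maximum(
            (u * X[:, None, :]).sum(axis=0) / (u.sum(axis=0) + 1e-12), 1e-6)
        v = r[:, :, None] - u                      # P(cluster k, feature irrelevant)
        lam = np.maximum(
            (v.sum(axis=1) * X).sum(axis=0) / (v.sum(axis=(0, 1)) + 1e-12), 1e-6)
        rho = np.clip(u.sum(axis=(0, 1)) / N, 1e-6, 1 - 1e-6)
    return pi, theta, lam, rho, r
```

On synthetic counts where only a few features differ across clusters, the learned `rho` separates informative from noise features while `r.argmax(axis=1)` recovers the partition; the paper's contribution replaces the Poisson branches with the generalized Dirichlet multinomial and uses message-length minimization to prune both features and components.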
Pages: 91-106
Number of pages: 16