Semi-Supervised Feature Selection Algorithm for Open-World

Citations: 0
Authors
Wang, Feng [1]
Wu, Wen-Qiang [1]
Liang, Ji-Ye [1]
Affiliation
[1] School of Computer and Information Technology, Shanxi University, Taiyuan
Source
Jisuanji Xuebao/Chinese Journal of Computers | 2025 / Vol. 48 / No. 06
Keywords
coupled learning; feature selection; open-world learning; pairwise similarity; semi-supervised learning
DOI
10.11897/SP.J.1016.2025.01273
Abstract
Existing semi-supervised learning methods typically operate under the closed-world assumption: category information remains static throughout learning, so the labeled training data cover all categories. This assumption is often hard to satisfy in practice, because unlabeled data frequently contain many samples from unknown classes. In recent years, researchers have therefore pursued a demanding research direction: extending semi-supervised learning so that it not only identifies unlabeled samples from known classes accurately but also discovers and learns new, previously unknown classes, yielding a semi-supervised learning framework for open-world scenarios.

To address this challenge, this paper introduces OpenSSFS, a semi-supervised feature selection algorithm for open-world scenarios on categorical data. The algorithm integrates coupled learning into the similarity measurement of categorical samples and into the relevance analysis of class relationships, establishing a novel similarity metric and a new class correlation metric. Based on these metrics, it constructs three core modules in sequence: an adaptive pseudo-label generation algorithm for unlabeled known-class data; granulation and novel-class discovery within unlabeled data of unknown categories; and a feature selection algorithm based on class relevance analysis.

For a given open-world dataset, the first step computes the feature selection results for the known-class samples, assigns pseudo-labels to the unlabeled known-class samples using the pseudo-label generation algorithm, and updates the feature selection results using all known-class samples. The second step identifies new classes among the unlabeled unknown-class samples and computes the feature selection results based on these new classes. Finally, the effective feature subsets from the known-class and unknown-class samples are integrated to determine the final feature selection result.

To validate the effectiveness of the proposed algorithm, the experimental analysis simulates an open-world data environment. The algorithm is tested and evaluated on the same dataset with varying proportions of known and unknown classes and different ratios of labeled and unlabeled samples. The experimental results indicate that OpenSSFS achieves excellent classification performance across these scenarios. First, on a dataset comprising 50% known classes and 50% unknown classes, with 50% labeled samples, the algorithm achieves a classification accuracy improvement of up to nearly 70%, significantly outperforming the compared algorithms. Second, as the proportion of labeled samples is reduced from 90% to 10%, the algorithm continues to surpass the other algorithms and remains stable without significant deterioration, demonstrating considerable robustness. Finally, the experiments on different proportions of known and unknown classes show that, even with only a few known classes, the algorithm still performs well and handles more open task scenarios effectively. The experimental analysis also examines and discusses the parameter threshold values set within the algorithm. © 2025 Science Press. All rights reserved.
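The simulated open-world environment described in the abstract (a chosen proportion of known classes, and a chosen ratio of labeled samples within them) could be set up roughly as follows. This is a minimal sketch and not the authors' code: the function name `open_world_split`, its parameters, and the per-class rounding rule are all assumptions introduced for illustration, and the abstract does not specify how the splits were actually drawn.

```python
import random
from collections import defaultdict


def open_world_split(labels, known_class_ratio=0.5, labeled_ratio=0.5, seed=0):
    """Hypothetical helper: partition sample indices for an open-world experiment.

    Returns (labeled, unlabeled_known, unlabeled_unknown, known_classes).
    Samples of unknown classes are always unlabeled.
    """
    rng = random.Random(seed)

    # Choose which classes are treated as known (assumption: random choice).
    classes = sorted(set(labels))
    rng.shuffle(classes)
    n_known = max(1, round(len(classes) * known_class_ratio))
    known_classes = set(classes[:n_known])

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    labeled, unlabeled_known, unlabeled_unknown = [], [], []
    for y, idx in by_class.items():
        if y not in known_classes:
            # Unknown-class samples never expose their labels.
            unlabeled_unknown.extend(idx)
            continue
        # Within a known class, keep labels for a fraction of the samples.
        rng.shuffle(idx)
        n_lab = max(1, round(len(idx) * labeled_ratio))
        labeled.extend(idx[:n_lab])
        unlabeled_known.extend(idx[n_lab:])

    return sorted(labeled), sorted(unlabeled_known), sorted(unlabeled_unknown), known_classes


# Example: 4 classes of 10 samples, 50% known classes, 50% labeled.
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10
labeled, unl_known, unl_unknown, known = open_world_split(labels, 0.5, 0.5, seed=0)
print(len(labeled), len(unl_known), len(unl_unknown))
```

Varying `known_class_ratio` and `labeled_ratio` (for example, sweeping `labeled_ratio` from 0.9 down to 0.1) would reproduce the kind of grid of scenarios the abstract mentions, under the assumptions stated above.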
Pages: 1273-1289
Page count: 16