Feature Selection in the Data Stream Based on Incremental Markov Boundary Learning

Cited by: 21
Authors
Wu, Xingyu [1 ]
Jiang, Bingbing [2 ]
Wang, Xiangyu [3 ]
Ban, Taiyu [1 ]
Chen, Huanhuan [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei 230027, Peoples R China
[2] Hangzhou Normal Univ, Sch Informat Sci & Engn, Hangzhou 311121, Peoples R China
[3] Univ Sci & Technol China, Sch Data Sci, Hefei 230027, Peoples R China
Keywords
Feature extraction; Markov processes; Reliability; Real-time systems; Monitoring; Data mining; Training data; Distribution shift; feature selection; Markov blanket; Markov boundary (MB); prior knowledge; streaming data; CLASSIFICATION; DISCOVERY; BLANKETS; DRIFT;
DOI
10.1109/TNNLS.2023.3249767
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent years have witnessed a proliferation of streaming data mining techniques to meet the demands of many real-time systems, in which high-dimensional streaming data are generated at high speed, increasing the burden on both hardware and software. Several feature selection algorithms for streaming data have been proposed to tackle this issue. However, these algorithms do not consider the distribution shift that arises in nonstationary scenarios, which leads to performance degradation when the underlying distribution changes in the data stream. To solve this problem, this article investigates feature selection in streaming data through incremental Markov boundary (MB) learning and proposes a novel algorithm. Unlike existing algorithms, which focus on prediction performance on off-line data, the proposed algorithm learns the MB by analyzing conditional dependence/independence in the data, which uncovers the underlying mechanism and is naturally more robust against distribution shift. To learn the MB in a data stream, the proposal transforms the information learned in previous data blocks into prior knowledge and employs it to assist MB discovery in the current data block, while monitoring the likelihood of distribution shift and the reliability of the conditional independence tests to avoid the negative impact of invalid prior information. Extensive experiments on synthetic and real-world datasets demonstrate the superiority of the proposed algorithm.
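The abstract's core idea, learning a Markov boundary from conditional independence tests, can be illustrated with a minimal sketch. This is not the article's incremental algorithm: it is a generic batch grow-and-shrink search on a single data block, with no prior knowledge and no distribution-shift monitoring. It assumes roughly linear-Gaussian data so that a Fisher z-test on partial correlations can serve as the independence test. All function names, the `alpha` threshold, and the toy data generator `make_chain_data` are hypothetical and chosen only for illustration.

```python
import math
import numpy as np


def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.pinv(corr)  # precision matrix of the selected columns
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])


def ci_pvalue(data, i, j, cond):
    """Two-sided p-value of a Fisher z-test for X_i independent of X_j given cond."""
    n = data.shape[0]
    r = partial_corr(data, i, j, cond)
    r = min(max(r, -0.999999), 0.999999)  # keep the log below finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))  # equals 2 * (1 - Phi(stat))


def grow_shrink_mb(data, target, alpha=0.01):
    """Illustrative grow-and-shrink Markov boundary search for one data block."""
    n_vars = data.shape[1]
    mb = []
    changed = True
    while changed:  # grow: add variables still dependent on the target given mb
        changed = False
        for v in range(n_vars):
            if v != target and v not in mb and ci_pvalue(data, target, v, mb) < alpha:
                mb.append(v)
                changed = True
    for v in list(mb):  # shrink: drop variables made independent by the rest
        rest = [u for u in mb if u != v]
        if ci_pvalue(data, target, v, rest) >= alpha:
            mb.remove(v)
    return sorted(mb)


def make_chain_data(n, seed=0):
    """Hypothetical toy data: A -> B -> T -> C plus an unrelated column D."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)
    b = a + rng.normal(size=n)
    t = b + rng.normal(size=n)
    c = t + rng.normal(size=n)
    d = rng.normal(size=n)
    return np.column_stack([a, b, t, c, d])


data = make_chain_data(2000, seed=0)
print(grow_shrink_mb(data, target=2, alpha=1e-4))
```

In this toy chain the true Markov boundary of the target (column 2) consists of its parent and its child (columns 1 and 3); column 0 is correlated with the target but becomes conditionally independent once column 1 is known, which is what the shrink phase is meant to detect. Because the test is statistical, the result on finite samples depends on the sample size and on `alpha`.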
Pages: 6740-6754
Number of pages: 15