A Novel Class Noise Detection Method for High-Dimensional Data in Industrial Informatics

Cited by: 20
Authors
Guan, Donghai [1, 2]
Chen, Kai [1, 2]
Han, Guangjie [3]
Huang, Shuqiang [4]
Yuan, Weiwei [1, 2]
Guizani, Mohsen [5]
Shu, Lei [6]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210016, Peoples R China
[2] Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210093, Peoples R China
[3] Dalian Univ Technol, Sch Software, Key Lab Ubiquitous Network & Serv Software Liaoni, Dalian 116024, Peoples R China
[4] Jinan Univ, Coll Sci & Engn, Dept Optoelect Engn, Guangzhou 510632, Peoples R China
[5] Qatar Univ, Coll Engn, Doha 2713, Qatar
[6] Nanjing Agr Univ, Coll Engn, Nanjing 210095, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Informatics; Training; Machine learning; Task analysis; Noise measurement; Reliability; High dimension; Industrial informatics; Noise filtering; Outlier detection; Classification; Selection; Quality
DOI
10.1109/TII.2020.3012658
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Data in industrial informatics may be both high-dimensional and mislabeled. Irrelevant or noisy features pose a significant challenge to detecting mislabeled instances in high-dimensional data. Traditional methods usually adopt a two-step solution: first find the relevant subspace, then use it for mislabeling detection. Because this approach separates feature selection from label error detection, it struggles to achieve optimal mislabeling detection performance. To solve this problem, in this article, we integrate the two steps and propose a sequential ensemble noise filter (SENF). In the SENF, relevant features are selected and used to generate a noise score for each instance. In turn, these noise scores guide feature selection in the regression learning. Thus, the SENF falls within the scope of sequential ensemble learning. We evaluate our approach on several benchmark datasets with high dimensionality and substantial label noise. The results show that the SENF is significantly better than existing label noise detection methods.
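The alternating loop described in the abstract (score instances on the currently selected features, then let the scores guide the next feature selection) can be illustrated with a minimal sketch. This is not the authors' SENF implementation: the abstract does not specify the scoring or regression procedure, so the sketch assumes a k-nearest-neighbor label-disagreement score and a noise-weighted correlation between each feature and the labels as simple stand-ins. All function names, parameters, and the toy data are hypothetical.

```python
import math


def knn_noise_scores(X, y, features, k):
    """Assumed stand-in scorer: fraction of the k nearest neighbors
    (measured on the selected features only) whose label disagrees."""
    scores = []
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((xi[f] - xj[f]) ** 2 for f in features), j)
            for j, xj in enumerate(X) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        scores.append(sum(1 for j in neighbors if y[j] != y[i]) / k)
    return scores


def weighted_corr(a, b, w):
    """Weighted Pearson correlation; returns 0.0 for degenerate inputs."""
    sw = sum(w)
    if sw == 0:
        return 0.0
    ma = sum(wi * ai for wi, ai in zip(w, a)) / sw
    mb = sum(wi * bi for wi, bi in zip(w, b)) / sw
    cov = sum(wi * (ai - ma) * (bi - mb) for wi, ai, bi in zip(w, a, b))
    va = sum(wi * (ai - ma) ** 2 for wi, ai in zip(w, a))
    vb = sum(wi * (bi - mb) ** 2 for wi, bi in zip(w, b))
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def sequential_noise_filter(X, y, k=3, n_keep=1, rounds=2):
    """Alternate between noise scoring and score-guided feature selection.
    Returns the round-averaged noise scores and the final feature subset."""
    n_features = len(X[0])
    selected = list(range(n_features))  # start from all features
    total = [0.0] * len(X)
    for _ in range(rounds):
        scores = knn_noise_scores(X, y, selected, k)
        total = [t + s for t, s in zip(total, scores)]
        # Assumption: down-weight suspicious instances when ranking features.
        weights = [1.0 - s for s in scores]
        relevance = [
            abs(weighted_corr([row[f] for row in X], y, weights))
            for f in range(n_features)
        ]
        ranked = sorted(range(n_features), key=lambda f: relevance[f], reverse=True)
        selected = ranked[:n_keep]
    return [t / rounds for t in total], sorted(selected)


# Toy data: feature 0 separates the classes, features 1 and 2 are irrelevant.
X = [[0.1 * i if i < 6 else 10 + 0.1 * (i - 6), (i * 7) % 5, (i * 3) % 4]
     for i in range(12)]
y = [0] * 6 + [1] * 6
y[2] = 1  # inject one mislabeled instance

scores, selected = sequential_noise_filter(X, y)
suspect = max(range(len(scores)), key=lambda i: scores[i])
print("selected features:", selected, "most suspicious instance:", suspect)
```

In this toy setting the loop keeps the informative feature and ranks the injected mislabeled instance as most suspicious. The averaging of scores across rounds is what makes the sketch an ensemble in the sequential sense; a closer reproduction would need the regression step and ensemble rule from the article itself.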
Pages: 2181-2190
Number of pages: 10