Outlier Detection Algorithms Over Fuzzy Data with Weighted Least Squares

Cited by: 5
Authors
Nikolova, Natalia [1, 2]
Rodriguez, Rosa M. [3]
Symes, Mark [1]
Toneva, Daniela [4]
Kolev, Krasimir [5]
Tenekedjiev, Kiril [1, 2]
Affiliations
[1] Univ Tasmania, Australian Maritime Coll, 1 Maritime Way, Launceston, Tas 7250, Australia
[2] Nikola Vaptsarov Naval Acad Varna, Fac Engn, 73 Vasil Drumev St, Varna 9026, Bulgaria
[3] Univ Jaen, Campus Lagunillas S-N, Jaen 23071, Spain
[4] Tech Univ Varna, Fac Marine Sci & Ecol, 10 Studentska Str, Varna 9010, Bulgaria
[5] Semmelweis Univ, Dept Med Biochem, Ulloi Ut 26, H-1085 Budapest, Hungary
Keywords
Regression analysis; Leave-one-out method; Degree of membership; Multiple testing; Benjamini–Hochberg step-up multiple testing; False-discovery rate; LINEAR-REGRESSION ANALYSIS; FALSE DISCOVERY RATE; MODEL; INPUT; SELECTION; CORONARY; THROMBI; SETS
DOI
10.1007/s40815-020-01049-8
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In the classical leave-one-out procedure for outlier detection in regression analysis, we exclude an observation and then construct a model on the remaining data. If the difference between the predicted and observed values is large, we declare the observation an outlier. As a rule, such procedures rely on single comparison testing. The problem becomes much harder when each observation is associated with a given degree of membership to an underlying population, so that outlier detection must be generalized to operate over fuzzy data. We present a new approach for outlier detection over fuzzy data that uses two inter-related algorithms. Because of the way outliers enter the observation sample, they may be of various orders of magnitude. To account for this, we divide the outlier detection procedure into cycles, each consisting of two phases. In Phase 1, we apply a leave-one-out procedure to each non-outlier in the dataset. In Phase 2, all previously declared outliers are subjected to the Benjamini-Hochberg step-up multiple testing procedure, which controls the false-discovery rate, and the non-confirmed outliers can return to the dataset. Finally, we construct a regression model over the resulting set of non-outliers. In this way, Phase 1 yields a reliable, high-quality regression model, because the leave-one-out procedure purges dubious observations comparatively easily owing to the single comparison testing. At the same time, confirming outlier status against the newly obtained high-quality regression model is much harder because of the multiple testing procedure, so only the true outliers remain outside the data sample. The two phases in each cycle are a good trade-off between the desire to construct a high-quality model (i.e., over informative data points) and the desire to use as many data points as possible (thus leaving as many observations as possible in the data sample).
The number of cycles is user-defined, but the procedure can terminate early if a cycle detects no new outliers. We offer one illustrative example and two practical case studies (from real-life thrombosis studies) that demonstrate the application and strengths of our algorithms. In the concluding section, we discuss several limitations of our approach and offer directions for future research.
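
The Benjamini-Hochberg step-up procedure named in Phase 2 of the abstract can be illustrated with a minimal Python sketch. This is the generic textbook procedure, not the authors' implementation: the function name `benjamini_hochberg`, the parameter `q` (target false-discovery rate), and the p-values in the usage example are all hypothetical, and the record does not describe how the paper derives p-values from fuzzy data with weighted least squares.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Generic Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns a list of booleans, True where the null hypothesis is rejected.
    In the setting of the abstract, a rejection would correspond to a
    confirmed outlier (an assumption about how the test is framed).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max, including any
    # that missed its own threshold.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five previously declared outliers.
p = [0.001, 0.025, 0.028, 0.2, 0.5]
confirmed = benjamini_hochberg(p, q=0.05)
print(confirmed)  # [True, True, True, False, False]
```

In this made-up example the second p-value (0.025) exceeds its own threshold (0.02) but is still rejected, because the third p-value (0.028) passes its threshold (0.03). Under the scheme described in the abstract, the two unconfirmed observations would be the ones allowed to return to the dataset.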
Pages: 1234-1256
Number of pages: 23