Outlier detection for multinomial data with a large number of categories

Cited by: 1
Authors
Yang, Xiaona [1, 2]
Wang, Zhaojun [1, 2]
Zi, Xuemin [3]
Affiliations
[1] Nankai Univ, Sch Stat & Data Sci, LPMC, Tianjin 300071, Peoples R China
[2] Nankai Univ, KLMDASR, Tianjin 300071, Peoples R China
[3] Tianjin Univ Technol & Educ, Sch Sci, Tianjin, Peoples R China
Keywords
High-breakdown point; high dimension; multinomial data; outlier detection; reweighting; trimmed squares regression; multivariate; depth; chart
DOI
10.1142/S2010326320500082
Chinese Library Classification number
O4 [Physics]
Discipline classification code
0702
Abstract
This paper develops an outlier detection procedure for multinomial data when the number of categories tends to infinity. Most outlier detection methods assume that the observations follow a multivariate normal distribution, whereas in many modern applications the observations are either measured on a discrete scale or naturally have a categorical structure. For such multinomial observations, approaches to outlier detection are rather limited. To address this, the paper introduces a least trimmed distances estimator for multinomial data and a fast algorithm for identifying the clean subset. A threshold rule for identifying outliers is derived from the asymptotic distribution of the distance measure. Furthermore, a one-step reweighting scheme is proposed to improve the efficiency of the procedure. Finally, the finite-sample performance of the method is evaluated through simulations and compared with that of existing outlier detection methods.
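
The abstract does not state the form of the estimator, the distance, or the threshold, so the following is only a minimal sketch of the general trim, threshold, and reweight pattern it describes, not the authors' algorithm. It assumes squared Euclidean distance between observed category proportions and a mean proportion vector, random-start concentration steps to find the low-distance subset, and a median + k * MAD cutoff standing in for the asymptotic threshold. All function names and parameters (trimmed_fit, flag, detect_outliers, trim, k) are hypothetical.

```python
import random
from statistics import median


def proportions(counts):
    """Convert one vector of category counts to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def mean_vector(rows):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(x, center):
    """Squared Euclidean distance (an assumed stand-in for the paper's distance)."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def trimmed_fit(props, h, n_starts=20, max_iter=50, seed=0):
    """Search for an h-subset with small total distance via concentration steps."""
    rng = random.Random(seed)
    n = len(props)
    best_subset, best_obj = None, float("inf")
    for _ in range(n_starts):
        subset = sorted(rng.sample(range(n), h))
        for _ in range(max_iter):
            center = mean_vector([props[i] for i in subset])
            dists = [sq_dist(x, center) for x in props]
            # Keep the h observations closest to the current center.
            nearest = sorted(sorted(range(n), key=dists.__getitem__)[:h])
            if nearest == subset:
                break
            subset = nearest
        center = mean_vector([props[i] for i in subset])
        obj = sum(sq_dist(props[i], center) for i in subset)
        if obj < best_obj:
            best_subset, best_obj = subset, obj
    return best_subset


def flag(props, subset, k=5.0):
    """Flag points whose distance to the subset mean exceeds median + k * MAD."""
    center = mean_vector([props[i] for i in subset])
    dists = [sq_dist(x, center) for x in props]
    med = median(dists)
    mad = median([abs(d - med) for d in dists])
    return [d > med + k * mad for d in dists]


def detect_outliers(count_rows, trim=0.5, k=5.0):
    """Trimmed fit, heuristic threshold, then one reweighting step."""
    props = [proportions(r) for r in count_rows]
    n = len(props)
    h = min(n, max(2, int(n * (1 - trim)) + 1))
    flags = flag(props, trimmed_fit(props, h), k)
    # One-step reweighting: refit on all unflagged points, then flag again.
    clean = [i for i, is_out in enumerate(flags) if not is_out]
    return flag(props, clean, k)


def draw_counts(rng, categories, n_draws, n_categories):
    """Simulate one multinomial count vector, uniform over the given categories."""
    counts = [0] * n_categories
    for c in rng.choices(categories, k=n_draws):
        counts[c] += 1
    return counts


# Toy data: 25 rows spread over 50 categories, 5 rows concentrated on 5 categories.
rng = random.Random(1)
data = [draw_counts(rng, range(50), 500, 50) for _ in range(25)]
data += [draw_counts(rng, range(5), 500, 50) for _ in range(5)]
flags = detect_outliers(data)
```

The reweighting step trades some robustness for efficiency: the initial fit uses only about half of the rows, while the refit uses every row that passed the first cutoff.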
Pages: 17