On clustering levels of a hierarchical categorical risk factor

Cited by: 4
Authors
Campo, Bavo D. C. [1]
Antonio, Katrien [1, 2, 3, 4]
Affiliations
[1] Katholieke Univ Leuven, Fac Econ & Business, Leuven, Belgium
[2] Univ Amsterdam, Fac Econ & Business, Amsterdam, Netherlands
[3] Katholieke Univ Leuven, Leuven Res Ctr Insurance & Financial Risk Anal, LRisk, Leuven, Belgium
[4] Katholieke Univ Leuven, Leuven Stat Res Ctr, LStat, Leuven, Belgium
Keywords
Clustering; feature engineering; high-cardinality feature; multi-level factor; natural language processing; nested classification; text embeddings; NEGATIVE VARIANCE-COMPONENTS; SELF-ORGANIZING MAP; MODEL SELECTION; K-MEANS; WORKERS; VALIDATION; INJURIES
DOI
10.1017/S1748499523000283
Chinese Library Classification number
F8 [Public finance, finance]
Discipline classification code
0202
Abstract
Handling nominal covariates with a large number of categories is challenging for both statistical and machine learning techniques. This problem is further exacerbated when the nominal variable has a hierarchical structure. We commonly rely on methods such as the random effects approach to incorporate these covariates in a predictive model. Nonetheless, in certain situations, even the random effects approach may encounter estimation problems. We propose the data-driven Partitioning Hierarchical Risk-factors Adaptive Top-down algorithm to reduce the hierarchically structured risk factor to its essence, by grouping similar categories at each level of the hierarchy. We work top-down and engineer several features to characterize the profile of the categories at a specific level in the hierarchy. In our workers' compensation case study, we characterize the risk profile of an industry via its observed damage rates and claim frequencies. In addition, we use embeddings to encode the textual description of the economic activity of the insured company. These features are then used as input in a clustering algorithm to group similar categories. Our method substantially reduces the number of categories and results in a grouping that is generalizable to out-of-sample data. Moreover, we obtain a better differentiation between high-risk and low-risk companies.
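The top-down grouping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a two-level hierarchy (hypothetical field names `industry` and `branch`), uses the observed claim frequency as the only engineered feature (the abstract also mentions damage rates and text embeddings), assumes strictly positive exposure, and swaps in a tiny deterministic one-dimensional k-means as the clustering step. The toy records and cluster counts are invented for illustration.

```python
from collections import defaultdict

def claim_frequency(records, key):
    """Observed claim frequency per category: total claims / total exposure."""
    claims, exposure = defaultdict(float), defaultdict(float)
    for r in records:
        claims[r[key]] += r["claims"]
        exposure[r[key]] += r["exposure"]  # assumed strictly positive overall
    return {c: claims[c] / exposure[c] for c in claims}

def kmeans_1d(values, k, iters=50):
    """Tiny deterministic 1-D k-means; returns one cluster index per value."""
    k = min(k, len(set(values)))
    lo, hi = min(values), max(values)
    # Initial centers are evenly spaced between the smallest and largest value.
    centers = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def cluster_level(records, key, k):
    """Group the categories of one hierarchy level by their claim frequency."""
    freq = claim_frequency(records, key)
    cats = sorted(freq)
    return dict(zip(cats, kmeans_1d([freq[c] for c in cats], k)))

def top_down(records, k_top=2, k_sub=2):
    """Cluster the top level first, then cluster the lower level
    separately inside each top-level group."""
    top = cluster_level(records, "industry", k_top)
    sub = {}
    for g in sorted(set(top.values())):
        subset = [r for r in records if top[r["industry"]] == g]
        for branch, label in cluster_level(subset, "branch", k_sub).items():
            sub[branch] = (g, label)
    return top, sub

# Toy portfolio (invented numbers, purely illustrative).
records = [
    {"industry": "A", "branch": "A1", "claims": 2, "exposure": 100},
    {"industry": "A", "branch": "A2", "claims": 2, "exposure": 100},
    {"industry": "B", "branch": "B1", "claims": 8, "exposure": 100},
    {"industry": "B", "branch": "B2", "claims": 2, "exposure": 100},
    {"industry": "C", "branch": "C1", "claims": 30, "exposure": 100},
    {"industry": "C", "branch": "C2", "claims": 50, "exposure": 100},
]
top, sub = top_down(records)
```

Here `top` maps each industry to a group and `sub` maps each branch to a (parent group, subgroup) pair, so categories with similar observed risk collapse into the same level of a reduced hierarchy. In practice the feature vector would be multi-dimensional and the number of clusters would be tuned rather than fixed.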
Pages: 540-578
Page count: 39