Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2,3]
Affiliations
[1] European Commiss, Joint Res Ctr JRC, Ispra, Italy
[2] Imperial Coll London, Dept Math, London, England
[3] Univ Bologna, Dept Stat, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; Model selection; Regularization
DOI
10.1007/s10182-023-00472-0
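Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract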
Communication through websites is often characterised by short texts, made of few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this purpose, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts, as, in one case, a concise descriptive text is considered, whereas, in the other case, the text expresses an opinion.
Pages: 69-99
Number of pages: 31