Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2, 3]
Affiliations
[1] European Commiss, Joint Res Ctr JRC, Ispra, Italy
[2] Imperial Coll London, Dept Math, London, England
[3] Univ Bologna, Dept Stat, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; Model selection; Regularization
DOI
10.1007/s10182-023-00472-0
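Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract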
Communication through websites is often characterised by short texts, made up of a few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, which are widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this end, we compare different variable selection procedures when text regression models are fitted to real short-text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of the number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a point of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are also considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second concerns open-ended questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts: in one case a concise descriptive text is considered, whereas in the other the text expresses an opinion.
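As a rough illustration of the kind of procedure the abstract describes, the sketch below fits a lasso text regression to binary word indicators and estimates inclusion frequencies by refitting on random half-samples, in the spirit of stability selection. It is not the authors' implementation: the function names, the item descriptions, the prices, the penalty value and the half-sampling scheme are all assumptions made up for illustration.

```python
import random

def bag_of_words(texts):
    """Return (vocabulary, binary document-term matrix as a list of rows)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = [[0.0] * len(vocab) for _ in texts]
    for i, t in enumerate(texts):
        for w in t.lower().split():
            X[i][index[w]] = 1.0
    return vocab, X

def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + lam*||b||_1.

    The intercept b0 is not penalised. Returns (b0, beta).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [y[i] - b0 for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        shift = sum(resid) / n          # re-centre residuals via the intercept
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:        # word absent from this sample
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return b0, beta

def inclusion_frequencies(X, y, lam, n_rounds=50, seed=0):
    """Share of random half-samples in which each word gets a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_rounds):
        idx = rng.sample(range(n), n // 2)
        _, beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_rounds for c in counts]

# Hypothetical item descriptions and prices, invented for this sketch.
texts = [
    "vintage leather bag", "leather wallet brown", "small plastic toy",
    "used plastic cup", "leather jacket black", "cotton shirt blue",
    "vintage leather boots", "cotton socks white", "plastic bottle green",
    "leather belt brown", "blue cotton scarf", "leather gloves black",
]
prices = [90.0, 45.0, 8.0, 3.0, 120.0, 15.0, 110.0, 6.0, 4.0, 35.0, 12.0, 50.0]

vocab, X = bag_of_words(texts)
freqs = inclusion_frequencies(X, prices, lam=2.0)
for word, freq in sorted(zip(vocab, freqs), key=lambda wf: -wf[1])[:5]:
    print(word, freq)
```

Words that are selected in most half-samples are the ones a stability-based rule would retain. With real data, the penalty and the frequency threshold would need to be tuned, and a tested library implementation of the lasso would normally replace the hand-written solver.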
Pages: 69-99
Page count: 31