Lasso-based variable selection methods in text regression: the case of short texts

Cited by: 2
Authors
Freo, Marzia [1]
Luati, Alessandra [2, 3]
Affiliations
[1] European Commiss, Joint Res Ctr JRC, Ispra, Italy
[2] Imperial Coll London, Dept Math, London, England
[3] Univ Bologna, Dept Stat, Bologna, Italy
Keywords
Text mining; Lasso; Variable screening; Stability selection; Latent Dirichlet allocation; Model selection; Regularization
DOI
10.1007/s10182-023-00472-0
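Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract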
Communication through websites is often characterised by short texts, made of few words, such as image captions or tweets. This paper explores the class of supervised learning methods for the analysis of short texts, as an alternative to unsupervised methods, widely employed to infer topics from structured texts. The aim is to assess the effectiveness of text data in social sciences, when they are used as explanatory variables in regression models. To this purpose, we compare different variable selection procedures when text regression models are fitted to real, short, text data. We discuss the results obtained by several variants of lasso, screening-based methods and randomisation-based models, such as sure independence screening and stability selection, in terms of number and importance of selected variables, assessed through goodness-of-fit measures, inclusion frequency and model class reliance. Latent Dirichlet allocation results are also considered as a term of comparison. Our perspective is primarily empirical and our starting point is the analysis of two real case studies, though bootstrap replications of each dataset are considered. The first case study aims at explaining price variations based on the information contained in the description of items on sale on e-commerce platforms. The second regards open questions in surveys on satisfaction ratings. The case studies are different in nature and representative of different kinds of short texts, as, in one case, a concise descriptive text is considered, whereas, in the other case, the text expresses an opinion.
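As a rough illustration of the kind of analysis the abstract describes, the sketch below fits a lasso text regression on a bag-of-words matrix and reports how often each word is selected across bootstrap replications. It is not the authors' procedure: the item descriptions and prices are invented toy data, the penalty `lam` is an arbitrary choice, and the helper names (`bag_of_words`, `lasso_cd`, `inclusion_frequency`) are hypothetical. The bootstrap inclusion frequency used here is a simplified relative of stability selection, not the full method.

```python
import random

def bag_of_words(texts):
    """Return the sorted vocabulary and a document-term count matrix."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for t in texts:
        row = [0.0] * len(vocab)
        for w in t.lower().split():
            row[index[w]] += 1.0
        rows.append(row)
    return vocab, rows

def soft_threshold(z, g):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - b0 - X b||^2 + lam * ||b||_1 by coordinate descent.

    The intercept b0 is not penalised.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    b0 = sum(y) / n
    resid = [y[i] - b0 for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Re-centre the residuals by updating the intercept.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # word absent from this sample
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return b0, beta

def inclusion_frequency(X, y, lam, n_boot=50, seed=0):
    """Share of bootstrap replications in which each coefficient is nonzero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        _, beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]

# Invented toy data: short item descriptions and made-up prices.
texts = [
    "vintage leather bag", "leather wallet", "plastic bag",
    "vintage gold watch", "plastic watch", "gold ring",
    "plastic ring", "leather watch strap",
]
y = [30.0, 30.0, 10.0, 60.0, 10.0, 60.0, 10.0, 30.0]

vocab, X = bag_of_words(texts)
freq = inclusion_frequency(X, y, lam=1.0, n_boot=50, seed=0)
for word, f in sorted(zip(vocab, freq), key=lambda pair: -pair[1]):
    print(f"{word:10s} {f:.2f}")
```

Words with a high inclusion frequency are those the lasso keeps selecting when the data are perturbed. In practice the penalty would be tuned, for example by cross-validation, and a dedicated library would replace the hand-written solver.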
Pages: 69-99
Page count: 31