Assessing machine learning and data imputation approaches to handle the issue of data sparsity in sports forecasting

Cited by: 1

Authors:

Wunderlich, Fabian [1]

Biermann, Henrik [1]

Yang, Weiran [1,2]

Bassek, Manuel [1]

Raabe, Dominik [3]

Elbert, Nico [4]

Memmert, Daniel [1]

Garnica Caparros, Marc [1]

Affiliations:

[1] German Sport Univ Cologne, Inst Exercise Training & Sport Informat, Sportpark Mungersdorf 6, D-50933 Cologne, Germany

[2] Rhein Westfal TH Aachen, Dept Comp Sci, Templergraben 55, D-52056 Aachen, Germany

[3] Raabe Analyt, Aachener Str 1041, D-50858 Cologne, Germany

[4] Julius Maximilian Univ Wurzburg, Fac Business Management & Econ, Sanderring 2, D-97070 Wurzburg, BY, Germany

Source:

MACHINE LEARNING | 2025 / Vol. 114 / Issue 02

Keywords:

Soccer; Prediction; Deep learning; LSTM; Elo rating; Data sparsity; HOME ADVANTAGE; FOOTBALL; LEAGUES; MATCHES; TEAMS; MODEL; ODDS;

DOI:

10.1007/s10994-024-06651-7

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Sparsity is a common characteristic of datasets used in the domain of sports forecasting, mainly arising from inconsistencies in data coverage. Typically, this issue is circumvented by cutting the number of features (depth-focused) or the sample size (breadth-focused) for analysis. The present study uses an experimental approach to analyse the effects of depth- or breadth-focused analyses and of data imputation, which enables use of the full sample size and feature wealth. Two forecasting models, one following a hybrid approach (i.e., a combination of classical statistical and machine learning methods) and one following a full deep learning approach, are introduced to perform experiments on a dataset of more than 300,000 soccer matches. In contrast to typical soccer forecasting studies, the analysis was not restricted to one-match-ahead forecasts but used a longer forecasting horizon of up to two months ahead. Systematic differences between the two types of models were identified. The hybrid model, based on classical statistical rating models, performs strongly in depth-focused approaches but improves only marginally or not at all in approaches with high data breadth. The deep learning model, however, performs weakly in a depth-focused approach but profits strongly from data breadth. The improved prediction performance in cases of high data breadth suggests that a rich feature set offers better training opportunities than a less detailed set with a larger sample size. Additionally, we show that data imputation can be used to address data sparsity by enabling full data depth and breadth. The presented findings are relevant for advancing predictive accuracy and sports forecasting methodologies, emphasizing the viability of imputation techniques to increase data coverage in different analytical approaches.
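The three ways of handling sparsity named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy match records, the feature names (`goals`, `shots`, `xg`), the helper functions, and the use of simple column-mean imputation are all assumptions made for illustration, since the abstract does not state which imputation technique was used.

```python
from statistics import mean

# Hypothetical toy match records; None marks a feature that was not covered.
matches = [
    {"goals": 2, "shots": 11, "xg": 1.4},
    {"goals": 0, "shots": 7, "xg": None},
    {"goals": 1, "shots": None, "xg": None},
    {"goals": 3, "shots": 15, "xg": 2.1},
]
features = ["goals", "shots", "xg"]


def depth_focused(rows, cols):
    """Keep every match, but only the features observed for all matches."""
    kept = [c for c in cols if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in kept} for r in rows]


def breadth_focused(rows, cols):
    """Keep every feature, but only the matches with no missing values."""
    return [dict(r) for r in rows if all(r[c] is not None for c in cols)]


def mean_impute(rows, cols):
    """Keep all matches and all features; fill gaps with the column mean.

    Assumes every column has at least one observed value.
    """
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in cols}
        for r in rows
    ]


depth = depth_focused(matches, features)      # 4 matches, 1 feature
breadth = breadth_focused(matches, features)  # 2 matches, 3 features
imputed = mean_impute(matches, features)      # 4 matches, 3 features
```

On this toy data the depth-focused subset keeps all four matches but only `goals`, the breadth-focused subset keeps all three features but only two matches, and imputation retains both the full sample and the full feature set at the cost of filled-in values.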


Pages: 28
