Investigating response behavior through TF-IDF and Word2vec text analysis: A case study of PISA 2012 problem-solving process data

Cited by: 1

Authors:

Zhou, Jing [1]

Ye, Zhanliang [1]

Zhang, Sheng [1]

Geng, Zhao [1]

Han, Ning [1]

Yang, Tao [1]

Affiliations:

[1] Beijing Normal Univ, Collaborat Innovat Ctr Assessment Basic Educ Qual, 19 XinJieKouWai St, Beijing 100875, Peoples R China

Source:

HELIYON | 2024 / Vol. 10 / Issue 16

Keywords:

Problem-solving; Process data; Feature extraction; TF-IDF; Word2vec; Machine learning; COMPUTER-BASED ASSESSMENT;

DOI:

10.1016/j.heliyon.2024.e35945

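Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes: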

07 ; 0710 ; 09 ;

Abstract:

The process data in computer-based problem-solving evaluation is rich in valuable implicit information. However, its diverse and irregular structure poses challenges for effective feature extraction, leading to varying degrees of information loss in existing methods. Process-response behavior exhibits similarities to textual data in terms of the key units and contextual relationships. Despite the scarcity of relevant research, exploring text analysis methods for feature recognition in process data is significant. This study investigated the efficacy of Term Frequency-Inverse Document Frequency (TF-IDF) and Word to Vector (Word2vec) in extracting response behavior features and compared the predictive, analytical, and clustering effects of classical machine learning methods (supervised and unsupervised) on response behavior. An analysis of the PISA 2012 computer-based problem-solving dataset revealed that TF-IDF effectively extracted key response behaviors, whereas Word2vec captured effective features from sequenced response behaviors. In addition, in supervised machine learning using both methods, the random forest model based on TF-IDF performed the best, followed by the SVM model based on Word2vec. Word2vec-based models outperformed TF-IDF-based ones in the F1-score, accuracy, and recall (except for precision) across the logistic regression, k-nearest neighbor, and support vector machine algorithms. In unsupervised machine learning, the k-means algorithm effectively clustered different response behavior patterns extracted by these methods. The findings underscore the theoretical and methodological transferability of these text analysis methods in educational and psychological assessment contexts. This study offers valuable insights for research and practice in similar domains by yielding rich feature representations, supplementing fine-grained assessment evidence, fostering personalized learning, and introducing novel insights for educational assessment.
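To illustrate the idea of treating response behavior as text, the following minimal sketch computes TF-IDF weights over action sequences, with each student's log as a "document" and each logged action as a "word". It is an illustration only and not the authors' pipeline: the action names and student IDs are invented, and the unsmoothed formula tf × log(N / df) is one common textbook variant, assumed here because the record does not state which weighting the study used.

```python
import math
from collections import Counter

# Hypothetical action logs (invented for illustration, not PISA 2012 data).
# Each student's response process is one "document"; each action is one "word".
sequences = {
    "student_a": ["start", "click_map", "click_map", "select_route", "submit"],
    "student_b": ["start", "select_route", "reset", "select_route", "submit"],
    "student_c": ["start", "click_map", "submit"],
}


def tf_idf(docs):
    """Return {doc_id: {action: weight}} using tf * log(N / df), unsmoothed."""
    n_docs = len(docs)
    # Document frequency: the number of sequences in which each action occurs.
    df = Counter(action for actions in docs.values() for action in set(actions))
    weights = {}
    for doc_id, actions in docs.items():
        counts = Counter(actions)
        total = len(actions)
        weights[doc_id] = {
            action: (count / total) * math.log(n_docs / df[action])
            for action, count in counts.items()
        }
    return weights


weights = tf_idf(sequences)
for doc_id, doc_weights in weights.items():
    # The highest-weighted action is the one most distinctive of this sequence.
    top_action = max(doc_weights, key=doc_weights.get)
    print(doc_id, top_action, round(doc_weights[top_action], 4))
```

Actions that every student performs (here `start` and `submit`) receive a weight of zero, while a rare action such as `reset` is weighted highly, which is the sense in which TF-IDF surfaces key response behaviors. Sequence order is discarded in this representation; capturing it is the role the abstract assigns to Word2vec.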

Pages: 22
