Article Quality Classification on Wikipedia: Introducing Document Embeddings and Content Features

Cited by: 6
Authors
Schmidt, Manuel [1]
Zangerle, Eva [1]
Affiliations
[1] Databases & Informat Syst, Dept Comp Sci, Cologne, Germany
Source
PROCEEDINGS OF THE 15TH INTERNATIONAL SYMPOSIUM ON OPEN COLLABORATION (OPENSYM) | 2019
Keywords
Wikipedia; Collaborative Information Systems; Information Quality; Classification; Gradient Boosted Trees
DOI
10.1145/3306446.3340831
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The quality of articles on the Wikipedia platform is vital for its success. Currently, quality is assessed manually by the Wikipedia community, with editors classifying articles into predefined quality classes. However, this approach scales poorly, and hence approaches for automatic classification have been investigated. In this paper, we continue this line of research on article quality classification by extending the feature set with novel content and edit features (e.g., document embeddings of articles). We propose a classification approach that uses gradient boosted trees on this extended set of features extracted from Wikipedia articles. Using an established dataset of Wikipedia articles and their quality classes, we show that our approach substantially outperforms previous approaches, including recent deep learning methods. Furthermore, we shed light on the contribution of individual features and show that the proposed features indeed capture the quality of an article well.
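As a rough illustration of the general idea of boosting trees over article features, the sketch below fits a tiny gradient-boosted ensemble of one-split regression trees (stumps) with squared loss. It is not the authors' implementation: the feature columns (article length, reference count, one embedding component), the toy data, the binary low/high labels, and all function names are assumptions made for this example, and the real task uses several quality classes and a far richer feature set.

```python
from typing import List, Tuple

# A stump is (feature index, threshold, value if <= threshold, value if > threshold).
Stump = Tuple[int, float, float, float]
Model = Tuple[float, float, List[Stump]]  # (base prediction, learning rate, stumps)


def fit_stump(X: List[List[float]], residuals: List[float]) -> Stump:
    """Find the single split that minimises squared error on the residuals."""
    mean = sum(residuals) / len(residuals)
    best: Stump = (0, float("inf"), mean, mean)  # fallback: no useful split
    best_err = float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, t, lv, rv)
    return best


def stump_predict(stump: Stump, row: List[float]) -> float:
    j, t, lv, rv = stump
    return lv if row[j] <= t else rv


def fit_gbt(X: List[List[float]], y: List[float],
            n_rounds: int = 20, learning_rate: float = 0.3) -> Model:
    """Each round fits a stump to the current residuals (the negative gradient
    of squared loss) and adds a shrunken copy of it to the ensemble."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def classify(model: Model, row: List[float]) -> int:
    """Return 1 (high quality) if the ensemble score exceeds 0.5, else 0."""
    base, learning_rate, stumps = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return 1 if score > 0.5 else 0


# Hypothetical toy features: [length in thousands of characters,
# number of references, one component of a document embedding].
X = [[1.2, 0, 0.1], [2.0, 1, -0.3], [3.5, 2, 0.2],
     [30.0, 45, 0.8], [42.0, 80, 0.6], [25.0, 30, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # 0 = low quality, 1 = high quality (illustrative labels)

model = fit_gbt(X, y)
```

Calling `classify(model, [28.0, 40, 0.7])` then scores a new feature vector with the fitted ensemble. A production setup would more plausibly rely on an established gradient boosting library and a multi-class objective than on hand-written stumps.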
Pages: 8