WHAD: Wikipedia historical attributes data: Historical structured data extraction and vandalism detection from the Wikipedia edit history

Cited by: 13
Authors
Alfonseca, Enrique [1]
Garrido, Guillermo [2]
Delort, Jean-Yves [1]
Penas, Anselmo [2]
Affiliations
[1] Google Res Zurich, Zurich, Switzerland
[2] UNED, NLP & IR Grp, Madrid, Spain
Keywords
Wikipedia; Infobox; Attributes; Temporal data
DOI
10.1007/s10579-013-9232-5
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
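As a rough illustration of what mining (attribute, value) pairs from a revision history can look like, the following minimal sketch parses a flat infobox from each revision and records the points at which an attribute's value changes. It is not the authors' pipeline: the function names `parse_infobox` and `attribute_timeline`, the one-pair-per-line regular expression, and the `(timestamp, wikitext)` input format are all assumptions made for this example, and real infoboxes (nested templates, multi-line values) need a proper wikitext parser.

```python
import re
from typing import Dict, List, Tuple

# Assumed, simplified layout: one "| name = value" pair per line.
ATTR_LINE = re.compile(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*?)\s*$")


def parse_infobox(wikitext: str) -> Dict[str, str]:
    """Return the non-empty (attribute, value) pairs of the first flat infobox."""
    pairs: Dict[str, str] = {}
    inside = False
    for line in wikitext.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("{{infobox"):
            inside = True
            continue
        if inside and stripped == "}}":
            break
        if inside:
            match = ATTR_LINE.match(line)
            if match and match.group(2):
                pairs[match.group(1)] = match.group(2)
    return pairs


def attribute_timeline(
    revisions: List[Tuple[str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each attribute to its (timestamp, value) change points.

    `revisions` is a list of (timestamp, wikitext) tuples, oldest first
    (a hypothetical input format chosen for this sketch).
    """
    timeline: Dict[str, List[Tuple[str, str]]] = {}
    for timestamp, text in revisions:
        for attr, value in parse_infobox(text).items():
            history = timeline.setdefault(attr, [])
            # Record a new entry only when the value differs from the last one.
            if not history or history[-1][1] != value:
                history.append((timestamp, value))
    return timeline
```

This sketch does no vandalism filtering: a vandalic edit followed by a revert would appear as two spurious change points, which is the data-quality concern the abstract raises.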
Pages: 1163-1190
Page count: 28