Computer Based Stylometric Analysis of Texts in Polish Language

Cited by: 4
Authors
Baj, Maciej [1]
Walkowiak, Tomasz [1]
Affiliations
[1] Wroclaw Univ Sci & Technol, Fac Elect, Wybrzeze Wyspianskiego 27, PL-50370 Wroclaw, Poland
Source
ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, ICAISC 2017, PT II | 2017 / Vol. 10246
Keywords
Stylometric; Polish; Text analysis; Classification; Machine learning;
DOI
10.1007/978-3-319-59060-8_1
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper compares stylometric methods for recognizing authorship, author gender, and literary period in Polish-language texts. Several feature selection and classification methods were analyzed. The feature sets include common-word frequencies (the most common words, the rarest words, and all words), grammatical class frequencies, and simple statistics of selected characters, words, and sentences. Because Polish is a highly inflected language, the common-word features are calculated as frequencies of lexemes obtained with a morpho-syntactic tagger for Polish. Nine classifiers were analyzed. The authors tested the proposed methods on a set of Polish novels, performing recognition both on whole novels and on chunked texts. The experiments showed that features based on all words give the best results. For ill-defined problems (those with low recognition accuracy), the random forest classifier performed best. For tasks with medium or high recognition accuracy, the multilayer perceptron and linear regression trained by stochastic gradient descent performed best. The paper also includes an analysis of the statistical importance of the features used.
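The chunking and common-word frequency features mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the chunk size, and the tie-breaking rule are assumptions made for illustration, and the input tokens are assumed to be lexemes already produced by a morpho-syntactic tagger (no tagging is done here).

```python
from collections import Counter


def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of exactly chunk_size tokens.

    A trailing partial chunk is dropped (an assumption of this sketch).
    """
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]


def top_words(chunks, n):
    """Return the n most frequent tokens over all chunks.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(tok for chunk in chunks for tok in chunk)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def frequency_vector(chunk, vocabulary):
    """Relative frequency of each vocabulary word within one chunk."""
    counts = Counter(chunk)
    total = len(chunk)
    return [counts[word] / total for word in vocabulary]


# Hypothetical toy input standing in for a lemmatized novel.
tokens = "a b a c a b d d".split()
chunks = chunk_tokens(tokens, chunk_size=4)
vocabulary = top_words(chunks, n=2)
features = [frequency_vector(chunk, vocabulary) for chunk in chunks]
```

Each row of `features` is one chunk's feature vector, which could then be passed to any of the classifier families the abstract names.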
Pages: 3-12
Page count: 10