Stylistics analysis and authorship attribution algorithms based on self-organizing maps

Cited by: 16
Authors
Neme, Antonio [1, 2]
Pulido, J. R. G. [3]
Munoz, Abril [4]
Hernandez, Sergio [5]
Dey, Teresa [6]
Affiliations
[1] Univ Autonoma Ciudad Mexico, Complex Syst Grp, Mexico City, DF, Mexico
[2] Inst Mol Med Finland, Helsinki 00270, Finland
[3] Univ Colima, Fac Telemat, Colima, Mexico
[4] CINVESTAV IDS, Mexico City, DF, Mexico
[5] Univ Autonoma Ciudad Mexico, Postgrad Program Complex Syst, Mexico City, DF, Mexico
[6] Univ Autonoma Ciudad Mexico, Fac Literary Creat, Mexico City, DF, Mexico
Keywords
Computational stylistics; Authorship attribution; Self-organizing maps; Anomaly detection; Feature selection; Mutual information; Novelty detection; Selection
DOI
10.1016/j.neucom.2014.03.064
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
An author's style can be thought of as a collection of attributes that defines a stylistics space. Texts by the same author tend to lie close together in that space. However, identifying useful stylistics spaces has proven challenging. Closely tied to the stylistics space is the authorship attribution task: a text of unknown authorship is presented to a system, and the system is expected to identify its author. An authorship attribution algorithm consists of two modules: the stylistics space and a classifier. We present a methodology that includes both: a module that allows the identification of novel stylistics spaces, and a classifier that addresses the authorship attribution task using the features that define the space. The methodology combines feature selection, anomaly detection, classification, and visualization algorithms. We applied self-organizing maps not only for visualization but also for anomaly detection, which forms the basis of the classifier. We compared our authorship attribution algorithm with two existing ones. Our methodology achieved similar or better results under bag-of-words-related stylistics spaces, and it showed the lowest error under a novel stylistics space based on the rate of introduction of new words. (C) 2014 Elsevier B.V. All rights reserved.
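The abstract names a stylistics space based on the rate of introduction of new words but does not define it. The following is a minimal sketch of one plausible reading, not the authors' method: the text is split into consecutive fixed-size windows, and each window is scored by the fraction of its tokens that have not appeared earlier in the text. The function names `tokenize` and `new_word_rate`, the regex tokenizer, and the `window` parameter are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def new_word_rate(tokens, window=100):
    """Return, for each consecutive window of tokens, the fraction of tokens
    that had not been seen anywhere earlier in the text.

    This is a hypothetical definition of 'rate of introduction of new words';
    the abstract does not specify the one used in the paper.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    seen = set()
    rates = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        new = 0
        for tok in chunk:
            if tok not in seen:
                seen.add(tok)
                new += 1
        # The last chunk may be shorter than `window`; normalize by its actual length.
        rates.append(new / len(chunk))
    return rates
```

Under this reading, the resulting sequence of rates would serve as a feature vector for a text: the first window scores high because nearly every word is new, and later windows show how quickly the author keeps introducing vocabulary.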
Pages: 147-159
Number of pages: 13