Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Cited by: 31
Authors
Gomez-Adorno, Helena [1]
Posadas-Duran, Juan-Pablo [2]
Sidorov, Grigori [1]
Pinto, David [3]
Affiliations
[1] IPN, CIC, Mexico City, DF, Mexico
[2] IPN, Escuela Super Ingn Mecan & Elect Unidad Zacatenco, Mexico City, DF, Mexico
[3] BUAP, Fac Ciencias Comp, Puebla, Mexico
Keywords
Document embeddings; Authorship attribution; Doc2vec; Neural networks; n-Grams
DOI
10.1007/s00607-018-0587-8
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Document embedding methods have recently been proposed to capture hidden properties of texts. These methods represent documents as fixed-length, continuous, dense feature vectors. In this paper, we propose learning document vectors from n-grams rather than from words alone, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We consider the task of cross-topic authorship attribution and run experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram-based linear models, which are among the most effective approaches for identifying an author's writing style.
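The three n-gram types named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names (`char_ngrams`, `token_ngrams`, `ngram_views`), the whitespace tokenization, and the `_` separator are assumptions made for illustration, and the POS tags are assumed to come from an external tagger. The resulting n-gram sequences would then be passed, in place of plain words, to a Paragraph Vector implementation.

```python
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every overlapping character n-gram of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_ngrams(tokens: List[str], n: int, sep: str = "_") -> List[str]:
    """Return every overlapping n-gram over a token sequence, joined by `sep`."""
    return [sep.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_views(text: str, pos_tags: List[str], n: int) -> Dict[str, List[str]]:
    """Build character, word, and POS-tag n-gram sequences for one document.

    `pos_tags` is assumed to be produced by an external tagger (hypothetical input).
    """
    words = text.split()  # assumption: simple whitespace tokenization
    return {
        "char": char_ngrams(text, n),
        "word": token_ngrams(words, n),
        "pos": token_ngrams(pos_tags, n),
    }


if __name__ == "__main__":
    sample = "the cat sat"
    tags = ["DET", "NOUN", "VERB"]  # hand-written tags for the example
    # The abstract states that n varies from 1 to 5 for every n-gram type.
    for n in range(1, 6):
        views = ngram_views(sample, tags, n)
        print(n, {name: len(items) for name, items in views.items()})
```

When n exceeds the sequence length, the functions return an empty list, so short documents simply contribute no n-grams of that order.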
Pages: 741-756
Number of pages: 16