Parallel Stylometric Document Embeddings with Deep Learning Based Language Models in Literary Authorship Attribution

Cited by: 7

Authors:

Skoric, Mihailo [1]

Stankovic, Ranka [1]

Ikonic Nesic, Milica [2]

Byszuk, Joanna [3]

Eder, Maciej [3]

Affiliations:

[1] Univ Belgrade, Fac Min & Geol, Djusina 7, Belgrade, Serbia

[2] Univ Belgrade, Fac Philol, Studentski Trg 3, Belgrade, Serbia

[3] Polish Acad Sci, Inst Polish Language, Mickiewicza 31, Krakow, Poland

Source:

MATHEMATICS | 2022 / Vol. 10 / Issue 05

Funding:

EU Horizon 2020;

Keywords:

document embeddings; authorship attribution; language modelling; parallel architectures; stylometry; language processing pipelines; NETWORK;

DOI:

10.3390/math10050838

Chinese Library Classification number:

O1 [Mathematics];

Subject classification codes:

0701 ; 070101 ;

Abstract:

This paper explores the effectiveness of parallel stylometric document embeddings in solving the authorship attribution task by testing a novel approach on literary texts in 7 different languages, totaling 7051 unique 10,000-token chunks from 700 PoS- and lemma-annotated documents. We used these documents to produce four document embedding models using the Stylo R package (word-based, lemma-based, PoS-trigrams-based, and PoS-mask-based) and one document embedding model using mBERT for each of the seven languages. We created further derivations of these embeddings in the form of the average, product, minimum, maximum, and $l^2$ norm of these document embedding matrices and tested them both including and excluding the mBERT-based document embeddings for each language. Finally, we trained several perceptrons on portions of the dataset in order to procure adequate weights for a weighted combination approach. We tested standalone (two baselines) and composite embeddings for classification accuracy, precision, recall, and weighted-average and macro-averaged F1-score, and compared them with one another. We found that for each language most of our composition methods outperform the baselines (with a couple of methods outperforming all baselines for all languages), with or without mBERT inputs, which are found to have no significant positive impact on the results of our methods.
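The composition step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the abstract names the Stylo R package and mBERT, not this code). It assumes each embedding model yields a numeric matrix of the same shape, and it assumes the listed derivations are taken element-wise across those matrices. The function name `compose`, the toy matrices `word` and `lemma`, and the weights `[0.75, 0.25]` are hypothetical; in the paper the weights are learned by perceptrons.

```python
import numpy as np


def compose(matrices, weights=None):
    """Element-wise combinations of same-shaped matrices (illustrative only)."""
    # Stack into shape (n_models, rows, cols) so each reduction runs over models.
    stack = np.stack(matrices)
    out = {
        "mean": stack.mean(axis=0),
        "product": stack.prod(axis=0),
        "min": stack.min(axis=0),
        "max": stack.max(axis=0),
        # l2 norm taken across models for every matrix cell (an assumption).
        "l2": np.sqrt((stack ** 2).sum(axis=0)),
    }
    if weights is not None:
        # Weighted sum across models; the weights here are hypothetical,
        # standing in for values a trained perceptron might supply.
        w = np.asarray(weights, dtype=float)
        out["weighted"] = np.tensordot(w, stack, axes=1)
    return out


# Toy 2x2 matrices standing in for two of the embedding models.
word = np.array([[0.0, 0.4], [0.4, 0.0]])
lemma = np.array([[0.0, 0.2], [0.2, 0.0]])
combined = compose([word, lemma], weights=[0.75, 0.25])
```

Each entry of `combined` has the same shape as the inputs, so any of the composite matrices could be passed to the same downstream classifier as a standalone one.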


Pages: 27
