On the role of words in the network structure of texts: Application to authorship attribution

Cited by: 27
Authors
Akimushkin, Camilo [1]
Amancio, Diego R. [2]
Oliveira, Osvaldo N., Jr. [1]
Affiliations
[1] Univ Sao Paulo, Sao Carlos Inst Phys, Ave Trabalhador Sao Carlense 400, Sao Carlos, SP, Brazil
[2] Univ Sao Paulo, Inst Math & Comp Sci, Ave Trabalhador Sao Carlense 400, Sao Carlos, SP, Brazil
Funding
São Paulo Research Foundation, Brazil
Keywords
Complex networks; Word semantics; Authorship attribution; Similarity measures; Burstiness; Intermittency; Language
DOI
10.1016/j.physa.2017.12.054
Chinese Library Classification number
O4 [Physics]
Discipline classification code
0702
Abstract
Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words, and bigrams. In a recent, alternative approach, medium- and large-scale text structures were used in opposition to the belief that text structure is dominated by the language features. In this paper, we introduce a generalized similarity measure to compare texts which accounts for both the network structure of texts and the role of individual words in the networks. The similarity measure is used for authorship attribution of three collections of books, each composed of 8 authors and 10 books per author. High accuracy rates were obtained with typical values between 90% and 98.75%, much higher than with the traditional term frequency-inverse document frequency (tf-idf) approach for the same collections. These accuracies are also higher than those obtained solely with the topology of networks. We conclude that the different properties of specific words on the macroscopic scale structure of a whole text are as relevant as their frequency of appearance; conversely, considering the identity of nodes brings further knowledge about a piece of text represented as a network. © 2017 Elsevier B.V. All rights reserved.
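For orientation, the tf-idf baseline named in the abstract can be sketched as follows. This is not the authors' generalized similarity measure, which the abstract does not specify; it is only a minimal illustration of comparing texts by weighted word frequencies. The weighting variant (relative term frequency times the natural log of inverse document frequency), the function names, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    Assumed weighting: relative term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for illustration (not taken from the paper).
docs = [
    "the whale swam under the ship".split(),
    "the whale struck the ship".split(),
    "she danced at the ball".split(),
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # texts sharing "whale" and "ship"
sim_02 = cosine(vecs[0], vecs[2])  # texts sharing only "the"
```

In this sketch a word that occurs in every document (here "the") receives weight zero, so the first and third texts have zero similarity even though they share it. A frequency-only measure of this kind ignores where words sit in the text's structure, which is the limitation the abstract contrasts with network-based comparison.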
Pages: 49-58
Number of pages: 10