Multilayer networks for text analysis with multiple data types

Cited by: 11
Authors
Hyland, Charles C. [1]
Tao, Yuanming [1]
Azizi, Lamiae [1]
Gerlach, Martin [2]
Peixoto, Tiago P. [3,4]
Altmann, Eduardo G. [1]
Affiliations
[1] Univ Sydney, Sch Math & Stat, Sydney, NSW 2006, Australia
[2] Wikimedia Fdn, San Francisco, CA USA
[3] Cent European Univ, Dept Network & Data Sci, Quellenstr 51, A-1100 Vienna, Austria
[4] Univ Bath, Dept Math Sci, Bath BA2 7AY, Avon, England
Keywords
Stochastic block models; Multilayer networks; Natural language processing; Complex systems; Data science;
DOI
10.1140/epjds/s13688-021-00288-5
Chinese Library Classification number
O1 [Mathematics];
Discipline classification code
0701; 070101;
Abstract
We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of data, we propose a novel framework based on Multilayer Networks and Stochastic Block Models. The main innovation of our approach over other techniques is that it applies the same non-parametric probabilistic framework to the different sources of data simultaneously. The key difference from other multilayer complex networks is the strong imbalance between the layers, with the average degree of different node types scaling differently with system size. We show that the latter observation is due to generic properties of text, such as Heaps' law, and strongly affects the inference of communities. We present and discuss the performance of our method on different datasets (hundreds of Wikipedia documents, thousands of scientific papers, and thousands of emails), showing that taking into account multiple types of information provides a more nuanced view of topic and document clusters and increases the ability to predict missing links.
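The layered structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and performs no Stochastic Block Model inference; it only builds two layers (document-word and document-document) from a toy corpus and compares the average degree of each node type. The corpus, the hyperlinks, and the helper names `build_layers` and `average_degree` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Toy corpus (hypothetical data, for illustration only).
documents = {
    "doc_a": "networks of words and documents".split(),
    "doc_b": "topics in documents and networks".split(),
    "doc_c": "words form topics".split(),
}
# Hypothetical hyperlinks between documents.
hyperlinks = [("doc_a", "doc_b"), ("doc_b", "doc_c")]


def build_layers(documents, hyperlinks):
    """Return edge lists for a text layer and a hyperlink layer.

    Text layer: one edge per word token, linking a document to a word.
    Hyperlink layer: one edge per link between two documents.
    """
    text_layer = [(doc, word) for doc, words in documents.items() for word in words]
    link_layer = list(hyperlinks)
    return text_layer, link_layer


def average_degree(edges, nodes):
    """Average number of edge endpoints in `edges` per node in `nodes`."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(degree[n] for n in nodes) / len(nodes)


text_layer, link_layer = build_layers(documents, hyperlinks)
doc_nodes = set(documents)
word_nodes = {w for words in documents.values() for w in words}

avg_doc_text = average_degree(text_layer, doc_nodes)    # document nodes, text layer
avg_word_text = average_degree(text_layer, word_nodes)  # word nodes, text layer
avg_doc_link = average_degree(link_layer, doc_nodes)    # document nodes, hyperlink layer

print(avg_doc_text, avg_word_text, avg_doc_link)
```

Even in this tiny example the text layer holds far more edges than the hyperlink layer, and document and word nodes have different average degrees. The abstract attributes the different scaling of these degrees with system size to generic properties of text such as Heaps' law; a toy corpus of this size cannot demonstrate that scaling, only the bookkeeping behind it.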
Pages: 16