Multi-Word Structural Topic Modelling of ToR Drug Marketplaces

Cited by: 5
Authors
Guarino, Stefano [1]
Santoro, Mario [1]
Affiliations
[1] CNR, Ist Applicazioni Calcolo Mauro Picone, Via Taurini 19, Rome, Italy
Source
2018 IEEE 12TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC) | 2018
Keywords
STM; N-grams; Tor; Markets; PHRASE; TEXT;
DOI
10.1109/ICSC.2018.00048
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Topic Modelling (TM) is a widely adopted generative model used to infer the thematic organization of text corpora. When document-level covariate information is available, so-called Structural Topic Modelling (STM) is the state-of-the-art approach to embed this information in the topic mining algorithm. Usually, TM algorithms rely on unigrams as the basic text generation unit, whereas the quality and intelligibility of the identified topics would significantly benefit from the detection and usage of topical phrasemes. Following on from previous research, in this paper we propose the first iterative algorithm to extend STM with n-grams, and we test our solution on textual data collected from four well-known ToR drug marketplaces. Significantly, we employ an STM-guided n-gram selection process, so that topic-specific phrasemes can be identified regardless of their global relevance in the corpus. Our experiments show that enriching the dictionary with selected n-grams improves the usability of STM, allowing the discovery of key information hidden in an apparently "mono-thematic" dataset.
Pages: 269-273
Page count: 5
Related papers
14 entries in total
  • [1] [Anonymous], 2009, ARXIV09071013
  • [2] Exploring and Analyzing the Tor Hidden Services Graph
    Bernaschi, Massimo
    Celestini, Alessandro
    Guarino, Stefano
    Lombardi, Flavio
    [J]. ACM TRANSACTIONS ON THE WEB, 2017, 11 (04)
  • [3] Content and popularity analysis of Tor hidden services
    Biryukov, Alex
    Pustogarov, Ivan
    Thill, Fabrice
    Weinmann, Ralf-Philipp
    [J]. 2014 IEEE 34TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS WORKSHOPS (ICDCSW), 2014, : 188 - 193
  • [4] Bischof Jonathan M., 2012, Proceedings of the 29th International Conference on Machine Learning, V29, P201
  • [5] Probabilistic Topic Models
    Blei, David M.
    [J]. COMMUNICATIONS OF THE ACM, 2012, 55 (04) : 77 - 84
  • [6] Celestini A., 2017, WIMS '17, P1, DOI 10.1145/3102254.3102266
  • [7] Danilevsky M., 2014, P 2014 SIAM INT C DA, P398
  • [8] Scalable Topical Phrase Mining from Text Corpora
    El-Kishky, Ahmed
    Song, Yanglei
    Wang, Chi
    Voss, Clare R.
    Han, Jiawei
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 8 (03) : 305 - 316
  • [9] Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts
    Grimmer, Justin
    Stewart, Brandon M.
    [J]. POLITICAL ANALYSIS, 2013, 21 (03) : 267 - 297
  • [10] Lindsey Robert, 2012, P 2012 JOINT C EMP M, P214