Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

Cited by: 3
Authors
Sacca, Domenico [1 ]
Serra, Edoardo [2 ]
Rullo, Antonino [1 ]
Affiliations
[1] Univ Calabria, DIMES Dept, Arcavacata Di Rende, Italy
[2] Boise State Univ, CS Dept, Boise, ID 83725 USA
Keywords
Data mining; Frequent itemset mining; Inverse problems; Classification; Linear programming; Big data; Synthetic dataset; Privacy-preserving data
DOI
10.1007/s10618-019-00643-1
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The development of novel platforms and techniques for emerging "Big Data" applications requires real-life datasets for data-driven experiments. In most cases, however, such datasets are not accessible, for reasons such as confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high-quality experimental findings is to synthesize datasets that reflect the patterns of real ones using a two-step approach: first, a real dataset X is analyzed to derive relevant patterns Z (latent variables); then, these patterns are used to reconstruct a new dataset X' that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining (IFM), which consists of generating a transactional dataset that satisfies given support constraints on the itemsets of an input set, typically the frequent ones. This paper introduces various extensions of IFM within a uniform framework, with the aim of generating artificial datasets that reflect more elaborate patterns of real ones (in particular, infrequency and duplicate constraints). Furthermore, to enlarge the application domain of IFM, an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.
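The IFM task described in the abstract can be illustrated with a minimal sketch. It is not the method of the paper (the keywords point to linear programming); it is a brute-force search that is only feasible for a handful of items. The function names `support`, `satisfies` and `naive_ifm`, the encoding of each support constraint as a (lower, upper) interval, and the toy constraints are all assumptions made for illustration.

```python
from itertools import combinations, combinations_with_replacement


def support(dataset, itemset):
    """Count the transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for transaction in dataset if target <= transaction)


def satisfies(dataset, constraints):
    """Check every assumed (lower, upper) support bound in `constraints`."""
    return all(
        lower <= support(dataset, itemset) <= upper
        for itemset, (lower, upper) in constraints.items()
    )


def naive_ifm(items, constraints, size):
    """Exhaustively search for `size` transactions meeting the constraints.

    Exponential in the number of items; for illustration only.
    Duplicate transactions are allowed.
    """
    # Every non-empty subset of the items is a candidate transaction.
    candidates = [
        frozenset(combo)
        for r in range(1, len(items) + 1)
        for combo in combinations(sorted(items), r)
    ]
    for dataset in combinations_with_replacement(candidates, size):
        if satisfies(dataset, constraints):
            return list(dataset)
    return None  # no dataset of this size satisfies the constraints


# Hypothetical toy input: {a, b} must occur exactly twice, {c} exactly once.
constraints = {
    frozenset({"a", "b"}): (2, 2),
    frozenset({"c"}): (1, 1),
}
synthetic = naive_ifm({"a", "b", "c"}, constraints, size=3)
```

Any dataset returned by `naive_ifm` satisfies the given bounds, but it need not resemble the original data in other respects, which is one reason additional constraint types are of interest.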
Pages: 1736-1774
Page count: 39