On Frequency Estimation and Detection of Frequent Items in Time Faded Streams

被引:13
作者
Cafaro, Massimo [1 ,2 ]
Epicoco, Italo [1 ,2 ]
Pulimeno, Marco [3 ]
Aloisio, Giovanni [1 ,2 ]
机构
[1] Univ Salento, Dept Engn Innovat, I-73100 Lecce, Italy
[2] Euromediterranean Ctr Climate Change, I-73100 Lecce, Italy
[3] Univ Salento, Dept Math & Phys, I-73100 Lecce, Italy
关键词
Data stream mining; time fading model; frequency estimation; frequent items; TOP-K ELEMENTS; FINDING FREQUENT; MINING FREQUENT; FADING MODEL; PARALLEL;
D O I
10.1109/ACCESS.2017.2757238
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We deal with the problem of detecting frequent items in a stream under the constraint that items are weighted, and recent items must be weighted more than older ones. This kind of problem naturally arises in a wide class of applications in which recent data is considered more useful and valuable with regard to older, stale data. The weight assigned to an item is, therefore, a function of its arrival timestamp. As a consequence, whilst in traditional frequent item mining applications we need to estimate frequency counts, we are instead required to estimate decayed counts. These applications are said to work in the time fading model. Two sketch-based algorithms for processing time-decayed streams have been recently published independently near the end of 2016. The Filtered Space Saving with Quasi-Heap (FSSQ) algorithm, besides a sketch, also uses an additional data structure called quasi-heap to maintain frequent items. Forward Decay Count-Min Space Saving (FDCMSS), our algorithm, cleverly combines key ideas borrowed from forward decay, the Count-Min sketch and the Space Saving algorithm. Therefore, it makes sense to compare and contrast the two algorithms in order to fully understand their strengths and weaknesses. We show, through extensive experimental results, that FSSQ is better for detecting frequent items than for frequency estimation. The use of the quasi-heap data structure slows down the algorithm owing to the huge number of maintenance operations. Therefore, FSSQ may not be able to cope with high-speed data streams. FDCMSS is better suitable for frequency estimation; moreover, it is extremely fast and can be used in the context of high-speed data streams and for the detection of frequent items as well, since its recall is always greater than 99%, even when using an extremely tiny amount of space. Therefore, FDCMSS proves to be an overall good choice when considering jointly the recall, precision, average relative error and the speed.
引用
收藏
页码:24078 / 24093
页数:16
相关论文
共 36 条
[1]  
[Anonymous], 2009, PVLDB
[2]  
[Anonymous], 2016, P 17 ITALIAN C THEOR
[3]  
Beyer K, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P359, DOI 10.1145/304181.304214
[4]  
Brin S., 1997, SIGMOD Record, V26, P255, DOI [10.1145/253262.253327, 10.1145/253262.253325]
[5]   Parallel space saving on multi- and many-core processors [J].
Cafaro, Massimo ;
Pulimeno, Marco ;
Epicoco, Italo ;
Aloisio, Giovanni .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2018, 30 (07)
[6]   Mining frequent items in the time fading model [J].
Cafaro, Massimo ;
Pulimeno, Marco ;
Epicoco, Italo ;
Aloisio, Giovanni .
INFORMATION SCIENCES, 2016, 370 :221-238
[7]   A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution [J].
Cafaro, Massimo ;
Pulimeno, Marco ;
Tempesta, Piergiulio .
INFORMATION SCIENCES, 2016, 329 :1-19
[8]   Finding frequent items in parallel [J].
Cafaro, Massimo ;
Tempesta, Piergiulio .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2011, 23 (15) :1774-1788
[9]  
Charikar M, 2002, LECT NOTES COMPUT SC, V2380, P693
[10]   Mining frequent items in data stream using time fading model [J].
Chen, Ling ;
Mei, Qingling .
INFORMATION SCIENCES, 2014, 257 :54-69