Parrot: A Progressive Analysis System on Large Text Collections

Cited by: 10
Authors
Zhang, Yazhong [1, 2]
Zhang, Hanbing [1, 2]
He, Zhenying [1, 2]
Jing, Yinan [1, 2]
Zhang, Kai [1, 2]
Wang, X. Sean [1, 2, 3]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[3] Shanghai Inst Intelligent Elect & Syst, Shanghai, Peoples R China
Funding
National Key Research and Development Program;
Keywords
Approximate query processing; Text data analytics; Term frequency; Bootstrap;
DOI
10.1007/s41019-020-00144-y
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Textual data keeps growing in size, along with the need for timely and cost-effective analysis, while growth in computing power cannot keep pace with the growth of data. Delays in processing large text collections can hurt user activity and insight. This calls for a paradigm shift from blocking execution to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation over text. The model is built on an incremental execution engine and computes a series of approximate results for a single query progressively, providing a smooth trade-off between accuracy and latency. As part of this model, we propose a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system, Parrot, on top of Apache Spark and evaluated its performance on real-world data. Experiments demonstrate that our method is 2.4x-19.7x faster at reaching a result within 1% error, while the confidence interval consistently covers the exact result.
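The idea in the abstract, a growing sample that yields a sequence of term frequency estimates with an error bound at each step, can be illustrated with a minimal sketch. This is not Parrot's implementation and not the authors' bootstrap variant: it uses the plain percentile bootstrap over whole documents, runs on a single machine rather than Apache Spark, and every name in it (`progressive_term_frequency`, `bootstrap_interval`, `batch_size`, and so on) is hypothetical.

```python
import random


def summarize(doc, term):
    """Reduce a tokenized document to (occurrences of term, token count)."""
    return doc.count(term), len(doc)


def ratio(pairs):
    """Term frequency over a set of (hits, length) document summaries."""
    hits = sum(h for h, _ in pairs)
    total = sum(n for _, n in pairs)
    return hits / total if total else 0.0


def bootstrap_interval(pairs, rng, n_resamples=200, alpha=0.05):
    """Standard percentile bootstrap, resampling documents with replacement.

    Assumption: this is the textbook bootstrap, not the progressive
    variant proposed in the paper.
    """
    estimates = sorted(
        ratio(rng.choices(pairs, k=len(pairs))) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def progressive_term_frequency(corpus, term, batch_size=500, seed=0):
    """Yield (docs_seen, estimate, (lo, hi)) after each extra random batch."""
    rng = random.Random(seed)
    order = list(range(len(corpus)))
    rng.shuffle(order)  # visit documents in random order, without replacement
    sample = []
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        sample.extend(summarize(corpus[i], term) for i in batch)
        yield len(sample), ratio(sample), bootstrap_interval(sample, rng)


if __name__ == "__main__":
    # Synthetic corpus, purely for illustration.
    gen = random.Random(1)
    vocab = ["data", "query", "spark", "text", "error"]
    corpus = [[gen.choice(vocab) for _ in range(20)] for _ in range(3000)]
    for seen, est, (lo, hi) in progressive_term_frequency(corpus, "spark"):
        print(f"{seen:5d} docs  tf={est:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]")
```

A caller can stop consuming the generator as soon as the interval is narrow enough for its purpose, which is the accuracy versus latency trade-off the abstract describes. The sketch recomputes the bootstrap from scratch at every step; avoiding that cost is presumably part of what a progressive bootstrap variant addresses.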
Pages: 1-19
Page count: 19