Similarity Grouping in Big Data Systems

Cited by: 1
Authors
Silva, Yasin N. [1 ]
Sandoval, Manuel [1 ]
Prado, Diana [1 ]
Wallace, Xavier [1 ]
Rong, Chuitian [2 ]
Affiliations
[1] Arizona State Univ, Glendale, AZ 85306 USA
[2] Tianjin Polytech Univ, Tianjin, Peoples R China
Source
SIMILARITY SEARCH AND APPLICATIONS (SISAP 2019) | 2019 / Vol. 11807
Keywords
Similarity grouping; Big data systems; Performance evaluation; MapReduce; Spark; Hadoop; Clustering; GROUP-BY;
DOI
10.1007/978-3-030-32047-8_19
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Distributed computing technologies have opened the door for a wide range of organizations to analyze massive amounts of data. Grouping (fast but based on exact semantics) and clustering (relatively slow but based on similarity-aware semantics) are among the most useful data analysis operations. Previous work introduced the Similarity Grouping (SG) operator, which aims to integrate the best features of grouping and clustering, i.e., fast execution times and similarity-aware grouping semantics. The SG operator, however, was proposed for single-node relational database systems. This paper introduces the Distributed Similarity Grouping (DSG) operator, a highly parallel operator for identifying similarity groups in big datasets. DSG enables the identification of groups in which all elements are within a given threshold distance of each other. This paper presents DSG's design details, implementation guidelines for Spark and Hadoop (two important Big Data systems), and an extensive performance and scalability evaluation.
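The grouping semantics described in the abstract (every pair of elements in a group lies within a given threshold of each other) can be illustrated with a small single-machine sketch. This is not the DSG algorithm from the paper, which is distributed and whose details are not given in this record. The function name `greedy_similarity_groups`, the parameter `eps`, the Euclidean distance, and the greedy first-fit assignment are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def greedy_similarity_groups(points, eps):
    """Greedily partition points so that every pair inside a group is within eps.

    Illustrative only: a single-machine, order-dependent heuristic, not the
    distributed DSG operator described in the paper.
    """
    groups = []
    for p in points:
        for g in groups:
            # p may join g only if it is within eps of EVERY current member.
            if all(dist(p, q) <= eps for q in g):
                g.append(p)
                break
        else:
            # No existing group can accept p, so it starts a new group.
            groups.append([p])
    return groups


# Hypothetical sample data.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.1)]
groups = greedy_similarity_groups(points, eps=0.6)
for g in groups:
    print(g)
```

In this sample, (1.0, 0.0) is within the threshold of (0.5, 0.0) but not of (0.0, 0.0), so it cannot join their group. That is the difference between this all-pairs semantics and chain-style clustering, where transitively connected points end up together.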
Pages: 212-220
Page count: 9