Apollo: Near-Duplicate Detection for Job Ads in the Online Recruitment Domain

Cited by: 4
Authors
Burk, Hunter [1]
Javed, Faizan [1]
Balaji, Janani [1]
Affiliations
[1] CareerBuilder LLC, Atlanta, GA 30326 USA
Source
2017 17TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2017) | 2017
Keywords
Near duplicate detection; Job ads deduplication;
DOI
10.1109/ICDMW.2017.29
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Job ad data has become an essential part of the recruiting world, helping recruiters to construct views of the labor market to determine emerging skills, closest competitors, and where to get the most value for each recruiting dollar spent. Collecting this data, however, can be problematic, as job ads are posted redundantly at numerous online locations. In this paper, we detail a domain-specific near-duplicate detection methodology aimed at tackling this problem. More specifically, we discuss Apollo, a near-duplicate detection system for job ads. Apollo is in production at CareerBuilder, a large online recruitment company, and powers many downstream analytics applications. Its effectiveness, predicated on precision, recall, F-score, and run time, is then compared against other industry-standard deduplication methods to prove its viability over existing paradigms.
Pages: 177-182
Page count: 6