Magellan: Toward Building Entity Matching Management Systems

Cited by: 3
Authors
Konda, Pradap [1]
Das, Sanjib [1]
Suganthan, Paul G. C. [1]
Martinkus, Philip [1]
Doan, AnHai [1]
Ardalan, Adel [1]
Ballard, Jeffrey R. [1]
Govind, Yash [1]
Li, Han [1]
Panahi, Fatemah [2]
Zhang, Haojun [1]
Naughton, Jeff [2]
Prasad, Shishir [3]
Krishnan, Ganesh [3]
Deep, Rohit [3]
Raghavendra, Vijay [3]
Affiliations
[1] Univ Wisconsin, Madison, WI 53706 USA
[2] Google, Mountain View, CA USA
[3] WalmartLabs, Mountain View, CA USA
Keywords
Open systems - Data visualization;
DOI
10.14778/2994509.2994535
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Entity matching (EM) has been a long-standing challenge in data management. Most current EM works focus only on developing matching algorithms. We argue that far more efforts should be devoted to building EM systems. We discuss the limitations of current EM systems, then describe Magellan, a new kind of EM system. Magellan is novel in four important aspects. (1) It provides how-to guides that tell users what to do in each EM scenario, step by step. (2) It provides tools to help users execute these steps; the tools seek to cover the entire EM pipeline, not just blocking and matching as current EM systems do. (3) Tools are built into the Python open-source data science ecosystem, allowing Magellan to borrow a rich set of capabilities in data cleaning, IE, visualization, learning, etc. (4) Magellan provides a powerful scripting environment to facilitate interactive experimentation and quick "patching" of the system. We describe research challenges and present extensive experiments that show the promise of the Magellan approach.
Pages: 33-40
Page count: 8