LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes

Cited by: 3
Authors
Chai, Chengliang [1]
Deng, Yuhao [1]
Zhan, Yutong [1]
Cao, Ziqi [1]
Zhang, Yuanfang [1]
Cao, Lei [2]
Wang, Yuping [1]
Zhang, Zhiwei [1]
Yuan, Ye [1]
Wang, Guoren [1]
Tang, Nan [3]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Univ Arizona, MIT, Tempe, AZ USA
[3] HKUST, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / No. 12
Funding
National Key Research and Development Program of China;
Keywords
DOI
10.14778/3685800.3685880
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Searching for tables in poorly maintained data lakes has long been recognized as a formidable challenge in data management. Three search tasks are pivotal: keyword-based, joinable, and unionable table search. Together they form the backbone of downstream work that aims to make sense of diverse datasets, such as machine learning. In this demo, we propose LakeCompass, an end-to-end prototype system that maintains abundant tabular data, supports all three search tasks with high efficacy, and serves downstream ML modeling well. Specifically, LakeCompass manages numerous real tables, over which diverse types of indexes are built to support efficient search under different user requirements. In particular, LakeCompass can automatically integrate the discovered tables to iteratively improve downstream model performance. Finally, we provide both Python APIs and a Web interface to facilitate flexible user interaction.
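To make the three search tasks named in the abstract concrete, here is a minimal sketch over a toy in-memory lake. It is not the LakeCompass API and does not reflect the indexes the system builds. The function names, the `Table` representation, and the scoring choices (value containment for joinability, column-name Jaccard for unionability, substring match for keyword search) are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[str]]


def containment(query_col: List[str], cand_col: List[str]) -> float:
    """Fraction of distinct query values that also occur in the candidate column."""
    q, c = set(query_col), set(cand_col)
    return len(q & c) / len(q) if q else 0.0


def joinability(query_col: List[str], table: Table) -> float:
    """Best containment of the query column over all columns of the table."""
    return max((containment(query_col, col) for col in table.values()), default=0.0)


def unionability(query: Table, table: Table) -> float:
    """Jaccard similarity of the two tables' column-name sets (a crude proxy)."""
    a, b = set(query), set(table)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def keyword_hit(keyword: str, table: Table) -> bool:
    """True if the keyword appears in any column name or cell value."""
    kw = keyword.lower()
    in_names = any(kw in name.lower() for name in table)
    in_cells = any(kw in str(v).lower() for col in table.values() for v in col)
    return in_names or in_cells


# Toy data, invented for illustration only.
query: Table = {
    "city": ["Beijing", "Guangzhou", "Tempe"],
    "population": ["21000000", "18000000", "180000"],
}
lake: Dict[str, Table] = {
    "weather": {"city": ["Beijing", "Guangzhou", "Paris"], "temp": ["3", "18", "9"]},
    "cities_2": {"city": ["Lyon"], "population": ["500000"]},
}

for name, table in lake.items():
    print(name, joinability(query["city"], table), unionability(query, table))
```

In this toy lake, `weather` scores higher on joinability (it shares city values with the query) while `cities_2` scores higher on unionability (it shares the query's schema). That is the distinction between the two tasks. A real system would replace these linear scans with indexes.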
Pages: 4381-4384
Page count: 4