Optimizing Multimodal Data Queries in Data Lakes

Cited by: 0
Authors
Xiong, Runqun [1]
Zhao, Shiyuan [2]
Chen, Ciyuan [1]
Xu, Zhuqing [3]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
[2] Southeast Univ, Sch Comp Software Engn, Nanjing 211189, Peoples R China
[3] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 211106, Peoples R China
Source
TSINGHUA SCIENCE AND TECHNOLOGY | 2025 / Volume 30 / Issue 06
Keywords
multimodal data query; data lake; contrastive learning; related data query; SEARCH; DISCOVERY; TABLES;
DOI
10.26599/TST.2025.9010022
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper addresses the challenge of efficiently querying related multimodal data in data lakes, large-scale storage and management systems that support heterogeneous data formats, including structured, semi-structured, and unstructured data. Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities, such as tables, images, and text, with applications in fields like e-commerce, healthcare, and education. However, existing methods primarily focus on single-modality queries, such as joinable or unionable table discovery, and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency. To tackle these challenges, we propose a Multimodal data Query mechanism for Data Lakes (MQDL), which employs a modality-adaptive indexing mechanism and contrastive-learning-based embeddings to unify representations across modalities. Additionally, we introduce product quantization to optimize candidate verification during queries, reducing computational overhead while maintaining precision. We evaluate MQDL on a table-image dataset across multiple business scenarios, measuring precision, recall, and F1-score. Results show that MQDL achieves an accuracy rate of approximately 90%, while demonstrating strong scalability and reduced query response time compared to traditional methods. These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.
Pages: 2625-2637
Page count: 13