Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain

Cited by: 0
Authors
Nananukul, Navapat [1 ]
Sisaengsuwanchai, Khanin [1 ]
Kejriwal, Mayank [1 ]
Affiliations
[1] University of Southern California, Information Sciences Institute, Los Angeles, CA
Source
Discover Artificial Intelligence | 2024, Vol. 4, No. 1
Keywords
Inter-consistency of prompting; Large language models; Prompt engineering; Unsupervised entity resolution;
DOI
10.1007/s44163-024-00159-8
Abstract
Entity Resolution (ER) is the problem of semi-automatically determining when two entities refer to the same underlying entity, with applications ranging from healthcare to e-commerce. Traditional ER solutions required considerable manual expertise, including domain-specific feature engineering, as well as identification and curation of training data. Recently released large language models (LLMs) provide an opportunity to make ER more seamless and domain-independent. Because of LLMs’ pre-trained knowledge, the matching step in ER can be reduced to prompting. However, it is also well known that LLMs can pose risks, that the quality of their outputs can depend on how prompts are engineered, and that the cost of using LLMs can be significant. Unfortunately, a systematic experimental study of the effects of different prompting methods, and their respective costs, for solving domain-specific entity matching with LLMs such as ChatGPT has been lacking thus far. This paper aims to address this gap by conducting such a study. We consider several relatively simple and cost-efficient ER prompt engineering methods and apply them to product matching on two well-known, real-world e-commerce datasets widely used in the community. We provide extensive experimental results showing that an LLM like GPT-3.5 is viable for high-performing product matching and, interestingly, that more complicated and detailed (and hence more expensive) prompting methods do not necessarily outperform simpler approaches. We provide brief qualitative and error analyses, including a study of the inter-consistency of different prompting methods to determine whether they yield stable outputs. Finally, we consider some limitations of LLMs when used as product matchers in potential real-world e-commerce applications. © The Author(s) 2024.
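As the abstract notes, the matching step can be framed as a prompt asking an LLM whether two records co-refer. A minimal sketch of constructing such a pairwise prompt is shown below; the template wording, field names, and example records are illustrative assumptions, not the paper's actual prompts:

```python
# Sketch: formatting two product records into a zero-shot yes/no
# matching prompt for an LLM such as GPT-3.5. Template and fields
# are hypothetical, for illustration only.

def build_match_prompt(product_a: dict, product_b: dict) -> str:
    """Serialize two product records into a pairwise matching question."""
    def describe(product: dict) -> str:
        # Flatten attribute-value pairs into a single readable line.
        return "; ".join(f"{key}: {value}" for key, value in product.items())

    return (
        "Do the following two product listings refer to the same product? "
        "Answer 'yes' or 'no'.\n"
        f"Product 1: {describe(product_a)}\n"
        f"Product 2: {describe(product_b)}"
    )

record_a = {"title": "Apple iPhone 13 128GB Blue", "brand": "Apple"}
record_b = {"title": "iPhone 13 (128 GB) - blue", "brand": "Apple"}
prompt = build_match_prompt(record_a, record_b)
```

The resulting string would then be sent to the LLM; the model's yes/no answer serves as the match decision, which is what makes per-pair token cost (and hence prompt length) a central concern in the study.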