A Systematic Comparison Between Open- and Closed-Source Large Language Models in the Context of Generating GDPR-Compliant Data Categories for Processing Activity Records

Cited by: 0
Authors
von Schwerin, Magdalena [1 ]
Reichert, Manfred [1 ]
Affiliation
[1] Institute of Databases and Information Systems, Ulm University, Ulm
Source
Future Internet | 2024, Vol. 16, No. 12
Keywords
GDPR documentation; large language model; natural language processing
DOI
10.3390/fi16120459
Abstract
This study investigates the capabilities of open-source Large Language Models (LLMs) in automating GDPR compliance documentation, specifically in generating data categories (types of personal data such as names or email addresses) for processing activity records, a document required by the General Data Protection Regulation (GDPR). Comparing four state-of-the-art open-source models with the closed-source GPT-4, we evaluate their performance on two benchmarks tailored to GDPR tasks: a multiple-choice benchmark testing contextual knowledge (scored by accuracy and F1) and a generation benchmark evaluating structured data generation. In addition, we conduct four experiments with context-augmenting techniques such as few-shot prompting and Retrieval-Augmented Generation (RAG), evaluating them on metrics such as latency, structure, grammar, validity, and contextual understanding. Our results show that open-source models, particularly Qwen2-7B, achieve performance comparable to GPT-4, demonstrating their potential as cost-effective and privacy-preserving alternatives. Context-augmenting techniques yield mixed results: RAG improves performance for known categories but struggles with categories absent from the knowledge base. Open-source models excel at structured legal tasks, although challenges remain in handling ambiguous legal language and unstructured scenarios. These findings underscore the viability of open-source models for GDPR compliance, while highlighting the need for fine-tuning and improved context augmentation to address complex use cases. © 2024 by the authors.
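As a rough illustration of the multiple-choice scoring the abstract names (accuracy and F1), the following Python sketch computes both metrics over answer labels. All identifiers and the toy answers are illustrative assumptions, not code from the paper.

# Hypothetical sketch: scoring a multiple-choice GDPR benchmark by
# accuracy and macro-averaged F1, the two metrics named in the abstract.
# The toy answers below are made up for illustration.

def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold answers."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """F1 per answer label (e.g., A-D), averaged with equal label weight."""
    scores = []
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

gold = ["A", "C", "B", "D"]  # made-up reference answers
pred = ["A", "C", "D", "D"]  # made-up model answers
print(f"accuracy={accuracy(gold, pred):.2f}, macro-F1={macro_f1(gold, pred):.2f}")

Macro averaging weights rare and frequent answer labels equally, which matters when the correct options are imbalanced across benchmark questions.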