TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Authors
Kanburoglu, Ali Bugra [1]
Tek, Faik Boray [2]
Affiliations
[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye
[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye
Source
IEEE Access | 2024, Vol. 12
Keywords
Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider
DOI
10.1109/ACCESS.2024.3498841
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.
Pages: 169379-169387
Page count: 9