TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study

Cited by: 0

Authors:

Kanburoglu, Ali Bugra [1]

Tek, Faik Boray [2]

Affiliations:

[1] Isik Univ, Dept Comp Engn, TR-34980 Istanbul, Turkiye

[2] Istanbul Tech Univ, Dept Artificial Intelligence & Data Engn, TR-34467 Istanbul, Turkiye

Source:

IEEE ACCESS | 2024 / Volume 12

Keywords:

Training; Structured Query Language; Accuracy; Error analysis; Benchmark testing; Cognition; Encoding; Text-to-SQL; LLM; large language models; Turkish; dataset; TURSpider;

DOI:

10.1109/ACCESS.2024.3498841

Chinese Library Classification:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

This paper introduces TURSpider, a novel Turkish Text-to-SQL dataset developed through human translation of the widely used Spider dataset, aimed at addressing the current lack of complex, cross-domain SQL datasets for the Turkish language. TURSpider incorporates a wide range of query difficulties, including nested queries, to create a comprehensive benchmark for Turkish Text-to-SQL tasks. The dataset enables cross-language comparison and significantly enhances the training and evaluation of large language models (LLMs) in generating SQL queries from Turkish natural language inputs. We fine-tuned several Turkish-supported LLMs on TURSpider and evaluated their performance in comparison to state-of-the-art models like GPT-3.5 Turbo and GPT-4. Our results show that fine-tuned Turkish LLMs demonstrate competitive performance, with one model even surpassing GPT-based models on execution accuracy. We also apply the Chain-of-Feedback (CoF) methodology to further improve model performance, demonstrating its effectiveness across multiple LLMs. This work provides a valuable resource for Turkish NLP and addresses specific challenges in developing accurate Text-to-SQL models for low-resource languages.

Citation

Pages: 169379-169387

Page count: 9
