DB-GPT-Hub for NL2SQL (Detailed Chapter): A Fine-Tuning Framework and Benchmark Comparison for Text2SQL Tasks
With the advent of the Artificial Intelligence Generated Content (AIGC) era, using large-scale pre-trained language models (LLMs) for SQL generation in Text2SQL tasks is becoming more and more common. LLM-based Text2SQL approaches are usually categorized into two types:
- A prompt-based in-context learning (ICL) approach;
- A fine-tuning approach, which builds a dataset for the Text2SQL task and fine-tunes an open-source LLM to fit it.
The prompt-based approach is relatively low cost, and its methodology and results are relatively mature; the fine-tuning approach is limited by its relatively large resource consumption and high computational cost, and has not been as well explored. DB-GPT-Hub addresses this gap: it is a training and inference framework and benchmark for Text2SQL based on fine-tuned LLMs, focusing mainly on fine-tuning LLMs at scale.
Main contributions:
- A standardized and comprehensive evaluation of Text2SQL tasks by fine-tuning medium to large open-source LLMs;
- A modular and easily extensible codebase supporting mainstream LLMs and experimental scenarios, prioritizing fine-tuning methods and extending to prompt-based approaches.

The work investigates the potential gains and performance bounds of fine-tuning methods compared to prompt-based methods, and explores the best solutions for specific scenarios. It is hoped that DB-GPT-Hub and these findings will drive further research and widespread applications that would otherwise be difficult to realize due to the lack of a dedicated open benchmark.
- Code: https://github.com/eosphoros-ai/DB-GPT-Hub
- Paper: https://arxiv.org/abs/2406.11434
- Text2SQL list: https://github.com/eosphoros-ai/Awesome-Text2SQL
1. Introduction to DB-GPT-Hub
Text-to-SQL (abbreviated as Text2SQL, or NL2SQL) is a technology that transforms natural language descriptions into corresponding Structured Query Language (SQL). It can effectively assist people in querying massive databases with concise, clear natural language descriptions and simplifies data query and analysis work. With the advent of Artificial Intelligence Generated Content (AIGC), using large-scale pre-trained language models for SQL generation has become more and more common.
However, in actual development, current Text2SQL technology is not effectively combined with some excellent LLM capabilities, such as self-instruct, chain of thought, distributed computing, quantized fine-tuning, and attention optimization. In addition, how to combine Text2SQL with powerful natural language understanding to realize a full-link workflow from data to model to fine-tuning to deployment to display is an urgent problem to be solved. Therefore, in this project, we built a full-link, fine-tuning-based framework for the Text2SQL task, evaluated the existing mainstream open-source models on the task to construct a benchmark for open-source LLMs, and analyzed a number of insights along the way.
2. Code architecture design
To fully utilize the language understanding capability of large language models (LLMs) and improve the efficiency and accuracy of model fine-tuning for Text2SQL, an end-to-end large-model Text2SQL fine-tuning sub-framework, DB-GPT-Hub, is proposed under the DB-GPT framework. It organizes a full-link workflow of data preprocessing - model fine-tuning - model prediction - model validation - model evaluation in the Text2SQL domain, as shown in the following figure:
Figure 1: Architecture flowchart of DB-GPT-Hub
As shown in Figure 1, the DB-GPT-Hub project focuses on data preprocessing - dataset construction - model fine-tuning - model prediction - model validation. The fine-tuned model can be seamlessly deployed into the DB-GPT framework and then combined with knowledge Q&A and data analysis capabilities to demonstrate the model's superior performance in the Text2SQL domain.
Specific features:
- Dataset construction: Raw Text2SQL data is processed into a suitable text representation format for fine-tuning the LLM, which includes integrating the question and the database schema description into the prompt as the instruction, as well as various question representations to improve performance during training and evaluation. In addition, different few-shot strategies (e.g., example selection and organization) can be chosen to construct the evaluation dataset.
- Training: The codebase supports fine-tuning open-source LLMs with PEFT strategies. Most public architectures are supported, with model sizes ranging from small to large, such as Qwen, Llama, Baichuan, and ChatGLM.
- Prediction: The codebase supports SQL query inference with both fine-tuned open-source LLMs and closed-source LLMs, including scenario-specific SQL generation with few-shot and zero-shot methods.
- Evaluation: Different evaluation metrics (EX, EM) are supported to assess the performance of the generated SQL from different perspectives.
2.1 Dataset construction
Take the open-source Spider dataset as an example for a detailed introduction. Spider is a multi-database, multi-table, single-turn query Text2SQL dataset, one of the most challenging datasets for the Text2SQL task, released by Yale University's LILY Lab in 2018. It has the following features:
- Large scale: The Spider dataset contains 10,181 natural language questions and 5,693 unique complex SQL queries across 200 multi-table databases covering 138 different domains.
- Strong generalization: Unlike previous Text2SQL datasets, Spider uses different SQL queries and database schemas in the training and test sets, which requires the model to generalize well not only to new SQL queries but also to new database schemas.
- Good structure: In contrast to WikiSQL, where each database has only one table, each database in Spider contains multiple tables, linked together by primary and foreign keys.
- Challenging: The Spider dataset contains almost all common advanced SQL syntax, such as ORDER BY, GROUP BY, HAVING, JOIN, INTERSECT, and nesting, as shown below.
Fig. 2: Distribution of SQL syntax across different datasets
The Spider dataset categorizes SQL generation into four difficulty classes:
- Easy:
  - Question: What is the number of cars with more than 4 cylinders?
  - SQL: SELECT COUNT(*) FROM cars_data WHERE cylinders > 4
- Medium:
  - Question: For each stadium, how many concerts are there?
  - SQL: SELECT T2.name, COUNT(*) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id
- Hard:
  - Question: Which countries in Europe have at least 3 car manufacturers?
  - SQL: SELECT T1.country_name FROM countries AS T1 JOIN continents AS T2 ON T1.continent = T2.cont_id JOIN car_makers AS T3 ON T1.country_id = T3.country WHERE T2.continent = 'Europe' GROUP BY T1.country_name HAVING COUNT(*) >= 3
- Extra hard:
  - Question: What is the average life expectancy in the countries where English is not the official language?
  - SQL: SELECT AVG(life_expectancy) FROM country WHERE name NOT IN (SELECT T1.name FROM country AS T1 JOIN country_language AS T2 ON T1.code = T2.country_code WHERE T2.language = "English" AND T2.is_official = "T")
To fully utilize the information in the database, such as tables and fields, the raw Spider data is preprocessed to express in natural language the table structures contained in the database, the fields contained in each table, and the corresponding primary and foreign keys. After data preprocessing, the following data format is obtained:
{"instruction": "concert_singer(database name) contains tables(a meter (measuring sth)) such as stadium, singer, concert, singer_in_concert. Table stadium has columns(columns) such as stadium_id, location, name, capacity, highest, lowest, average. stadium_id is the primary key(primary key). Table singer has columns such as singer_id, name, country, song_name, song_release_year, age, is_male. singer_id is the primary key. Table concert has columns such as concert_id, concert_name, theme, stadium_id, year. concert_id is the primary key. Table singer_in_concert has columns such as concert_id, singer_id. concert_id is the primary key. The year of concert is the foreign key(foreign key)of location of stadium. The stadium_id of singer_in_concert is the foreign key of name of singer. The singer_id of singer_in_concert is the foreign key of concert_name of concert.",
"input": "How many singers do we have?",
"response": "select count(*) from singer"}
{"instruction": "concert_singer(database_name) contains tables(tables) such as stadium, singer, concert, singer_in_concert. table stadium has columns(columns) such as stadium_id, location, name, capacity, high, low, average. Stadium_id is the primary key (primary key). Table singer has columns like the following:singer_id, name, country, song_name, song_release_year, age, is_male. singer_id is the primary key. The table concert has the following:concert_id, concert_name, theme, stadium_id, year. concert_id is the primary key. The table singer_in_concert has the following: concert_id, singer_id. concert_id is the primary key. The year of the concert is a foreign key (foreign key) to the venue location. singer_in_concert's stadium_id is a foreign key to the singer's name. singer_in_concert's singer_id is a foreign key to the concert's concert_name.
"input": "How many singers do we have?"
"response": "select count(*) from singer"}
At the same time, a prompt dictionary was customized to optimize the input and better utilize the comprehension capability of the large language model, as follows:
```python
SQL_PROMPT_DICT = {
    "prompt_input": (
        "I want you to act as a SQL terminal in front of an example database. "
        "Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\n"
        "###Instruction:\n{instruction}\n\n###Input:\n{input}\n\n###Response: "
    ),
    "prompt_no_input": (
        "I want you to act as a SQL terminal in front of an example database. "
        "Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\n"
        "###Instruction:\n{instruction}\n\n### Response: "
    ),
}
```
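To make concrete what the model actually sees, the template can be filled with one preprocessed record roughly like this (a usage sketch assuming the SQL_PROMPT_DICT above and the record format from section 2.1):

```python
# Usage sketch: render the final prompt for one preprocessed record.
example = {
    "instruction": "concert_singer contains tables such as stadium, singer, ...",
    "input": "How many singers do we have?",
}
key = "prompt_input" if example.get("input") else "prompt_no_input"
prompt = SQL_PROMPT_DICT[key].format(**example)
print(prompt)  # this string is fed to the LLM; its completion is the predicted SQL
```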
2.2 Model Training
This section introduces the training setup from two angles: the base models and the fine-tuning methods.
2.2.1 Base model
The currently supported model structures are as follows, including the mainstream Chinese and international open-source model series, such as the Llama, Baichuan, GLM, and Qwen series, covering a wide range; the benchmark spans model scales of 7B/13B/70B.
Fig. 5: Fine-tuning patterns for different models
2.2.2 Fine-tuning methods
Text2SQL fine-tuning consists mainly of the following steps:
- Environment setup
- Data processing
- SFT training
- Weight merging
- Model prediction
- Effectiveness evaluation
When fine-tuning a large language model for a specific task or domain, retraining all the model parameters incurs expensive training costs, so various optimized fine-tuning schemes have emerged. Weighing fine-tuning speed against accuracy, the project implements the two popular methods LoRA (short for Low-Rank Adaptation) and QLoRA (quantization + LoRA). The basic principle of LoRA is to freeze the original model parameters, add additional network layers to the model, and train only the parameters of these additional layers. Because the number of these additional parameters is small, this not only reduces fine-tuning cost significantly but also achieves results close to full-parameter fine-tuning, as shown in the following figure:
- The blue part of the figure shows the parameters of the pre-trained model; LoRA adds two structures, A and B, beside the pre-trained weights, initialized to a Gaussian distribution and to zero, respectively.
- The input dimension of A and the output dimension of B match the input and output dimensions of the original model, while the output dimension of A and the input dimension of B are much smaller values; this is the low-rank aspect, and it greatly reduces the number of parameters to be trained.
- Only the parameters of A and B are updated during training; the pre-trained parameters stay fixed. At inference time, the re-parameterization idea is used to merge AB into W, so no extra computation is introduced. Moreover, for different downstream tasks, only AB needs to be retrained on top of the pre-trained model, which also speeds up the training of large models. A minimal code sketch follows the figure below.
Figure 3: Schematic diagram of LoRA fine-tuning
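As a concrete illustration of the three points above, here is a minimal PyTorch sketch of a LoRA-augmented linear layer (illustrative only, omitting the usual alpha/r scaling; the project itself relies on the PEFT library rather than hand-written layers):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: y = x W^T + (x A^T) B^T, with W frozen."""
    def __init__(self, in_features: int, out_features: int, r: int = 8):
        super().__init__()
        # frozen pre-trained weight W (random here, a stand-in for a checkpoint)
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # A: Gaussian init; B: zero init, so the LoRA branch starts as a no-op
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T

    @torch.no_grad()
    def merge(self):
        # re-parameterization: fold B A into W so inference adds no extra compute
        # (after merging, the LoRA branch must be dropped or zeroed out)
        self.weight += self.lora_B @ self.lora_A
```

Training optimizes only lora_A and lora_B while weight stays frozen, which is why the trainable parameter count drops from in_features x out_features to r x (in_features + out_features).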
The QLoRA approach uses a low-precision storage data type (NF4) to compress the pre-trained language model. By freezing the LM parameters and adding a relatively small number of trainable parameters in the form of low-rank adapters, the LoRA layers are the only parameters updated during training, so the model footprint shrinks significantly with little or no impact on inference. As the name suggests, QLoRA is simply Quantize + LoRA.
Figure 4: QLoRA schematic
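In practice, QLoRA fine-tuning of the open-source models listed above can be set up with the Hugging Face stack roughly as follows; this is a sketch assuming transformers + bitsandbytes + peft, and the model id, target modules, and hyperparameter values are placeholders rather than the project's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# NF4 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",                       # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# low-rank adapters are the only trainable parameters
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # confirms only the LoRA weights require grad
```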
2.3 Model predictions
After the model is fine-tuned, the dev set of the Spider dataset is tested with the saved weights and the base model to obtain the model-predicted SQL. The dev_sql.json file contains 1,034 examples in total, which again need to be preprocessed before being fed to the model for prediction.
{"instruction": "concert_singer contains tables such as stadium, singer, concert, singer_in_concert. Table stadium has columns such as stadium_id, location, name, capacity, highest, lowest, average. stadium_id is the primary key. Table singer has columns such as singer_id, name, country, song_name, song_release_year, age, is_male. singer_id is the primary key. Table concert has columns such as concert_id, concert_name, theme, stadium_id, year. concert_id is the primary key. Table singer_in_concert has columns such as concert_id, singer_id. concert_id is the primary key. The stadium_id of concert is the foreign key of stadium_id of stadium. The singer_id of singer_in_concert is the foreign key of singer_id of singer. The concert_id of singer_in_concert is the foreign key of concert_id of concert.", "input": "How many singers do we have?", "output": "select count(*) from singer"}
The core code for model prediction is as follows:

```python
from typing import Dict, List

from tqdm import tqdm

def inference(model: ChatModel, predict_data: List[Dict], **input_kwargs):
    # ChatModel is the project's chat wrapper around the fine-tuned LLM
    res = []
    # for item in predict_data[:20]:  # uncomment to smoke-test on a small slice
    for item in tqdm(predict_data, desc="Inference Progress", unit="item"):
        # generate the SQL prediction for one natural-language question
        response, _ = model.chat(query=item["input"], history=[], **input_kwargs)
        res.append(response)
    return res
```
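Hypothetical usage of this function, assuming dev_sql.json stores one JSON object per line as in the examples above (the file handling and output path are illustrative assumptions):

```python
import json

# load the preprocessed dev examples (assumes one JSON object per line)
with open("dev_sql.json") as f:
    predict_data = [json.loads(line) for line in f if line.strip()]

predictions = inference(model, predict_data)  # `model` is an initialized ChatModel
with open("pred_sql.txt", "w") as f:
    f.write("\n".join(predictions))
```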
2.4 Model evaluation
After the model predicts the SQL, it needs to be compared with the gold answers from the Spider dataset, evaluated with the EX (Execution Accuracy) and EM (Exact Match) metrics. EX measures the proportion of examples in the dataset whose predicted SQL produces the correct execution result, where score(Ŷn, Yn) is 1 if executing the predicted query Ŷn returns the same result as the gold query Yn and 0 otherwise. It is formulated as follows:
$$
\mathrm{EX}=\frac{\sum_{n=1}^{N} \operatorname{score}\left(\hat{Y}_n, Y_n\right)}{N}
$$
The EM metric computes how exactly the SQL generated by the model matches the labeled SQL, using the same form with a string/component match as the score function:
$$
\mathrm{EM}=\frac{\sum_{n=1}^{N} \operatorname{score}\left(\hat{Y}_n, Y_n\right)}{N}
$$
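A minimal sketch of both metrics might look like this, assuming a SQLite copy of each database and multiset comparison of result rows; note that the official Spider evaluator parses SQL into components for EM rather than comparing normalized strings as done here:

```python
import sqlite3
from collections import Counter

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> int:
    """score(Y_hat, Y) for EX: 1 if both queries return the same multiset of rows."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
        return int(Counter(pred_rows) == Counter(gold_rows))
    except sqlite3.Error:
        return 0  # unexecutable predictions count as wrong
    finally:
        conn.close()

def ex_and_em(examples):
    """examples: list of (pred_sql, gold_sql, db_path) triples."""
    n = len(examples)
    ex = sum(execution_match(p, g, db) for p, g, db in examples) / n
    # naive EM: whitespace-normalized string equality (the official EM
    # compares parsed SQL components and ignores literal values)
    em = sum(int(" ".join(p.split()).lower() == " ".join(g.split()).lower())
             for p, g, _ in examples) / n
    return ex, em
```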
3. Experiment design
3.1 Datasets
The benchmark was constructed on both the Spider and BIRD datasets:
- Spider: a large-scale cross-domain dataset containing 10,181 natural language questions and 5,693 unique complex SQL queries across 200 databases covering 138 domains. The standard protocol divides it into 8,659 training examples and 2,147 test examples across 34 databases. The SQL queries are categorized into four difficulty levels: easy, medium, hard, and extra hard.
- BIRD: an extensive dataset containing 12,751 unique question-SQL pairs covering 95 large databases. The SQL queries are categorized into three difficulty levels: simple, moderate, and challenging. It is worth noting that SQL queries in the BIRD dataset tend to be more complex than those in the Spider dataset.
The codebase also adapts to other datasets such as WikiSQL and CoSQL.
For more references: NL2SQL Fundamentals Series (1): A Comprehensive Pros-and-Cons Analysis of the Industry's Top Leaderboards, Authoritative Benchmark Datasets, and LLMs (Spider vs. BIRD) [Text2SQL, Text2DSL]
3.1.1 Spider
EX accuracy table (L = LoRA, QL = QLoRA)
EM accuracy table (L = LoRA, QL = QLoRA)
3.1.2 BIRD
EX accuracy table (L = LoRA, QL = QLoRA)
EM accuracy table (L = LoRA, QL = QLoRA)
4. Experiment Insight
4.1 Effectiveness differences across task difficulty levels
The figure below shows the effectiveness of the tuned LLMs across a range of SQL generation difficulty levels for three 7B models. For all three fine-tuned models, the magnitude of the performance improvement is negatively correlated with SQL complexity: fine-tuning improves easy SQL much more markedly.
4.2 Comparison of LoRA and QLoRA
The table below summarizes the differences between LoRA and QLoRA in terms of EX, EM, time cost, and GPU memory. First, the models tuned with LoRA and QLoRA show limited differences in generation performance (measured by EX and EM). Second, consistent with the quantization mechanism, QLoRA takes more time to converge but needs less GPU memory. For example, compared with its Qwen-14B-LoRA counterpart, Qwen-14B-QLoRA requires about 2x the training time but only 50% of the GPU memory.