---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

# SLM-SQL: An Exploration of Small Language Models for Text-to-SQL

The SLM-SQL model, presented in the paper *SLM-SQL: An Exploration of Small Language Models for Text-to-SQL*, explores the potential of Small Language Models (SLMs) for translating natural language questions into SQL queries (Text-to-SQL).

Unlike traditional Large Language Models (LLMs), which have demonstrated strong performance on Text-to-SQL, SLMs (ranging from 0.5B to 1.5B parameters) currently underperform, but they offer significant advantages in inference speed and suitability for edge deployment. This work addresses their limitations by leveraging recent advances in post-training techniques.

Specifically, the authors utilized the open-source SynSQL-2.5M dataset to construct two derived datasets: SynSQL-Think-916K for SQL generation and SynSQL-Merge-Think-310K for SQL merge revision. They applied supervised fine-tuning and reinforcement learning-based post-training to the SLMs, followed by inference using a corrective self-consistency approach.
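The corrective self-consistency procedure itself is described in the paper and the forthcoming code, not in this card. As a rough illustration of the underlying idea, the sketch below shows plain execution-based self-consistency: sample several candidate queries, execute each against the database, and vote on the execution results. All helper names (`execute_sql`, `pick_by_self_consistency`) are hypothetical placeholders, and the actual SLM-SQL pipeline additionally uses a merge-revision model trained on SynSQL-Merge-Think-310K.

```python
# Hypothetical sketch of execution-based self-consistency over sampled SQL candidates.
# This is NOT the released SLM-SQL implementation; names and logic are illustrative only.
import sqlite3


def execute_sql(db_path, sql):
    """Run a candidate query and return its result set, or None if execution fails."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
        return tuple(map(tuple, rows))  # hashable so results can be grouped
    except Exception:
        return None
    finally:
        conn.close()


def pick_by_self_consistency(candidates, db_path):
    """Group executable candidates by their execution result and return one query
    from the largest group (majority vote on results rather than on SQL strings)."""
    groups = {}
    for sql in candidates:
        result = execute_sql(db_path, sql)
        if result is not None:
            groups.setdefault(result, []).append(sql)
    if not groups:
        return None  # all candidates failed; a corrective/merge-revision step would apply here
    best_group = max(groups.values(), key=len)
    return best_group[0]
```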

Experimental results validate the effectiveness and generalizability of the SLM-SQL method. On the BIRD development set, the evaluated models achieved an average improvement of 31.4 points. Notably, the 0.5B model reached 56.87% execution accuracy (EX), while the 1.5B model achieved 67.08% EX.

The authors plan to release their dataset, model, and code on GitHub.

## Usage

This model is designed for Text-to-SQL generation. The official code and detailed usage instructions are planned for release on GitHub; in the meantime, a model like this would typically be loaded with the transformers library, as sketched below. The exact prompt format and required inputs (e.g., the database schema) will be specified in the official repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the specific SLM-SQL checkpoint once it is available on the Hub.
model_name = "your_organization/slm-sql-model-name"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example Text-to-SQL prompt. The exact template may differ in the official release,
# and the database schema is typically included alongside the question.
question = "What are the names of all employees in the 'Engineering' department?"
prompt = f"Translate the following natural language query into SQL:\n{question}\nSQL:"

encoded_input = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**encoded_input, max_new_tokens=256)  # adjust for longer SQL
decoded_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(decoded_output)
```

Please refer to the official repository (once released) for detailed usage instructions, including specific prompt formats and handling of database schemas.