NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset
A large-scale, curated instruction dataset for training and evaluating LLMs on PostgreSQL-specific SQL and PL/pgSQL generation. Every row is a (question, schema, SQL) triplet with rich metadata for filtering and analysis.
Dataset Summary
| Metric | Value |
|---|---|
| Total rows | 211,539 |
| PostgreSQL-specific rows | 11,998 (5.7%) |
| Schema fill rate | 82.2% |
| Explanation fill rate | 17.8% |
| SQL length (median) | 83 chars |
| SQL length (max) | 61,419 chars |
Splits
| Split | Rows |
|---|---|
train |
194,398 |
validation |
13,693 |
test |
3,448 |
Schema
Each row contains 11 fields:
| Field | Type | Description |
|---|---|---|
question |
string |
Natural language instruction or question |
schema |
string? |
DDL schema context (CREATE TABLE statements), null if not applicable |
sql |
string |
Ground truth PostgreSQL SQL or PL/pgSQL answer |
explanation |
string? |
Short explanation of what the SQL does |
validation_query |
string? |
Query to validate the answer produces correct results |
source |
string |
Origin of this instruction pair (see Sources below) |
difficulty |
string |
One of: basic, intermediate, advanced |
category |
string |
SQL category (see Categories below) |
is_postgresql_specific |
bool |
True if SQL uses PostgreSQL-specific syntax |
sql_length |
int32 |
Character length of the SQL field |
num_statements |
int32 |
Number of SQL statements (semicolon count) |
Sources
Data is aggregated from multiple high-quality sources, each tagged:
| Source | Rows |
|---|---|
community_sql_datasets |
115,811 |
sql_create_context |
78,392 |
postgresql_regression_tests |
11,622 |
pgtap_tests |
4,181 |
plpgsql_source |
1,529 |
synthetic_text_to_sql |
4 |
Source Descriptions
postgresql_regression_testsβ SQL extracted from PostgreSQL's own regression test suitepostgresql_docsβ Examples from official PostgreSQL SGML documentationpostgresql_contribβ SQL from contrib modules (pg_trgm, hstore, ltree, etc.)pgtap_testsβ pgTAP unit test SQLplpgsql_sourceβ PL/pgSQL functions from the PostgreSQL source treepgbench_scriptsβ pgbench benchmark scriptshandcrafted_advancedβ Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)sql_create_contextβ WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)synthetic_text_to_sqlβ Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)community_sql_datasetsβ Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)
Difficulty Distribution
| Difficulty | Rows |
|---|---|
basic |
147,920 |
intermediate |
56,469 |
advanced |
7,150 |
Categories
| Category | Rows |
|---|---|
query_select |
136,225 |
query_aggregation |
32,050 |
query_join |
10,597 |
dml_insert |
8,763 |
other |
4,093 |
dml_update |
3,664 |
dml_delete |
3,647 |
ddl_table |
3,430 |
query_window_function |
3,055 |
plpgsql_function |
1,912 |
ddl_advanced |
1,143 |
ddl_index |
806 |
plpgsql |
742 |
ddl_view |
541 |
plpgsql_trigger |
401 |
ddl_alter |
235 |
admin_maintenance |
125 |
dcl_security |
92 |
query_recursive_cte |
18 |
Usage
from datasets import load_dataset
ds = load_dataset("neurondb/neurondb-postgresql-sql")
# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)
# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")
# Filter by source
gold = ds["train"].filter(
lambda x: x["source"] in [
"postgresql_regression_tests",
"postgresql_docs",
"handcrafted_advanced",
]
)
Intended Use
- Fine-tuning LLMs for PostgreSQL SQL and PL/pgSQL code generation
- Evaluating text-to-SQL models on PostgreSQL-specific syntax
- Benchmarking SQL generation quality across difficulty levels
- Building PostgreSQL-aware coding assistants
Data Quality
- All rows have non-empty
questionandsqlfields - MySQL-only and T-SQL-only syntax has been filtered out
- Duplicate (question, SQL) pairs have been removed
- Rows with trivially short SQL (< 10 chars) are excluded
- Each row is tagged with source, difficulty, and category for easy filtering
Examples
Example 1 β basic / query_select
Source: sql_create_context
Question: Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?
Schema:
CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
SQL:
SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
Example 2 β intermediate / query_join
Source: community_sql_datasets
Question: What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?
Schema:
CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
SQL:
SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
Example 3 β advanced / plpgsql_function
Source: community_sql_datasets
Question: Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).
SQL:
declare
c refcursor;
Explanation: PL/pgSQL object from PostgreSQL core test for Plpgsql.
Example 4 β advanced / query_window_function
Source: community_sql_datasets
Question: What is the difference in the number of attendees for each community education program between the first and last occurrence?
Schema:
CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
SQL:
SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
Citation
If you use this dataset, please cite:
@dataset{neurondb_postgresql_sql_2026,
title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
author={NeuronDB Team},
year={2026},
url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}
License
Apache 2.0