NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

A large-scale, curated instruction dataset for training and evaluating LLMs on PostgreSQL-specific SQL and PL/pgSQL generation. Every row is a (question, schema, SQL) triplet with rich metadata for filtering and analysis.

Dataset Summary

Metric	Value
Total rows	211,539
PostgreSQL-specific rows	11,998 (5.7%)
Schema fill rate	82.2%
Explanation fill rate	17.8%
SQL length (median)	83 chars
SQL length (max)	61,419 chars

Splits

Split	Rows
`train`	194,398
`validation`	13,693
`test`	3,448

Schema

Each row contains 11 fields:

Field	Type	Description
`question`	`string`	Natural language instruction or question
`schema`	`string?`	DDL schema context (CREATE TABLE statements), null if not applicable
`sql`	`string`	Ground truth PostgreSQL SQL or PL/pgSQL answer
`explanation`	`string?`	Short explanation of what the SQL does
`validation_query`	`string?`	Query to validate the answer produces correct results
`source`	`string`	Origin of this instruction pair (see Sources below)
`difficulty`	`string`	One of: `basic`, `intermediate`, `advanced`
`category`	`string`	SQL category (see Categories below)
`is_postgresql_specific`	`bool`	True if SQL uses PostgreSQL-specific syntax
`sql_length`	`int32`	Character length of the SQL field
`num_statements`	`int32`	Number of SQL statements (semicolon count)

Sources

Data is aggregated from multiple high-quality sources, each tagged:

Source	Rows
`community_sql_datasets`	115,811
`sql_create_context`	78,392
`postgresql_regression_tests`	11,622
`pgtap_tests`	4,181
`plpgsql_source`	1,529
`synthetic_text_to_sql`	4

Source Descriptions

postgresql_regression_tests — SQL extracted from PostgreSQL's own regression test suite
postgresql_docs — Examples from official PostgreSQL SGML documentation
postgresql_contrib — SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
pgtap_tests — pgTAP unit test SQL
plpgsql_source — PL/pgSQL functions from the PostgreSQL source tree
pgbench_scripts — pgbench benchmark scripts
handcrafted_advanced — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
sql_create_context — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
synthetic_text_to_sql — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
community_sql_datasets — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)

Difficulty Distribution

Difficulty	Rows
`basic`	147,920
`intermediate`	56,469
`advanced`	7,150

Category	Rows
`query_select`	136,225
`query_aggregation`	32,050
`query_join`	10,597
`dml_insert`	8,763
`other`	4,093
`dml_update`	3,664
`dml_delete`	3,647
`ddl_table`	3,430
`query_window_function`	3,055
`plpgsql_function`	1,912
`ddl_advanced`	1,143
`ddl_index`	806
`plpgsql`	742
`ddl_view`	541
`plpgsql_trigger`	401
`ddl_alter`	235
`admin_maintenance`	125
`dcl_security`	92
`query_recursive_cte`	18

Usage

from datasets import load_dataset

ds = load_dataset("neurondb/neurondb-postgresql-sql")

# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
    lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)

# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")

# Filter by source
gold = ds["train"].filter(
    lambda x: x["source"] in [
        "postgresql_regression_tests",
        "postgresql_docs",
        "handcrafted_advanced",
    ]
)

Intended Use

Fine-tuning LLMs for PostgreSQL SQL and PL/pgSQL code generation
Evaluating text-to-SQL models on PostgreSQL-specific syntax
Benchmarking SQL generation quality across difficulty levels
Building PostgreSQL-aware coding assistants

Data Quality

All rows have non-empty question and sql fields
MySQL-only and T-SQL-only syntax has been filtered out
Duplicate (question, SQL) pairs have been removed
Rows with trivially short SQL (< 10 chars) are excluded
Each row is tagged with source, difficulty, and category for easy filtering

Examples

Example 1 — basic / query_select

Source: sql_create_context

Question: Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?

Schema:

CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)

SQL:

SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';

Example 2 — intermediate / query_join

Source: community_sql_datasets

Question: What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?

Schema:

CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);

SQL:

SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;

Example 3 — advanced / plpgsql_function

Source: community_sql_datasets

Question: Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).

SQL:

 $l a n g u a g e p l p g s q l; s e l e c t * f r o m s c_{t} e s t (); c r e a t e o r r e p l a c e f u n c t i o n s c_{t} e s t () r e t u r n s s e t o f i n t e g e r a s$ 
declare
  c refcursor;

Explanation: PL/pgSQL object from PostgreSQL core test for Plpgsql.

Example 4 — advanced / query_window_function

Source: community_sql_datasets

Question: What is the difference in the number of attendees for each community education program between the first and last occurrence?

Schema:

CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);

SQL:

SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;

Citation

If you use this dataset, please cite:

@dataset{neurondb_postgresql_sql_2026,
  title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
  author={NeuronDB Team},
  year={2026},
  url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}

License

Apache 2.0

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

Dataset Summary

Splits

Schema

Sources

Source Descriptions

Difficulty Distribution

Categories

Usage

Intended Use

Data Quality

Examples

Example 1 — basic / query_select

Example 2 — intermediate / query_join

Example 3 — advanced / plpgsql_function

Example 4 — advanced / query_window_function

Citation

License