NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

A large-scale, curated instruction dataset for training and evaluating LLMs on PostgreSQL-specific SQL and PL/pgSQL generation. Every row is a (question, schema, SQL) triplet with rich metadata for filtering and analysis.

Dataset Summary

Metric Value
Total rows 211,539
PostgreSQL-specific rows 11,998 (5.7%)
Schema fill rate 82.2%
Explanation fill rate 17.8%
SQL length (median) 83 chars
SQL length (max) 61,419 chars

Splits

Split Rows
train 194,398
validation 13,693
test 3,448

Schema

Each row contains 11 fields:

Field Type Description
question string Natural language instruction or question
schema string? DDL schema context (CREATE TABLE statements), null if not applicable
sql string Ground truth PostgreSQL SQL or PL/pgSQL answer
explanation string? Short explanation of what the SQL does
validation_query string? Query to validate the answer produces correct results
source string Origin of this instruction pair (see Sources below)
difficulty string One of: basic, intermediate, advanced
category string SQL category (see Categories below)
is_postgresql_specific bool True if SQL uses PostgreSQL-specific syntax
sql_length int32 Character length of the SQL field
num_statements int32 Number of SQL statements (semicolon count)

Sources

Data is aggregated from multiple high-quality sources, each tagged:

Source Rows
community_sql_datasets 115,811
sql_create_context 78,392
postgresql_regression_tests 11,622
pgtap_tests 4,181
plpgsql_source 1,529
synthetic_text_to_sql 4

Source Descriptions

  • postgresql_regression_tests β€” SQL extracted from PostgreSQL's own regression test suite
  • postgresql_docs β€” Examples from official PostgreSQL SGML documentation
  • postgresql_contrib β€” SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
  • pgtap_tests β€” pgTAP unit test SQL
  • plpgsql_source β€” PL/pgSQL functions from the PostgreSQL source tree
  • pgbench_scripts β€” pgbench benchmark scripts
  • handcrafted_advanced β€” Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
  • sql_create_context β€” WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
  • synthetic_text_to_sql β€” Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
  • community_sql_datasets β€” Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)

Difficulty Distribution

Difficulty Rows
basic 147,920
intermediate 56,469
advanced 7,150

Categories

Category Rows
query_select 136,225
query_aggregation 32,050
query_join 10,597
dml_insert 8,763
other 4,093
dml_update 3,664
dml_delete 3,647
ddl_table 3,430
query_window_function 3,055
plpgsql_function 1,912
ddl_advanced 1,143
ddl_index 806
plpgsql 742
ddl_view 541
plpgsql_trigger 401
ddl_alter 235
admin_maintenance 125
dcl_security 92
query_recursive_cte 18

Usage

from datasets import load_dataset

ds = load_dataset("neurondb/neurondb-postgresql-sql")

# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
    lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)

# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")

# Filter by source
gold = ds["train"].filter(
    lambda x: x["source"] in [
        "postgresql_regression_tests",
        "postgresql_docs",
        "handcrafted_advanced",
    ]
)

Intended Use

  • Fine-tuning LLMs for PostgreSQL SQL and PL/pgSQL code generation
  • Evaluating text-to-SQL models on PostgreSQL-specific syntax
  • Benchmarking SQL generation quality across difficulty levels
  • Building PostgreSQL-aware coding assistants

Data Quality

  • All rows have non-empty question and sql fields
  • MySQL-only and T-SQL-only syntax has been filtered out
  • Duplicate (question, SQL) pairs have been removed
  • Rows with trivially short SQL (< 10 chars) are excluded
  • Each row is tagged with source, difficulty, and category for easy filtering

Examples

Example 1 β€” basic / query_select

Source: sql_create_context

Question: Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?

Schema:

CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)

SQL:

SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';

Example 2 β€” intermediate / query_join

Source: community_sql_datasets

Question: What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?

Schema:

CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);

SQL:

SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;

Example 3 β€” advanced / plpgsql_function

Source: community_sql_datasets

Question: Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).

SQL:

languageplpgsql;selectβˆ—fromsctest();createorreplacefunctionsctest()returnssetofintegeras language plpgsql;

select * from sc_test();

create or replace function sc_test() returns setof integer as 
declare
  c refcursor;

Explanation: PL/pgSQL object from PostgreSQL core test for Plpgsql.

Example 4 β€” advanced / query_window_function

Source: community_sql_datasets

Question: What is the difference in the number of attendees for each community education program between the first and last occurrence?

Schema:

CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);

SQL:

SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;

Citation

If you use this dataset, please cite:

@dataset{neurondb_postgresql_sql_2026,
  title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
  author={NeuronDB Team},
  year={2026},
  url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}

License

Apache 2.0

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support