Entz 
posted an update 2 days ago
London Property Market Analyst — GPT + HM Land Registry + Gradio (full pipeline breakdown)

---

Built a production RAG-style chatbot over 32 million rows of official UK property transaction data. Sharing the architecture in case it's useful for others building structured-data Q&A systems.

This is the latest step in a project that started in 2021 as a basic Python analyst script, grew into automatic PDF report generation, then became a fully automated daily pipeline — and now a conversational AI anyone can query.

**Live demo:** https://uk-property-app.entzai.com/

**My Space:** https://huggingface.co/spaces/Entz/uk-property-app

---

### The problem with LLMs over tabular data

The naive approach — dump your CSV into the context window — breaks down fast at scale. The raw HM Land Registry file is a 4–5GB CSV covering 32 million transactions across England & Wales. I filtered it to ~1.76M London transactions (2010–2026), but even that doesn't fit in any context window, and even if it did, asking GPT to do GROUP BY in its head is asking for hallucinations.

The solution: **a structured analytics layer between the LLM and the data.**

---

### Architecture

```
User query (natural language)
        ↓
  Triage Agent (GPT)
  — classifies user intent
  — extracts structured params: district, property type,
    new/old build, time window, metric (median/mean/count)
        ↓
  Analytics Engine (pure Python + Pandas)
  — queries pre-aggregated Parquet files (not raw CSV)
  — returns structured JSON
        ↓
  Synthesis Agent (GPT)
  — receives structured JSON, writes prose analysis
  — hard rules prevent hallucination of years, ranges, stats
        ↓
  Chart Agent (Matplotlib)
  — 10 chart types, including line, multi-line, stacked bar, h-bar,
    diverging bar, band trend, growth ranking, and table
  — returned as base64 PNG → gr.Image on frontend
```
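The whole flow can be sketched end to end in a few lines. This is an illustrative skeleton with the GPT agents stubbed out — function names like `triage`, `run_analytics`, and `synthesize` are my own, not the project's actual code:

```python
# Minimal sketch of the four-stage pipeline with the GPT agents stubbed.
# A tiny dict stands in for the pre-aggregated Parquet layer:
# (district, property_type, year) -> median price
AGGREGATES = {
    ("Camden", "flat", 2023): 550_000,
    ("Camden", "flat", 2024): 565_000,
}

def triage(question: str) -> dict:
    """Stub for the Triage Agent: in production a GPT call extracts these params."""
    return {"district": "Camden", "property_type": "flat",
            "years": [2023, 2024], "metric": "median"}

def run_analytics(params: dict) -> dict:
    """Analytics engine: deterministic lookups, no LLM involved."""
    series = {y: AGGREGATES[(params["district"], params["property_type"], y)]
              for y in params["years"]}
    return {"district": params["district"], "metric": params["metric"],
            "series": series}

def synthesize(result: dict) -> str:
    """Stub for the Synthesis Agent: writes prose only from the JSON it is given."""
    years = sorted(result["series"])
    first, last = result["series"][years[0]], result["series"][years[-1]]
    return (f"{result['district']} {result['metric']} price moved from "
            f"£{first:,} in {years[0]} to £{last:,} in {years[-1]}.")

answer = synthesize(run_analytics(triage("How have Camden flat prices changed?")))
```

The point of the shape: the LLM never touches raw rows. It only sees parameters on the way in and aggregated JSON on the way out.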

### Key design decisions

**Pre-aggregated Parquets, not raw CSV**
The raw CSV is 4–5GB. Instead, the engine queries aggregated data (under 100MB total) covering every dimension combination × 16 years. Query time: <50ms. The Parquets are generated by a Python script that runs daily as an automated ETL step, the same pipeline I built for the earlier PDF report system.
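The pre-aggregation step is essentially one groupby. A minimal sketch with pandas (column names here are illustrative; in the real pipeline the result would be written out with something like `agg.to_parquet(...)`):

```python
import pandas as pd

# Toy stand-in for the raw transaction table.
raw = pd.DataFrame({
    "district": ["Camden", "Camden", "Hackney", "Hackney"],
    "property_type": ["flat", "flat", "terraced", "terraced"],
    "year": [2024, 2024, 2024, 2024],
    "price": [500_000, 600_000, 700_000, 750_000],
})

# One row per dimension combo, with every metric the chatbot can be asked for.
agg = (raw.groupby(["district", "property_type", "year"])["price"]
          .agg(median="median", mean="mean", count="count")
          .reset_index())

# At query time the engine filters this small table instead of the 4-5GB CSV.
camden = agg[(agg.district == "Camden") & (agg.year == 2024)]
```

Because every (district × property type × year × metric) combination is materialized up front, a "query" at chat time is a filter on a small table, which is how the <50ms latency falls out.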

**Triage before synthesis**
A smaller GPT model (fast, cheap) handles intent classification and parameter extraction; a second GPT call runs only the final synthesis step, on clean structured data. Cost: ~$0.01–0.02/query.
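One reason the triage output stays cheap and safe is that it is a strict, flat schema the deterministic engine can validate before running anything. A sketch of that idea (field names and ranges are assumptions for illustration; the 2010–2026 window and metric set come from the post):

```python
ALLOWED_METRICS = {"median", "mean", "count"}
ALLOWED_TYPES = {"flat", "terraced", "semi-detached", "detached"}

def validate_params(params: dict) -> dict:
    """Reject anything the analytics engine cannot serve, before it runs."""
    if params["metric"] not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {params['metric']}")
    if params["property_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown property type: {params['property_type']}")
    lo, hi = params["years"]
    if not (2010 <= lo <= hi <= 2026):  # the dataset's coverage window
        raise ValueError("year range outside 2010-2026")
    return params

params = validate_params({"district": "Hackney", "property_type": "flat",
                          "metric": "median", "years": (2015, 2024)})
```

Validating at this boundary means a mis-extraction from the triage model fails loudly instead of producing a plausible-looking but wrong chart.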

**Hard rules in the synthesis prompt, not soft suggestions**
LLMs reliably ignore "try not to..." instructions when they think context justifies it. For data accuracy I use hard "NEVER cite X unless the user's question contains word Y" phrasing, confirmed more reliable via A/B testing on edge cases.
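Hard rules can also be backed by a cheap post-hoc check. A sketch of that belt-and-braces pattern (the rule wording is illustrative, not the production prompt):

```python
import re

HARD_RULE = ("NEVER mention a year unless it appears in the `series` field of "
             "the JSON below. NEVER estimate values that are not in the JSON.")

def cites_only_known_years(prose: str, series: dict) -> bool:
    """Flag any four-digit year in the answer that is not in the data."""
    cited = {int(y) for y in re.findall(r"\b(20\d{2})\b", prose)}
    return cited <= set(series)

ok = cites_only_known_years("Prices rose through 2023 and 2024.",
                            {2023: 550_000, 2024: 565_000})
bad = cites_only_known_years("By 2030 prices could double.",
                             {2023: 550_000})
```

If the check fails, the answer can be regenerated or the offending sentence dropped, so a prompt slip never reaches the user.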

**Public/private Space split**
A private backend (data, agents, pipeline) plus a public frontend: a thin Gradio UI that talks to the backend via gradio_client, supporting both UI and API access. The frontend's requirements.txt contains only Pillow.

**Partial-year handling**
The data includes 2026 (a partial year). Charts always show it. Surprisingly, this was one of the hardest things to get right.
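One simple way to handle it (an assumption about the approach, not the project's exact code) is to derive which year is incomplete from the current date and label it, so neither the charts nor the Synthesis Agent treat it as a full year:

```python
from datetime import date
from typing import Optional

def year_label(year: int, today: Optional[date] = None) -> str:
    """Label the current (incomplete) year so it is never read as a full year."""
    today = today or date.today()
    if year == today.year:
        return f"{year} (partial, to {today:%b})"
    return str(year)

# Fixed date used so the example is deterministic.
labels = [year_label(y, today=date(2026, 3, 1)) for y in (2024, 2025, 2026)]
```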

**Automated data pipeline**
A scheduled notebook checks daily whether the raw gov.uk dataset has been updated; if so, it regenerates the Parquets and uploads them to the data server, and the Space detects the git commit and auto-restarts. Zero manual effort once deployed.
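The freshness check can be as simple as comparing a content hash (or the HTTP Last-Modified header) against the value stored from the previous run. A sketch of the hash variant; the state handling here is illustrative:

```python
import hashlib

def dataset_changed(new_bytes: bytes, last_hash: str) -> bool:
    """True when the downloaded file differs from the previous run's hash."""
    return hashlib.sha256(new_bytes).hexdigest() != last_hash

previous = hashlib.sha256(b"old snapshot").hexdigest()
changed = dataset_changed(b"new snapshot", previous)    # -> regenerate Parquets
unchanged = dataset_changed(b"old snapshot", previous)  # -> do nothing today
```

Gating the expensive ETL on this check is what makes a daily schedule cheap: most days the job downloads, compares, and exits.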
