VLM Playground (PreviewSpace) — Product Requirements Document

Summary

An internal Gradio Blocks app for rapid, structured experimentation with a Vision-Language Model (initially dots.ocr). It mirrors the reference playground but is deliberately minimal: stateless by default, no run history, and focused on getting a quick, hands-on feel for model performance. Supports PDF/image upload, page preview and navigation, page-range parsing, and result views (Markdown Render, Markdown Raw Text, Current Page JSON) with preserved scroll position. Designed to run locally or on Hugging Face Spaces.

Goals

  • Fast iteration: Upload, prompt, parse, iterate in seconds with minimal ceremony.
  • Model-light: Start with one model (dots.ocr), optional model selector later. No provider switching UI.
  • Structured output: First-class JSON output and markdown preview.
  • Stateless by default: No run history or persistence beyond the current browser session unless explicitly downloading.
  • Document-centric UX: Multi-page PDF preview, page navigation, per-page execution, and page-range parsing.

Non-Goals

  • Not a full labeling platform or production extraction pipeline.
  • Not a dataset hosting service or long-term data store for PHI.
  • Not a fine-tuning/training product; inference playground only.
  • No bounding-box drawing or manual annotation tools in v1.

Primary Users / Personas

  • Applied Researcher / Data Scientist: Tries different prompts/models, collects structured outputs.
  • ML Engineer: Prototypes pipelines, compares providers, validates latency/cost.
  • Domain Expert (e.g., Clinical Analyst): Uses curated templates to extract specific fields.

Key User Stories

  • As a user, I can upload a PDF or image, select a template prompt, and click Parse to see Markdown and JSON results.
  • As a user, I can preview pages, specify a page range to parse, and run per-page extraction.
  • As a user, I can jump to a specific page index in a PDF and use Prev/Next controls.
  • As a user, I can switch between result tabs (Markdown Render, Markdown Raw Text, Current Page JSON) without losing scroll position.
  • As a user, I can download the results for my current session as a ZIP or JSON/Markdown.
  • As a user, I can tweak the prompt and basic model settings and quickly re-run.

UX Requirements (inspired by dots.ocr playground)

  • Left Panel — Upload & Select
    • Drag-and-drop or file picker for PNG/JPG/WebP/PDF; show file name and size.
    • Optional Examples dropdown (curated sample docs and pre-baked prompts).
    • File ingestion for PDFs extracts page thumbnails and page count.
  • Left Panel — Prompt & Actions
    • Prompt Template select; Current Prompt editor (multiline with variable chips).
    • Actions: Parse (primary), Clear (secondary).
    • Show prompt variables, e.g., bbox, category, page_number.
  • Left Panel — Advanced Configuration
    • Preprocessing toggle (fitz-style DPI upsampling for low-resolution images).
    • Minimal server/model config: Host/Port for local inference or a dropdown for on-host models.
    • Page selection: single page, page range, or all.
  • Center — File Preview
    • Large page preview with pan/zoom; page navigator (Prev/Next and page picker).
    • Page jump field to go directly to page N.
  • Right Panel — Result Display
    • Tabs: Markdown Render Preview, Markdown Raw Text, Current Page JSON.
    • Preserve scroll position when switching tabs.
    • Copy-to-clipboard and a Download Results button (a minimal Blocks layout sketch of this three-panel arrangement follows this list).
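
A minimal Gradio Blocks sketch of the three-panel layout described above. Component names, labels, and the stub parse callback are illustrative assumptions, not a fixed contract; the real callback would call the dots.ocr parser or endpoint.

```python
import gradio as gr

def parse(file, prompt, page_range):
    # Placeholder: a real implementation would call the dots.ocr parser/endpoint.
    return "## Parsed markdown (stub)", "Parsed markdown (stub)", {"elements": [], "page": 1}

with gr.Blocks(title="VLM Playground (PreviewSpace)") as demo:
    with gr.Row():
        with gr.Column(scale=1):  # Left: Upload & Select, Prompt & Actions, Advanced Config
            file_in = gr.File(label="PDF / Image", file_types=[".pdf", ".png", ".jpg", ".webp"])
            template = gr.Dropdown(label="Prompt Template", choices=["Layout Extraction", "Table Extraction"])
            prompt = gr.Textbox(label="Current Prompt", lines=6)
            page_range = gr.Textbox(label="Page range", value="all")
            with gr.Row():
                parse_btn = gr.Button("Parse", variant="primary")
                clear_btn = gr.Button("Clear")
        with gr.Column(scale=2):  # Center: File Preview with navigation
            preview = gr.Image(label="Page preview", interactive=False)
            with gr.Row():
                prev_btn = gr.Button("Prev")
                page_no = gr.Number(label="Page", value=1, precision=0)
                next_btn = gr.Button("Next")
        with gr.Column(scale=2):  # Right: Result Display tabs
            with gr.Tabs():
                with gr.Tab("Markdown Render Preview"):
                    md_render = gr.Markdown()
                with gr.Tab("Markdown Raw Text"):
                    md_raw = gr.Textbox(lines=20, show_copy_button=True)
                with gr.Tab("Current Page JSON"):
                    page_json = gr.JSON()
    parse_btn.click(parse, inputs=[file_in, prompt, page_range], outputs=[md_render, md_raw, page_json])

if __name__ == "__main__":
    demo.launch()
```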

Functional Requirements

  • File Handling
    • Accept PDF (up to 300 pages) and images (PNG/JPG/WebP). Max upload 50 MB (configurable).
    • Extract page images for preview (see the PyMuPDF sketch after this list); store temp files locally (ephemeral) with TTL.
    • Provide page-level selection and batching.
  • Prompting
    • Template library with variables and descriptions. Variables can be sourced from UI state (page, bbox list) or user input.
    • System prompt + user prompt fields; allow few-shot examples.
    • Presets for common tasks (layout extraction, table extraction, key-value extraction, captioning).
  • Model Support
    • Start with dots.ocr via the official parser or REST endpoint.
    • Optional: dropdown to switch among dots.ocr model variants if present on the host. No cross-provider switching UI.
  • Execution
    • Run per-page or whole-document, controlled by UI. Concurrency limit (default 3).
    • Timeouts and retries surfaced to UI; cancellation supported.
    • Caching: request hash on (file checksum, page, prompt, params, model) to avoid recomputation.
  • Outputs
    • Markdown Render, Raw Markdown, and Current Page JSON.
    • Export: Download button to export combined Markdown, per-page JSONL, and all artifacts as a ZIP.
  • Examples Gallery
    • Preloaded example docs and templates to demonstrate patterns (OCR table, K/V extraction, figure captioning, layout detection).
  • Observability
    • Show basic runtime info (latency, model id) inline; no history or centralized logs in v1.
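
Page-image extraction for preview, and the DPI upsampling referenced by the preprocessing toggle, could be handled with PyMuPDF (fitz) roughly as sketched below; the DPI default and output naming are assumptions.

```python
from pathlib import Path

import fitz  # PyMuPDF

def render_pages(pdf_path: str, out_dir: str, dpi: int = 144) -> list[str]:
    """Render each PDF page to a PNG for preview; raise the DPI to upsample low-res pages."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: list[str] = []
    zoom = dpi / 72  # PDF user space is 72 dpi
    matrix = fitz.Matrix(zoom, zoom)
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(matrix=matrix)
            path = out / f"page_{i:04d}.png"
            pix.save(str(path))
            paths.append(str(path))
    return paths
```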

Data Model (high-level)

  • In-memory, per-session structures only; no database.
  • Document: id, name, type, checksum, page count, temp storage path, created_at.
  • Page: document_id, page_index, image_path, width, height, preview_thumbnail.
  • Template: id, name, description, model_defaults, prompt_text, output_schema (optional JSON Schema), variables.
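
A sketch of these per-session structures as plain Python dataclasses; field names follow the bullets above, and the defaults are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Document:
    id: str
    name: str
    type: str               # "pdf" | "image"
    checksum: str           # e.g. SHA-256 of the uploaded bytes
    page_count: int
    temp_storage_path: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Page:
    document_id: str
    page_index: int
    image_path: str
    width: int
    height: int
    preview_thumbnail: str | None = None

@dataclass
class Template:
    id: str
    name: str
    description: str
    prompt_text: str
    model_defaults: dict = field(default_factory=dict)
    output_schema: dict | None = None   # optional JSON Schema
    variables: list[str] = field(default_factory=list)
```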

JSON Output Guidance

  • For structured tasks, templates may specify an output schema. The UI validates model JSON and highlights issues (see the validation sketch after this list).
  • All results stored as JSON lines per page with summary aggregation.
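
A soft-validation sketch, assuming the jsonschema package (any validator would do): parse the model output, then collect issues for the UI to highlight rather than failing hard.

```python
import json

from jsonschema import Draft202012Validator

def validate_output(raw_text: str, schema: dict) -> tuple[dict | None, list[str]]:
    """Return (parsed JSON or None, list of human-readable issues)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"Not valid JSON: {exc}"]
    issues = [
        f"{'/'.join(map(str, err.path)) or '<root>'}: {err.message}"
        for err in Draft202012Validator(schema).iter_errors(data)
    ]
    return data, issues
```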

Security & Compliance

  • Internal-only; access requires SSO or VPN.
  • Sensitive documents (e.g., PHI) processed only against approved providers/endpoints. Warn when a provider is external.
  • Ephemeral storage with TTL auto-clean; configurable retention. Redact logs where needed.

Performance Targets

  • Cold start to first parse: < 10s on typical PDFs (<= 20 pages) when using a remote inference endpoint.
  • Per-page preview render: < 500ms after page image generation.
  • Concurrency: default 3 parallel page requests; configurable up to 10 (see the sketch after this list).
  • Throughput: 1,000 pages/day per user under typical use without manual scaling.
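
Page selection and the per-page concurrency limit could look roughly like the sketch below; the range syntax ("all", "3", "2-5,8") and the thread-pool approach are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Turn 'all', '3', or '2-5,8' (1-based) into sorted zero-based page indices."""
    if spec.strip().lower() in ("", "all"):
        return list(range(page_count))
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            indices.update(range(start - 1, end))
        else:
            indices.add(int(part) - 1)
    return sorted(i for i in indices if 0 <= i < page_count)

def run_pages(pages: list[int], run_one, max_workers: int = 3) -> dict[int, object]:
    """Run `run_one(page_index)` for each selected page with limited parallelism."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(pages, pool.map(run_one, pages)))
```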

Error States & Edge Cases

  • Unsupported file types or oversize files; clear messaging and guardrails.
  • Pages with extreme aspect ratios or very small text; suggest preprocessing.
  • Provider rate limits; exponential backoff and UI feedback.
  • Invalid model JSON; surface diffs and attempt best-effort JSON repair (opt-in).
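
For provider rate limits, a jittered exponential backoff along these lines would suffice; the exception type and timing constants are placeholders, not a fixed policy.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for whatever the client raises on HTTP 429."""

def with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Call `call()`, retrying on rate limits with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)  # a real run would surface this wait to the UI
```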

Architecture (proposed)

  • App: Single Gradio Blocks app (Python). No separate backend required.
  • Execution: Use uv run locally. Designed to run as-is on Hugging Face Spaces.
  • Model: dots.ocr via local parser or REST endpoint; configurable host/port.
  • Storage: Ephemeral /tmp/previewspace/*; cleared at session end or TTL.
  • Caching: Optional on-disk cache keyed by content hash + prompt + params + model.
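
A sketch of the on-disk cache and TTL clean-up under /tmp/previewspace: the key is a hash over (file checksum, page, prompt, params, model) as described above; the TTL default and file layout are assumptions.

```python
import hashlib
import json
import time
from pathlib import Path

CACHE_DIR = Path("/tmp/previewspace/cache")  # ephemeral, TTL-cleaned

def cache_key(file_checksum: str, page_index: int, prompt: str, params: dict, model: str) -> str:
    """Deterministic key over everything that affects the result."""
    payload = json.dumps(
        {"file": file_checksum, "page": page_index, "prompt": prompt,
         "params": params, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_result(key: str) -> dict | None:
    path = CACHE_DIR / f"{key}.json"
    return json.loads(path.read_text()) if path.exists() else None

def store_result(key: str, result: dict) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{key}.json").write_text(json.dumps(result))

def clean_expired(ttl_seconds: int = 24 * 3600) -> None:
    """TTL auto-clean for the ephemeral store."""
    now = time.time()
    for p in CACHE_DIR.glob("*.json"):
        if now - p.stat().st_mtime > ttl_seconds:
            p.unlink(missing_ok=True)
```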

API Surface (v1)

  • Pure Gradio callbacks; no public REST API. Optional: expose simple /healthz.
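
If /healthz is exposed, one option is to mount the Blocks app on FastAPI; this assumes Gradio's mount_gradio_app helper, and the module name in the run comment is illustrative.

```python
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
def healthz() -> dict:
    # Simple liveness probe alongside the Gradio UI.
    return {"status": "ok"}

with gr.Blocks() as demo:
    gr.Markdown("VLM Playground")

# Mount the Gradio UI at the root path; /healthz keeps working because it is
# registered before the mount.
app = gr.mount_gradio_app(app, demo, path="/")

# Run with, e.g.: uv run uvicorn app:app --host 0.0.0.0 --port 7860
```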

Templates (initial set)

  • Layout Extraction: Return list of elements with bbox, category, and text within bbox.
  • Table Extraction: Return rows/columns as structured JSON; include confidence and cell bboxes.
  • Key-Value Extraction: Extract specified fields with locations and normalized values.
  • Captioning/Description: Summarize or caption selected regions or whole pages.
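
The first template above might be encoded like this, with fields matching the Template structure in the data model; the schema, variables, and defaults are illustrative assumptions.

```python
LAYOUT_EXTRACTION = {
    "id": "layout_extraction",
    "name": "Layout Extraction",
    "description": "Return list of elements with bbox, category, and text within bbox.",
    "variables": ["page_number"],
    "output_schema": {
        "type": "object",
        "required": ["elements", "page"],
        "properties": {
            "page": {"type": "integer"},
            "elements": {
                "type": "array",
                "items": {
                    "type": "object",
                    "required": ["bbox", "category", "text"],
                    "properties": {
                        "bbox": {"type": "array", "items": {"type": "number"},
                                 "minItems": 4, "maxItems": 4},
                        "category": {"type": "string"},
                        "text": {"type": "string"},
                    },
                },
            },
        },
    },
    "prompt_text": (
        "Please output the layout information from the PDF page image. "
        "For each element, return bbox, category, and text. "
        "Return JSON with keys 'elements' and 'page' ({page_number})."
    ),
    "model_defaults": {"temperature": 0.0},
}
```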

Privacy-by-Design Defaults

  • Local processing preferred where possible; clear visual indicator when sending to external APIs.
  • Redaction utilities for logs; toggle to disable request logging entirely.

Success Metrics

  • Time-to-first-result after upload.
  • Number of exported result sets and templates re-used.
  • Reduction in manual extraction time for a representative task.
  • User satisfaction (quick pulse after result exports).

Release Plan

  • M1 (v0.1) — Core Playground
    • Upload PDF/image; page preview and navigation.
    • Parse with dots.ocr; show Markdown and JSON results; export JSON/Markdown.
    • Basic model server config (host/port/API key) and preprocessing toggle.
    • Acceptance: A user can replicate a layout extraction example end-to-end in < 2 minutes.
  • M2 (v0.2) — Templates, Regions, and Examples
    • Template library + editor; draw/save bboxes; per-page runs; examples gallery.
    • Multiple providers; concurrency and caching; logs and token usage.
    • Acceptance: A user can create a new template with variables and run it across 10 pages with regions in one click.
  • M3 (v0.3) — Projects and Evals
    • Projects grouping; batch runs over documents; dataset export; simple eval harness with spot checks.
    • Acceptance: A user can run a project over 100 pages and export an evaluation-ready JSONL in < 10 minutes.

Open Questions

  • Do we require strict JSON schema validation with auto-repair, or soft validation with warnings?
  • What are the approved external providers for sensitive documents?
  • Should we include table renderers in the UI, or keep to JSON/Markdown only?
  • How long should run artifacts persist by default (e.g., 7 days)?

Risks & Mitigations

  • External API variability: Abstract through connectors; provide stubs/mocks for local dev.
  • Document diversity: Offer preprocessing toggles and template variables; maintain an examples gallery.
  • Cost visibility: Track token usage and estimated cost per run; warn when large batches are selected.

Appendices

Example: Layout Extraction Prompt (concept)

System: You are a vision-language model that outputs structured JSON only.
User: Please output the layout information from the PDF page image. For each element, return:
- bbox: [x1, y1, x2, y2] in image pixels
- category: string label from {"title","header","paragraph","table","figure","footnote"}
- text: content within bbox
Return JSON: {"elements": [{"bbox": [..], "category": "..", "text": ".."}], "page": <number>}.
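
A hedged sketch of sending this prompt with one page image over REST, assuming an OpenAI-compatible chat-completions server; host, port, and model id are placeholders, and the actual dots.ocr serving interface may differ.

```python
import base64

import requests

SYSTEM = "You are a vision-language model that outputs structured JSON only."
USER = (
    "Please output the layout information from the PDF page image. For each element, return:\n"
    "- bbox: [x1, y1, x2, y2] in image pixels\n"
    '- category: string label from {"title","header","paragraph","table","figure","footnote"}\n'
    "- text: content within bbox\n"
    'Return JSON: {"elements": [{"bbox": [..], "category": "..", "text": ".."}], "page": <number>}.'
)

def parse_page(image_path: str, host: str = "localhost", port: int = 8000) -> str:
    """Send the system/user prompt plus one page image; return the raw model text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "model": "dots.ocr",  # placeholder model id
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": USER},
            ]},
        ],
        "temperature": 0.0,
    }
    resp = requests.post(f"http://{host}:{port}/v1/chat/completions", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```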