YouTubeIntel
From raw YouTube comments to themes, audience positions, editorial briefings, and moderation-ready creator signals
YouTubeIntel is a portfolio-grade, production-style analytics platform that treats YouTube comments as operational data, not just sentiment fluff.
It combines a full-stack product surface with two substantial pipelines:
- a Topic Intelligence Pipeline for clustering, labeling, briefing, and audience positions;
- an Appeal Analytics Pipeline for criticism, questions, appeals, toxicity, and moderation support.
Quick Start · Architecture · Topic Pipeline · Appeal Pipeline · API
Desktop Version Available
If this project makes you think “this should have more stars”, you’re probably right.
The pitch
Most YouTube analytics tools stop at vanity metrics.
They tell you:
- views,
- engagement,
- rough sentiment,
- maybe some keyword clouds.
They usually do not tell you:
- what the audience is actually arguing about;
- which positions exist inside each topic;
- what comments are actionable for the next episode;
- what criticism is genuinely constructive;
- which toxic messages should be reviewed or escalated.
YouTubeIntel is built to solve exactly that.
Why this repo is worth starring
What the product does today
Core capabilities
- Topic Intelligence Pipeline for semantic clustering, topic labeling, audience-position extraction, and daily briefing generation
- Appeal Analytics Pipeline for author-directed criticism, questions, appeals, and toxicity routing
- Hybrid moderation with rule-based filtering, optional LLM borderline moderation, toxic target detection, and review queues
- Operator UI for runs, reports, runtime settings, budget visibility, and appeal review
- Asynchronous processing via Celery + Redis
- Docker-first environment with PostgreSQL, Redis, backend, worker, and beat
- Desktop companion in `desktop/` for local packaged delivery
Operator-facing routes
- `/ui` – dashboard shell
- `/ui/videos` – recent videos + status monitor
- `/ui/budget` – budget + runtime settings
- `/ui/reports/:videoId` – full topic report detail
- `/ui/appeal/:videoId` – appeal analytics + toxic moderation workflow
- `/docs` – Swagger UI
Architecture
YouTube Data API -> Comment fetch -> Preprocess + moderation
-> Topic Intelligence Pipeline -> Markdown/HTML reports
-> Topic Intelligence Pipeline -> PostgreSQL
-> Appeal Analytics Pipeline -> PostgreSQL
React + Vite SPA -> FastAPI -> PostgreSQL
Celery worker -> FastAPI
Celery beat -> Celery worker
Redis -> Celery worker
Redis -> Celery beat
Stack overview
| Layer | Technology |
|---|---|
| API layer | FastAPI |
| Persistence | SQLAlchemy + PostgreSQL |
| Background execution | Celery + Redis |
| ML / NLP | SentenceTransformers, HDBSCAN, scikit-learn |
| LLM layer | OpenAI-compatible chat + embeddings |
| Frontend | React 18, TypeScript, Vite |
| Desktop packaging companion | `desktop/` |
Topic Intelligence Pipeline
Entry point: `DailyRunService` in `app/services/pipeline/runner.py`
This is the “what is the audience discussing, how are they split, and what should the creator do next?” pipeline.
1) Context
-> 2) Comments fetch
-> 3) Preprocess + moderation
-> 4) Persist comments
-> 5) Embeddings
-> 6) Clustering
-> 7) Episode match (compatibility stage, skipped)
-> 8) Labeling + audience positions
-> 9) Persist clusters
-> 10) Briefing build
-> 11) Report export
What happens in each stage
| # | Stage | What it does |
|---|---|---|
| 1 | Context | Loads prior report context for continuity |
| 2 | Comments fetch | Pulls comments via the YouTube Data API |
| 3 | Preprocess + moderation | Filters low-signal items, normalizes text, applies rule-based moderation, optionally runs borderline LLM moderation |
| 4 | Persist comments | Stores processed comments in PostgreSQL |
| 5 | Embeddings | Builds vectors with local embeddings or OpenAI |
| 6 | Clustering | Groups comments into themes with fallback logic for edge cases |
| 7 | Episode match | Preserved as a compatibility stage and explicitly skipped in the active runtime |
| 8 | Labeling + audience positions | Produces titles, summaries, and intra-topic positions |
| 9 | Persist clusters | Saves clusters and membership mappings |
| 10 | Briefing build | Generates executive summary, actions, risks, and topic-level findings |
| 11 | Report export | Writes Markdown/HTML reports and stores structured JSON |
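The staged run above can be pictured as an ordered loop with a skip set. This is a minimal stdlib sketch; stage names mirror the table, but the function and data structures are illustrative, not the real `DailyRunService` internals:

```python
# Hypothetical sketch of the 11-stage run loop. Stage names follow the table
# above; the runner itself is an illustration, not the actual service code.

STAGES = [
    "context", "comments_fetch", "preprocess_moderation", "persist_comments",
    "embeddings", "clustering", "episode_match", "labeling_positions",
    "persist_clusters", "briefing_build", "report_export",
]

# episode_match is preserved for compatibility but skipped at runtime
SKIPPED = {"episode_match"}


def run_pipeline(video_id: str) -> list[str]:
    """Execute stages in order, recording which ones actually ran."""
    executed = []
    for stage in STAGES:
        if stage in SKIPPED:
            continue  # compatibility stage: kept in the list, never executed
        executed.append(stage)
    return executed
```

Keeping the skipped stage in the ordered list (rather than deleting it) preserves stage numbering for older reports and logs.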
Main outputs
- top themes by discussion weight
- representative quotes and question comments
- audience positions inside each theme
- editorial briefing for the next content cycle, including actions, misunderstandings, audience requests, risks, and trend deltas
- moderation and degradation diagnostics
Appeal Analytics Pipeline
Entry point: `AppealAnalyticsService` in `app/services/appeal_analytics/runner.py`
This is the “what is being said to the creator?” pipeline.
Load/fetch video comments
-> Unified LLM classification
-> Question refiner
-> Political criticism filter
-> Toxic target classification
-> Confidence + target routing:
- Auto-ban block
- Manual review block
- Ignore third-party insults
-> Persist appeal blocks
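The final persistence step can be sketched as a label-to-block mapping; block names come from this pipeline, while the function and store shape are hypothetical:

```python
# Illustrative persistence step: classified comments land in named blocks,
# while the internal `skip` outcome is dropped. The function is a sketch,
# not the actual AppealAnalyticsService code.

BLOCKS = {
    "constructive_question",
    "constructive_criticism",
    "author_appeal",
    "toxic_auto_banned",
    "toxic_manual_review",
}


def persist_outcome(label: str, store: dict[str, list[str]], comment: str) -> None:
    """Append a comment to its block; `skip` is an internal outcome only."""
    if label == "skip":
        return  # classified, but never persisted as a block
    if label not in BLOCKS:
        raise ValueError(f"unknown classification outcome: {label}")
    store.setdefault(label, []).append(comment)
```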
Persisted blocks
| Block | Meaning |
|---|---|
| `constructive_question` | creator-directed questions |
| `constructive_criticism` | criticism backed by an actual political argument |
| `author_appeal` | direct requests / appeals to the creator |
| `toxic_auto_banned` | toxic comments that passed final verification and were auto-moderated |
| `toxic_manual_review` | toxic comments queued for admin review |
`skip` is still used internally as a classification outcome, but it is not persisted as a block.
Pipeline behaviors that matter
- question candidates get a second-pass refiner;
- criticism with question signal can be promoted into the question block;
- low-value `attack_ragebait` / `meme_one_liner` question candidates are demoted out of `constructive_question`;
- toxic comments are classified by target (`author`, `guest`, `content`, `undefined`, `third_party`);
- routing splits comments into auto-ban, manual review, or ignore;
- comments with toxicity confidence >= 0.80 enter the auto-ban path, but a final strict verification pass can still downgrade them into manual review;
- auto-banned authors can be unbanned from the UI if the operator spots a false positive;
- per-video guest names can improve targeting accuracy.
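The routing behaviors above condense into one decision function. The threshold and branch order follow the description (0.80 auto-ban entry, strict second pass can downgrade, third-party insults ignored), but the function itself is a sketch, not the real router:

```python
# Sketch of the toxic-comment routing described above; names and the exact
# branch structure are illustrative, not the production implementation.

AUTO_BAN_THRESHOLD = 0.80  # documented default


def route_toxic(target: str, confidence: float, passes_strict_check: bool) -> str:
    """Route a toxic comment to auto-ban, manual review, or ignore."""
    if target == "third_party":
        return "ignore"  # insults aimed at third parties are not actioned
    if confidence >= AUTO_BAN_THRESHOLD:
        # auto-ban candidates still face a final strict verification pass,
        # which can downgrade them into the manual-review queue
        return "auto_ban" if passes_strict_check else "manual_review"
    return "manual_review"
```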
API Reference
Interactive docs live at `/docs` when the backend is running.
Core runs and reports
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/health` | health and OpenAI endpoint metadata |
| POST | `/run/latest` | run Topic Intelligence for the latest playlist video |
| POST | `/run/video` | run Topic Intelligence for a specific video URL |
| GET | `/videos` | recent videos |
| GET | `/videos/statuses` | progress/status dashboard payload |
| GET | `/videos/{video_id}` | single-video metadata |
| GET | `/reports/latest` | latest report |
| GET | `/reports/{video_id}` | latest report for one video |
| GET | `/reports/{video_id}/detail` | enriched report with comments and positions |
Appeal analytics and moderation
| Method | Endpoint | Purpose |
|---|---|---|
| POST | `/appeal/run` | run Appeal Analytics |
| GET | `/appeal/{video_id}` | latest appeal analytics result |
| GET | `/appeal/{video_id}/author/{author_name}` | all comments by one author |
| GET | `/appeal/{video_id}/toxic-review` | manual toxic-review queue |
| POST | `/appeal/ban-user` | manual ban action |
| POST | `/appeal/unban-user` | restore a previously banned commenter |
Runtime and settings
| Method | Endpoint | Purpose |
|---|---|---|
| GET | `/settings/video-guests/{video_id}` | load guest names |
| PUT | `/settings/video-guests/{video_id}` | update guest names |
| GET | `/budget` | OpenAI usage snapshot |
| GET | `/settings/runtime` | current mutable runtime settings |
| PUT | `/settings/runtime` | update mutable runtime settings |
| GET | `/app/setup/status` | desktop-only: first-run setup status |
| POST | `/app/setup` | desktop-only: save desktop bootstrap secrets |
| PUT | `/app/setup` | desktop-only: rotate desktop bootstrap secrets / OAuth values |
For request examples, see docs/requests.md.
Quick Start
Docker Compose
cp .env-docker.example .env-docker
# Same template as `.env.example`, but DATABASE_URL / Celery URLs use Compose hostnames (`db`, `redis`).
# Fill in YOUTUBE_API_KEY, YOUTUBE_PLAYLIST_ID, OPENAI_API_KEY (and OAuth if you use moderation actions).
docker compose up --build
Services started:
- PostgreSQL
- Redis
- FastAPI backend
- Celery worker
- Celery beat
App URLs:
- UI: `http://localhost:8000/ui`
- Swagger: `http://localhost:8000/docs`
- Health: `http://localhost:8000/health`
Local development
cp .env.example .env
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .\.venv\Scripts\Activate.ps1 # Windows PowerShell
pip install -U pip
pip install -r requirements-dev.txt
docker compose up -d db redis
alembic upgrade head
uvicorn app.main:app --reload
Frontend in a separate terminal:
npm --prefix frontend ci
npm --prefix frontend run dev
Workers in separate terminals:
celery -A app.workers.celery_app:celery_app worker --loglevel=INFO
celery -A app.workers.celery_app:celery_app beat --loglevel=INFO
Configuration
Primary configuration surfaces: the `.env` / `.env-docker` files and the `/settings/runtime` endpoints.
Most important variables
| Variable | Why it matters |
|---|---|
| `YOUTUBE_API_KEY` | YouTube Data API access |
| `YOUTUBE_PLAYLIST_ID` | latest-video runs |
| `OPENAI_API_KEY` | classification, labeling, moderation |
| `OPENAI_MAX_USD_PER_RUN` / `OPENAI_HARD_BUDGET_ENFORCED` | optional per-run spend guardrails |
| `YOUTUBE_OAUTH_CLIENT_ID` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_CLIENT_SECRET` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_REFRESH_TOKEN` | optional YouTube moderation / restore actions |
| `DATABASE_URL` | PostgreSQL persistence |
| `CELERY_BROKER_URL` | Redis broker |
| `CELERY_RESULT_BACKEND` | Redis result storage |
| `EMBEDDING_MODE` | `local` or `openai` |
| `LOCAL_EMBEDDING_MODEL` | recommended local topic-clustering model |
| `AUTO_BAN_THRESHOLD` | appeal/toxic: auto-hide when UI confidence ≥ this (default 0.80) |
| `TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD` | should match `AUTO_BAN_THRESHOLD` so the first-pass score is trusted; the stricter second LLM pass runs only when the score is below this |
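The `EMBEDDING_MODE` switch can be pictured like this; the function and its return values are hypothetical stand-ins for the local and OpenAI embedding backends:

```python
# Illustrative backend selection from EMBEDDING_MODE. Return values are
# placeholder identifiers, not the actual embedding objects the app builds.
import os


def pick_embedder() -> str:
    """Choose an embedding backend from EMBEDDING_MODE (local | openai)."""
    mode = os.environ.get("EMBEDDING_MODE", "local").lower()
    if mode == "openai":
        return "openai-embeddings"  # OpenAI-compatible /embeddings endpoint
    if mode == "local":
        # e.g. a SentenceTransformers model named by LOCAL_EMBEDDING_MODEL;
        # the fallback model name here is only an example
        return os.environ.get("LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
    raise ValueError(f"unsupported EMBEDDING_MODE: {mode}")
```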
Runtime notes
- `episode_match` / transcription fields still exist for compatibility, but that stage is skipped in the active runtime;
- budget usage is tracked and visible via UI/API;
- guest-name configuration improves appeal/toxic targeting quality.
- for historical A/B testing of local embedding models, run `PYTHONPATH=. python scripts/benchmark_topic_models.py`.
Quality Gates
Recommended checks:
ruff check .
black --check .
pytest -q
( cd desktop && pytest -q )
npm --prefix frontend run build
CI in `.github/workflows/ci.yml` covers:
- Python lint / formatting (`ruff`, `black`) on the full tree (including `desktop/`)
- root `pytest` and `desktop/` `pytest`
- frontend production build
Project Structure
app/ FastAPI app, schemas, services, workers
frontend/ React SPA
alembic/ database migrations
tests/ pytest suite
scripts/ startup scripts for api/worker/beat
desktop/ desktop packaging companion
docs/PIPELINE.md pipeline-level notes
docs/requests.md endpoint request reference
Helpful companion docs
- `docs/PIPELINE.md`
- `docs/requests.md`
- `docs/README.md`
- `app/README.md`
- `frontend/README.md`
- `tests/README.md`
- `desktop/README.md`
Final note
This repo is deliberately built to feel like a serious internal analytics product, not just a demo.
If you like projects that combine:
- real product thinking,
- non-trivial data/LLM pipelines,
- backend + frontend + infra,
- and a strong GitHub presentation,
YouTubeIntel was made for that exact intersection.
License
See LICENSE.