AI Coding Assistants Keep Shipping Vulnerable Code -- Here's What We're Doing About It
Yesterday, researchers disclosed multiple RCE vulnerabilities in Anthropic's Claude Code. Last month, DeepSeek-R1 was found generating vulnerable code patterns at alarming rates. Checkmarx reports AI-generated code now accounts for 60%+ of some codebases -- much of it containing known vulnerabilities.
The pattern is clear: AI is writing more of our code, but nobody is training it to write secure code.
That's why we built SecureCode -- the largest open security training dataset for AI coding assistants. Today we're sharing a major update.
What's New
SecureCode is now a dataset family with three components:
| Dataset | Examples | Focus | Load It |
|---|---|---|---|
| SecureCode | 2,185 | Everything (with HF configs) | load_dataset("scthornton/securecode") |
| SecureCode Web | 1,378 | OWASP Top 10 2021 | load_dataset("scthornton/securecode-web") |
| SecureCode AI/ML | 750 | OWASP LLM Top 10 2025 | load_dataset("scthornton/securecode-aiml") |
Framework-specific security patterns (new!): 219 examples covering idiomatic security for Express.js, Django, Spring Boot, Flask, Rails, Laravel, ASP.NET Core, FastAPI, and NestJS. Not just "sanitize your inputs" -- these teach models how each framework's built-in security features actually work.
Parquet format everywhere: All datasets now load cleanly with datasets >= 4.0.0. No deprecated loading scripts, no trust_remote_code.
8 fine-tuned models: We trained models from 3B to 20B parameters (Llama, Qwen, CodeGemma, DeepSeek, CodeLlama, StarCoder2, Granite) -- all open on the Model Collection.
Why This Matters Now
Every example in SecureCode is grounded in a real CVE or security incident. Not synthetic. Not hypothetical. Real breaches:
- Equifax (CVE-2017-5638) -- $425M from a Struts RCE
- MOVEit (CVE-2023-34362) -- 77M+ individuals affected
- Capital One SSRF -- 100M customer records
Each example follows a 4-turn conversation that mirrors how developers actually interact with AI assistants, complete with SIEM rules, logging patterns, and defense-in-depth guidance.
If you're fine-tuning models for code generation, please consider including SecureCode in your training mix. The 45% vulnerable code rate from AI assistants isn't going to fix itself.
Paper: arxiv.org/abs/2512.18542