AI Coding Assistants Keep Shipping Vulnerable Code -- Here's What We're Doing About It

Community Article Published February 26, 2026

Yesterday, researchers disclosed multiple RCE vulnerabilities in Anthropic's Claude Code. Last month, DeepSeek-R1 was found generating vulnerable code patterns at alarming rates. Checkmarx reports AI-generated code now accounts for 60%+ of some codebases -- much of it containing known vulnerabilities.

The pattern is clear: AI is writing more of our code, but nobody is training it to write secure code.

That's why we built SecureCode -- the largest open security training dataset for AI coding assistants. Today we're sharing a major update.

What's New

SecureCode is now a dataset family with three components:

Dataset Examples Focus Load It
SecureCode 2,185 Everything (with HF configs) load_dataset("scthornton/securecode")
SecureCode Web 1,378 OWASP Top 10 2021 load_dataset("scthornton/securecode-web")
SecureCode AI/ML 750 OWASP LLM Top 10 2025 load_dataset("scthornton/securecode-aiml")

Framework-specific security patterns (new!): 219 examples covering idiomatic security for Express.js, Django, Spring Boot, Flask, Rails, Laravel, ASP.NET Core, FastAPI, and NestJS. Not just "sanitize your inputs" -- these teach models how each framework's built-in security features actually work.

Parquet format everywhere: All datasets now load cleanly with datasets >= 4.0.0. No deprecated loading scripts, no trust_remote_code.

8 fine-tuned models: We trained models from 3B to 20B parameters (Llama, Qwen, CodeGemma, DeepSeek, CodeLlama, StarCoder2, Granite) -- all open on the Model Collection.

Why This Matters Now

Every example in SecureCode is grounded in a real CVE or security incident. Not synthetic. Not hypothetical. Real breaches:

  • Equifax (CVE-2017-5638) -- $425M from a Struts RCE
  • MOVEit (CVE-2023-34362) -- 77M+ individuals affected
  • Capital One SSRF -- 100M customer records

Each example follows a 4-turn conversation that mirrors how developers actually interact with AI assistants, complete with SIEM rules, logging patterns, and defense-in-depth guidance.

If you're fine-tuning models for code generation, please consider including SecureCode in your training mix. The 45% vulnerable code rate from AI assistants isn't going to fix itself.

Paper: arxiv.org/abs/2512.18542

Community

Sign up or log in to comment