AI Coding Assistants Keep Shipping Vulnerable Code -- Here's What We're Doing About It

Community Article Published February 26, 2026

Yesterday, researchers disclosed multiple RCE vulnerabilities in Anthropic's Claude Code. Last month, DeepSeek-R1 was found generating vulnerable code patterns at alarming rates. Checkmarx reports AI-generated code now accounts for 60%+ of some codebases -- much of it containing known vulnerabilities.

The pattern is clear: AI is writing more of our code, but nobody is training it to write secure code.

That's why we built SecureCode -- the largest open security training dataset for AI coding assistants. Today we're sharing a major update.

What's New

SecureCode is now a dataset family with three components:

Dataset	Examples	Focus	Load It
SecureCode	2,185	Everything (with HF configs)	`load_dataset("scthornton/securecode")`
SecureCode Web	1,378	OWASP Top 10 2021	`load_dataset("scthornton/securecode-web")`
SecureCode AI/ML	750	OWASP LLM Top 10 2025	`load_dataset("scthornton/securecode-aiml")`

Framework-specific security patterns (new!): 219 examples covering idiomatic security for Express.js, Django, Spring Boot, Flask, Rails, Laravel, ASP.NET Core, FastAPI, and NestJS. Not just "sanitize your inputs" -- these teach models how each framework's built-in security features actually work.

Parquet format everywhere: All datasets now load cleanly with datasets >= 4.0.0. No deprecated loading scripts, no trust_remote_code.

8 fine-tuned models: We trained models from 3B to 20B parameters (Llama, Qwen, CodeGemma, DeepSeek, CodeLlama, StarCoder2, Granite) -- all open on the Model Collection.

Why This Matters Now

Every example in SecureCode is grounded in a real CVE or security incident. Not synthetic. Not hypothetical. Real breaches:

Equifax (CVE-2017-5638) -- $425M from a Struts RCE
MOVEit (CVE-2023-34362) -- 77M+ individuals affected
Capital One SSRF -- 100M customer records

Each example follows a 4-turn conversation that mirrors how developers actually interact with AI assistants, complete with SIEM rules, logging patterns, and defense-in-depth guidance.

If you're fine-tuning models for code generation, please consider including SecureCode in your training mix. The 45% vulnerable code rate from AI assistants isn't going to fix itself.

Paper: arxiv.org/abs/2512.18542

Datasets mentioned in this article 3

Collections mentioned in this article 1

Teaching Code Models to Write Secure Code: The SecureCode Collection

January 25, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote