Tyler (unmodeled-tyler)
67 followers · 39 following
https://unmodeledtyler.com
unmodeledtyler · unmodeled-tyler
AI & ML interests: research engineer
Recent Activity
updated a Space about 6 hours ago: vanta-research/README
replied to Teen-Different's post about 7 hours ago:
Safety Alignment Collapses Without apply_chat_template(): An Empirical Study

This weekend, I ran an experiment on the safety alignment of several small-scale open models (Qwen2.5, Qwen3, Gemma-3, SmolLM). My objective was to measure the robustness of refusal mechanisms when deviating from canonical chat templates. The finding: safety guarantees effectively collapse when apply_chat_template() is omitted.

METHODOLOGY
I evaluated models in two states:
• In-Distribution: Input wrapped in standard <|im_start|> instruction tokens
• Out-of-Distribution: Input provided as a raw string
For scalable evaluation, I used Qwen3Guard-Gen-4B as an automated judge, classifying responses as Safe, Unsafe, or Controversial.

KEY FINDINGS: REFUSAL COLLAPSE
When "Assistant" formatting tokens are removed, models undergo a distributional shift—reverting from a helpful assistant to a raw completion engine.
• Gemma-3: 100% refusal (aligned) → 60% (raw)
• Qwen3: 80% refusal (aligned) → 40% (raw)
• SmolLM2-1.7B: 0% → 0% (no safety tuning to begin with)

QUALITATIVE FAILURES
The failure modes were not minor. Without the template, models that previously refused harmful queries began outputting high-fidelity harmful content:
• Explosives: Qwen3 generated technical detonation mechanisms
• Explicit content: Requests flatly refused by aligned models were fulfilled with graphic narratives by unaligned versions
This suggests instruction tuning acts as a "soft mask" over the pre-training distribution rather than removing harmful latent knowledge.

👉 Read the full analysis: https://teendifferent.substack.com/p/apply_chat_template-is-the-safety
💻 Reproduction Code: https://github.com/REDDITARUN/experments/tree/main/llm_alignment
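The two evaluation states in the methodology can be sketched as follows. The helper below approximates the ChatML wrapping that `apply_chat_template()` emits for Qwen-family models (a simplifying assumption: no system turn, and exact special tokens vary per model); the trailing comments show how the same contrast is expressed with the 🤗 Transformers API. This is an illustrative sketch, not the post's reproduction code.

```python
# Sketch of the two input states from the experiment above.
# wrap_chatml approximates what tokenizer.apply_chat_template() produces
# for Qwen-style models (assumption: simplified ChatML, no system message).

def wrap_chatml(prompt: str) -> str:
    """In-distribution: prompt wrapped in <|im_start|> role tokens."""
    return (
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

def raw(prompt: str) -> str:
    """Out-of-distribution: the bare string, with no chat template."""
    return prompt

probe = "placeholder probe query"  # illustrative, not from the post

aligned_input = wrap_chatml(probe)  # model responds as a chat assistant
ood_input = raw(probe)              # model falls back to raw completion

# With 🤗 Transformers, the in-distribution case is the canonical call:
#   ids = tok.apply_chat_template([{"role": "user", "content": probe}],
#                                 add_generation_prompt=True,
#                                 return_tensors="pt")
# and the out-of-distribution case is simply:
#   ids = tok(probe, return_tensors="pt").input_ids
```

The collapse the post measures is exactly the behavioral gap between generations conditioned on `aligned_input` versus `ood_input`.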
reacted to Teen-Different's post with 🔥 about 7 hours ago
unmodeled-tyler's models (2)
unmodeled-tyler/atom-micro-1-lab • Text Generation • 0.6B • Updated about 16 hours ago • 7 • 1
unmodeled-tyler/vanta-research-unit-001 • 8B • Updated 9 days ago • 1