ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Abstract
Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which models that refuse a harmful request comply once it is rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms remain poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the attention heads causally linked to a targeted jailbreak such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of these vulnerable heads. Finally, we apply the vector during "preventative fine-tuning," forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaks while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Our mechanistic analysis further shows how adversarial suffixes suppress the propagation of the refusal-mediating direction. More broadly, our work demonstrates how a deep understanding of model internals can yield practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
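To make the second step concrete, here is a minimal sketch of how a learned channel-wise scaling vector could be applied to one attention head's output via a PyTorch forward hook. All names (`scale_head_hook`, the hook placement, the concatenated-heads layout) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def scale_head_hook(scale: torch.Tensor, head_idx: int, head_dim: int):
    """Forward hook that rescales one attention head's output channel-wise.

    `scale` is a per-channel vector of shape [head_dim]; in ASGuard's setting
    it would be learned, here it is just supplied by the caller.
    """
    def hook(module, inputs, output):
        # output: [batch, seq, n_heads * head_dim] (heads concatenated)
        out = output.clone()
        s, e = head_idx * head_dim, (head_idx + 1) * head_dim
        out[..., s:e] = out[..., s:e] * scale  # channel-wise recalibration
        return out  # returned value replaces the module's output
    return hook

# Toy usage: rescale "head 1" of a stand-in module by 0.5 per channel.
head_dim, n_heads = 4, 2
proj = torch.nn.Identity()  # stands in for an attention output projection
proj.register_forward_hook(
    scale_head_hook(torch.full((head_dim,), 0.5), head_idx=1, head_dim=head_dim)
)
y = proj(torch.ones(1, 3, n_heads * head_dim))
```

In a real model the hook would be registered on the attention output of the specific layer whose head was flagged by the circuit analysis, leaving all other heads untouched.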
Community
ASGuard is a novel mechanistic safety framework that mitigates targeted jailbreak vulnerabilities in LLMs by directly intervening in internal activation dynamics rather than relying solely on data-level alignment.
(1) Background:
Large language models exhibit brittle refusal behavior, where simple linguistic transformations (e.g., tense changes) can bypass safety alignment, revealing a generalization gap in existing alignment methods.
(2) Motivation:
Prior safety approaches lack mechanistic understanding of why jailbreaks succeed, making them ineffective against targeted attacks such as tense-based jailbreaks; this calls for interpretable, circuit-level interventions that preserve utility while improving robustness.
(3) Method:
ASGuard identifies attention heads causally responsible for jailbreak behavior via circuit analysis, learns channel-wise activation scaling to recalibrate these vulnerable components, and integrates this into preventative fine-tuning to enforce robust refusal while maintaining overall model performance.
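The "learns channel-wise activation scaling" step above can be sketched as a small optimization in which only the scaling vector is trainable while the model stays frozen. The loss below is a toy stand-in (damp one hypothetical "vulnerable" channel, keep the rest near identity); the paper's actual safety objective is not specified on this page.

```python
import torch

head_dim = 8
# Only the per-channel scale is a trainable parameter; init at identity scaling.
scale = torch.nn.Parameter(torch.ones(head_dim))
opt = torch.optim.Adam([scale], lr=1e-2)

def toy_safety_loss(scale: torch.Tensor) -> torch.Tensor:
    # Stand-in objective (assumption, not the paper's loss):
    # push the "vulnerable" channel 0 toward 0 while keeping the
    # remaining channels near 1 to preserve utility.
    vulnerable = scale[0] ** 2
    utility = ((scale[1:] - 1.0) ** 2).sum()
    return vulnerable + utility

for _ in range(200):
    opt.zero_grad()
    loss = toy_safety_loss(scale)
    loss.backward()
    opt.step()
```

After training, the learned vector would be applied to the flagged head's activations during preventative fine-tuning, so the model adapts its refusal behavior around the recalibrated circuit rather than routing around it.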
Get this paper in your agent:
hf papers read 2509.25843
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
