SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks
Abstract
SEMA is a simple yet effective framework that trains multi-turn attackers for large language models without relying on existing strategies or external data, achieving state-of-the-art attack success rates while remaining compact, reproducible, and transferable across models and datasets.
Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward then trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. This reward anchors the harmful intent across turns by combining intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, and our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed- and open-source victim models on AdvBench, 33.9 percentage points above the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
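A minimal sketch of how such an intent-drift-aware reward could be composed is below. The `JudgeScores` fields, the multiplicative gate on intent alignment, and the uniform weights are assumptions made for illustration; they are not the paper's released implementation.

```python
# Illustrative sketch only: the abstract says the intent-drift-aware reward combines
# intent alignment, compliance risk, and level of detail. The judge scores, the
# multiplicative gate on intent alignment, and the weights below are assumptions
# made for exposition, not SEMA's actual reward function.
from dataclasses import dataclass

@dataclass
class JudgeScores:
    intent_alignment: float  # in [0, 1]: does the dialogue still target the original harmful goal?
    compliance_risk: float   # in [0, 1]: how likely the victim is to comply rather than refuse
    detail_level: float      # in [0, 1]: how detailed / actionable the elicited content is

def intent_drift_aware_reward(s: JudgeScores,
                              w_align: float = 1.0,
                              w_risk: float = 1.0,
                              w_detail: float = 1.0) -> float:
    """Weighted combination of the three signals, gated by intent alignment."""
    weighted = (w_align * s.intent_alignment
                + w_risk * s.compliance_risk
                + w_detail * s.detail_level) / (w_align + w_risk + w_detail)
    # The gate anchors the harmful objective: a dialogue that drifts to benign or
    # irrelevant topics earns little reward even if the victim is fully compliant.
    return s.intent_alignment * weighted
```

Under these assumptions, a drifted dialogue with intent alignment near zero scores near zero regardless of how compliant or detailed the victim's answers are, which is the anchoring behavior the abstract describes.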
Community
Multi-turn jailbreaks are the real threat model for safety-aligned chatbots, yet existing attacks break in two predictable ways.
First, exploration complexity: every added turn expands the branching factor, and the search space grows combinatorially. Second, intent drift: the conversation slides off the original harmful intent into benign/irrelevant topics, making any "success" meaningless.
In this work, we introduce SEMA, a simple yet effective framework that trains scalable, intent-anchored multi-turn attackers. Across datasets, victim models, and jailbreak judges, SEMA achieves state-of-the-art (SOTA) ASR@1 by a wide margin (e.g., an average of +33.9 pp over the prior SOTA).
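For intuition, here is a minimal sketch of the open-loop regime described in the abstract, where the attacker plans every turn up front and never conditions on victim replies. The callables `generate_turns` and `query_victim` are hypothetical stand-ins for the attacker and victim models, not functions from the SEMA codebase.

```python
# Minimal sketch of an open-loop multi-turn attack, assuming hypothetical callables
# for the attacker (generate_turns) and the victim (query_victim).
from typing import Callable, Dict, List

def open_loop_attack(generate_turns: Callable[[str, int], List[str]],
                     query_victim: Callable[[List[Dict[str, str]]], str],
                     harmful_goal: str,
                     num_turns: int = 3) -> List[Dict[str, str]]:
    # 1) Plan all adversarial turns at once; no victim feedback enters the loop,
    #    so exploration does not branch on the victim's responses.
    turns = generate_turns(harmful_goal, num_turns)

    # 2) Replay the fixed turns against the victim. With num_turns == 1 this reduces
    #    to an ordinary single-turn attack, so both settings share one interface.
    history: List[Dict[str, str]] = []
    for user_msg in turns:
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": query_victim(history)})
    return history
```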
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay (2026)
- Multi-Turn Jailbreaking of Aligned LLMs via Lexical Anchor Tree Search (2026)
- Knowledge-Driven Multi-Turn Jailbreaking on Large Language Models (2026)
- TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment (2026)
- ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack (2026)
- David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning (2026)
- MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety (2026)