arxiv:2602.06854

SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Published on Feb 6 · Submitted by Mingqian Feng on Feb 9
Abstract

A novel framework called SEMA is introduced that effectively trains multi-turn attackers for large language models without relying on existing strategies or external data, achieving state-of-the-art attack success rates while being compact, reproducible, and transferable across different models and datasets.

AI-generated summary

Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward then trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via this reward, which combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average ASR@1 of 80.1% across three closed-source and open-source victim models on AdvBench, a 33.9 percentage-point improvement over the prior SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic red-teaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.
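
The abstract names the three components of the intent-drift-aware reward (intent alignment, compliance risk, and level of detail) but not how they are combined. The sketch below shows one plausible aggregation, assuming each component is a judge score in [0, 1]; the function name, weights, and averaging scheme are illustrative assumptions, not the paper's actual reward.

```python
# Minimal sketch (not the paper's implementation) of combining the three
# judge scores named in the abstract into a single scalar reward.
# Assumption: each component score lies in [0, 1] and is produced by an
# external judge model; the weights are illustrative defaults.

def intent_drift_aware_reward(
    intent_alignment: float,   # does the rollout stay on the original objective?
    compliance_risk: float,    # how likely the victim complies rather than refuses
    detail_level: float,       # how detailed the elicited content is
    w_intent: float = 1.0,
    w_comply: float = 1.0,
    w_detail: float = 1.0,
) -> float:
    """Weighted average of the three reward components.

    Weighting intent alignment penalizes rollouts that drift toward benign or
    irrelevant topics, so an off-intent "success" receives a low reward.
    """
    total_weight = w_intent + w_comply + w_detail
    return (
        w_intent * intent_alignment
        + w_comply * compliance_risk
        + w_detail * detail_level
    ) / total_weight
```

Under this kind of aggregation, setting any weight to zero recovers a reward that ignores that criterion, which is one way to see why anchoring on intent alignment is what prevents drifted conversations from being scored as successes.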

Community

Paper submitter

Multi-turn jailbreaks are the real threat model for safety-aligned chatbots, yet existing attack approaches break in two predictable ways.

First, exploration complexity: every added turn expands the branching factor, and the search space grows combinatorially. Second, intent drift: the conversation slides off the original harmful intent into benign/irrelevant topics, making any "success" meaningless.

In this work, we introduce SEMA, a simple yet effective framework that trains scalable, intent-anchored multi-turn attackers. Across datasets, victim models, and judges, SEMA demonstrates state-of-the-art (SOTA) ASR@1 by a wide margin (e.g., avg. +33.9 pp over the prior SOTA).


