Papers
arxiv:2603.12681

Colluding LoRA: A Compositional Vulnerability in LLM Safety Alignment

Published on Mar 30
Authors:

Abstract

Modular LLMs exhibit compositional vulnerabilities where benign adapters jointly compromise safety, revealing limitations in current verification approaches.

AI-generated summary

We show that safety alignment in modular LLMs can exhibit a compositional vulnerability: adapters that appear benign and plausibly functional in isolation can, when linearly composed, compromise safety. We study this failure mode through Colluding LoRA (CoLoRA), in which harmful behavior emerges only in the composition state. Unlike attacks that depend on adversarial prompts or explicit input triggers, this composition-triggered broad refusal suppression causes the model to comply with harmful requests under standard prompts once a particular set of adapters is loaded. This behavior exposes a combinatorial blind spot in current unit-centric defenses, for which exhaustive verification over adapter compositions is computationally intractable. Across several open-weight LLMs, we find that individual adapters remain benign in isolation while their composition yields high attack success rates, indicating that securing modular LLM supply-chains requires moving beyond single-module verification toward composition-aware defenses.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2603.12681
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.12681 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.12681 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.