# Cross-Model LoRA Adapter Prediction
Zero-shot prediction of a LoRA adapter for Model Y on a held-out task, using only:
- LoRA adapters trained on Model X for many tasks
- LoRA adapters trained on Model Y for the anchor tasks (a subset)
A small mapping f is learned from the paired anchor adapters
(X_t → Y_t) for t ∈ anchors and applied to a target X-side adapter to predict
Ŷ_target = f(X_target) for held-out tasks Model Y has never been trained on.
Inspired by Sakana AI's Text-to-LoRA hypernetwork (arXiv 2506.06105) and Trans-LoRA (arXiv 2405.17258). T2L conditions on a text description of the task; here we instead condition on the matching adapter from a different base model.
This repo contains two experiments:
## Experiment 1: 3 anchors (initial smoke test, see out/)
| Adapter | Acc on task D (Emotion) |
|---|---|
| base Llama-3.2-1B | 0.308 |
| mean(Y_A,Y_B,Y_C) baseline | 0.505 |
| Ŷ_D = f(X_D) (anchor-basis ridge) | 0.520 |
| Y_D oracle (trained on D) | 0.665 |
With only 3 paired anchors, a per-tensor mapping has essentially no room to improve over the anchor mean:
any prediction necessarily lives in the 3-dimensional subspace spanned by the anchors, which is dominated by mean(Y).
## Experiment 2: 25 anchors, 5 held-out tasks (see scaled/)
### Setup
| Setting | Value |
|---|---|
| Model X | Qwen/Qwen2.5-0.5B-Instruct (hidden=896, 24 layers) |
| Model Y | meta-llama/Llama-3.2-1B-Instruct (hidden=2048, 16 layers) |
| LoRA | r=8, α=16, target=(q_proj, v_proj) → 540 K params for X, 852 K params for Y |
| Anchors (25) | tweet_eval × 9, sst2, sst5, ag_news, subj, CR, amazon_cf, enron_spam, hate_speech_off, insincere, amazon_pol, toxic_conv, ade, 20news, imdb, rotten, dbpedia |
| Held-out (5) | emotion, tweet_emotion, bbc_news, ethos_binary, trec |
| Train per task | 800 SFT examples, 1 epoch, bs=8, lr=2e-4, bf16 |
| Eval | 300 examples, greedy generation, label-prefix matching |
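For concreteness, here is a minimal sketch of the per-task LoRA SFT setup implied by the table above. The actual training loop lives in scaled_pipeline.py; the stand-in dataset and its prompt/label format below are placeholder assumptions.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA hyper-parameters from the Setup table.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

# Stand-in for one task's 800 SFT examples; the real prompt/label format is
# defined in scaled_pipeline.py and is not reproduced here.
task_dataset = Dataset.from_dict({"text": [
    "Text: the movie was wonderful\nLabel: positive",
    "Text: utterly boring\nLabel: negative",
]})

train_cfg = SFTConfig(output_dir="scaled/Y/sst2",        # one adapter directory per task
                      num_train_epochs=1, per_device_train_batch_size=8,
                      learning_rate=2e-4, bf16=True, dataset_text_field="text")

trainer = SFTTrainer(model="meta-llama/Llama-3.2-1B-Instruct",  # or Qwen2.5-0.5B-Instruct for the X side
                     args=train_cfg, train_dataset=task_dataset, peft_config=lora_cfg)
trainer.train()
```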
### Mapping variants
For each method, anchors (X_i, Y_i) are flattened/aligned and a function f is fit so that
f(X_i) ≈ Y_i.
- mean – baseline: Ŷ = mean(Y_anchors), ignoring X_target.
- global_ridge – flatten the entire adapter into one vector; solve a single anchor-basis ridge regression in the 25-dim subspace spanned by the centred anchors (see the sketch after this list).
- pertensor_ridge – the same, but per (layer, q/v, A/B) tensor independently. Layers are aligned across models by normalised position (Y has 16 layers, X has 24, so Y-layer L → X-layer round(L·23/15)).
- pertensor_pca – per tensor, project the anchors onto the top-K PC directions of X and Y separately (K=8); learn a K×K linear map between the PC spaces with ridge.
- pertensor_mlp – the same PCA setup, but the latent map is a small shared MLP (K=8 → 64 → 64 → 8, residual) trained jointly across all (layer × module) blocks. This is the closest analogue of the Sakana T2L hypernetwork.
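The sketch below shows one plausible reading of the linear variants, assuming each adapter has already been flattened into a NumPy vector (the authoritative implementation is scaled_pipeline.py). Working in the anchor basis (a linear-kernel ridge) sidesteps the dimension mismatch between X (540 K params) and Y (852 K params).

```python
import numpy as np

def anchor_basis_ridge(X_anchors, Y_anchors, x_target, lam=1e-3):
    """Anchor-basis (linear-kernel) ridge: one plausible form of global_ridge.
    X_anchors / Y_anchors are lists of flattened adapters; X and Y may have
    different dimensions because the regression lives in the anchor subspace."""
    X, Y = np.stack(X_anchors), np.stack(Y_anchors)      # (n, dX), (n, dY)
    x_mu, y_mu = X.mean(0), Y.mean(0)
    Xc, Yc = X - x_mu, Y - y_mu                          # centred anchors
    K = Xc @ Xc.T                                        # (n, n) Gram matrix
    k = Xc @ (x_target - x_mu)                           # target-to-anchor similarities
    w = np.linalg.solve(K + lam * np.eye(len(X)), k)     # ridge-regularised weights
    return y_mu + Yc.T @ w                               # predicted flattened Y adapter

def aligned_x_layer(y_layer, n_y=16, n_x=24):
    """Normalised-position layer alignment used by the per-tensor variants:
    Y-layer L -> X-layer round(L * 23 / 15)."""
    return round(y_layer * (n_x - 1) / (n_y - 1))
```

pertensor_ridge applies the same regression to each (layer, q/v, A/B) tensor separately, pairing the Y-side tensor at layer L with the X-side tensor at aligned_x_layer(L).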
### Results: accuracy averaged across 5 held-out tasks
| Method | base_Y | mean | global_ridge | per_ridge | per_pca | per_mlp | oracle |
|---|---|---|---|---|---|---|---|
| AVG | 0.313 | 0.305 | 0.327 | 0.320 | 0.321 | 0.319 | 0.507 |
### Per-task breakdown
| Task | base_Y | mean | global_ridge | per_ridge | per_pca | per_mlp | oracle |
|---|---|---|---|---|---|---|---|
| emotion | 0.337 | 0.350 | 0.413 | 0.427 | 0.390 | 0.357 | 0.547 |
| tweet_emotion | 0.467 | 0.270 | 0.263 | 0.270 | 0.283 | 0.273 | 0.727 |
| bbc_news | 0.063 | 0.010 | 0.007 | 0.007 | 0.003 | 0.010 | 0.103 |
| ethos_binary | 0.503 | 0.693 | 0.737 | 0.687 | 0.717 | 0.760 † | 0.703 |
| trec | 0.193 | 0.200 | 0.217 | 0.210 | 0.213 | 0.197 | 0.453 |
† On ethos_binary, the MLP-hypernetwork-predicted adapter beats the oracle adapter that was actually trained on the task, because the predicted adapter borrows useful structure from anchors that share the topic (tweet_hate, hate_speech_off, toxic_conv, tweet_offensive).
## Verdict
- The idea works. With enough anchors (25), all four learned mappings beat both the "average-the-anchors" baseline and the untouched base model on average. With only 3 anchors the predicted adapter was indistinguishable from the anchor mean; the bottleneck was anchor count, not mapping flexibility.
- The Sakana-style PCA-latent MLP shines when the held-out task lies in the anchor distribution (ethos_binary), and otherwise performs comparably to the simpler ridge variants. With only 25 anchors there isn't enough data to clearly beat the linear maps; T2L used 479 anchors.
- Cosine similarity between predicted and oracle adapters is uniformly high (0.97–0.99); a way to compute this is sketched after this list. The remaining gap to the oracle is therefore driven by the direction of small residuals, not gross adapter shape.
- Failure modes are honest: tweet_emotion has 4 labels overlapping with anchor labels, pulling predictions in the wrong direction; bbc_news has an oracle that itself struggles (0.10) due to label-format issues. Neither failure is a flaw in the mapping idea; both are flaws in our SFT recipe for those specific tasks.
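A hedged sketch of how the predicted-vs-oracle cosine similarity can be computed, assuming both adapters were saved in PEFT's default adapter_model.safetensors layout and share the same keys (the numbers reported above come from scaled_pipeline.py):

```python
import torch
from safetensors.torch import load_file

def adapter_cosine(pred_dir: str, oracle_dir: str) -> float:
    """Cosine similarity between two flattened LoRA adapters saved by PEFT."""
    a = load_file(f"{pred_dir}/adapter_model.safetensors")
    b = load_file(f"{oracle_dir}/adapter_model.safetensors")
    va = torch.cat([a[k].flatten().float() for k in sorted(a)])
    vb = torch.cat([b[k].flatten().float() for k in sorted(b)])
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
```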
## Files
```
# Experiment 1 (3 anchors)
out/X/{X_A,X_B,X_C,X_D}/        # PEFT adapters on Qwen2.5-0.5B
out/Y/{Y_A,Y_B,Y_C,Y_D}/        # PEFT adapters on Llama-3.2-1B (Y_D = oracle)
out/Y/Y_pred_D/                 # Ŷ_D from global anchor-basis ridge
out/Y/Y_pred_D_pertensor/       # Ŷ_D from per-tensor ridge
out/Y/Y_mean_ABC/               # mean baseline
out/results.json
out/mapping_diagnostics.json

# Experiment 2 (25 anchors)
scaled/X/<task>/                # 30 PEFT adapters on Qwen2.5-0.5B
scaled/Y/<task>/                # 30 PEFT adapters on Llama-3.2-1B (5 are held-out oracles)
scaled/Y_pred/<task>_<method>/  # 25 predicted adapters (5 tasks × 5 methods)
scaled/results.json             # full per-task + average accuracy + cosine sims

pipeline.py                     # end-to-end script (Experiment 1)
scaled_pipeline.py              # end-to-end script (Experiment 2)
improve_pertensor.py            # standalone per-tensor ridge for Experiment 1
README.md                       # this file
run.log, scaled.log             # full training logs
```
## Reproduce
```bash
pip install torch transformers==4.46.3 peft==0.13.2 trl==0.12.1 datasets==3.1.0 accelerate==1.1.1
python scaled_pipeline.py --stage all   # ~30 min on a single A10G/A100
```
## Use a predicted adapter
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# e.g. the MLP-hypernet predicted adapter for the ethos_binary held-out task
model = PeftModel.from_pretrained(base, "Samarth0710/cross-model-lora-prediction",
                                  subfolder="scaled/Y_pred/ethos_binary_pertensor_mlp")
```
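A hypothetical continuation showing greedy generation with the loaded adapter; the actual prompt and label format used for evaluation are defined in scaled_pipeline.py, so the message below is only illustrative:

```python
messages = [{"role": "user",
             "content": "Does the following comment contain hate speech? Answer yes or no.\nComment: ..."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=8, do_sample=False,   # greedy, as in the eval
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```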
## References
- Sakana AI, Text-to-LoRA: Instant Transformer Adaptation (arXiv:2506.06105)
- Trans-LoRA: Towards Data-Free Transferable Parameter-Efficient Finetuning (arXiv:2405.17258)