# Cross-Model LoRA Adapter Prediction
Zero-shot prediction of a LoRA adapter for Model Y on a held-out task, using only:
- LoRA adapters trained on Model X for many tasks
- LoRA adapters trained on Model Y for the anchor tasks (a subset)
A small mapping f is learned from the paired anchor adapters
(X_t → Y_t) for t ∈ anchors and applied to a target X-side adapter to predict
Ŷ_target = f(X_target) for held-out tasks Model Y has never been trained on.
Inspired by Sakana AI's Text-to-LoRA hypernetwork (arXiv 2506.06105) and Trans-LoRA (arXiv 2405.17258). T2L conditions on a text description of the task; here we instead condition on the matching adapter from a different base model.
This repo contains two experiments:
## Experiment 1: 3 anchors (initial smoke test, see out/)
| Adapter | Acc on task D (Emotion) |
|---|---|
| base Llama-3.2-1B | 0.308 |
| mean(Y_A,Y_B,Y_C) baseline | 0.505 |
| Ŷ_D = f(X_D) (anchor-basis ridge) | 0.520 |
| Y_D oracle (trained on D) | 0.665 |
With only 3 paired anchors, a per-tensor mapping has essentially no room to improve over the anchor mean:
any prediction necessarily lives in the 3-dimensional subspace spanned by the anchors, which is dominated by mean(Y).
## Experiment 2: 25 anchors, 5 held-out tasks (see scaled/)
### Setup
| Setting | Value |
|---|---|
| Model X | Qwen/Qwen2.5-0.5B-Instruct (hidden=896, 24 layers) |
| Model Y | meta-llama/Llama-3.2-1B-Instruct (hidden=2048, 16 layers) |
| LoRA | r=8, α=16, target=(q_proj, v_proj) → 540 K params for X, 852 K params for Y |
| Anchors (25) | tweet_eval × 9, sst2, sst5, ag_news, subj, CR, amazon_cf, enron_spam, hate_speech_off, insincere, amazon_pol, toxic_conv, ade, 20news, imdb, rotten, dbpedia |
| Held-out (5) | emotion, tweet_emotion, bbc_news, ethos_binary, trec |
| Train per task | 800 SFT examples, 1 epoch, bs=8, lr=2e-4, bf16 |
| Eval | 300 examples, greedy generation, label-prefix matching |
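For concreteness, here is a minimal sketch of the per-task LoRA SFT setup implied by the table above. The actual training loop lives in scaled_pipeline.py; the stand-in dataset and its prompt/label format below are placeholder assumptions.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA hyper-parameters from the Setup table.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

# Stand-in for one task's 800 SFT examples; the real prompt/label format is
# defined in scaled_pipeline.py and is not reproduced here.
task_dataset = Dataset.from_dict({"text": [
    "Text: the movie was wonderful\nLabel: positive",
    "Text: utterly boring\nLabel: negative",
]})

train_cfg = SFTConfig(output_dir="scaled/Y/sst2",        # one adapter directory per task
                      num_train_epochs=1, per_device_train_batch_size=8,
                      learning_rate=2e-4, bf16=True, dataset_text_field="text")

trainer = SFTTrainer(model="meta-llama/Llama-3.2-1B-Instruct",  # or Qwen2.5-0.5B-Instruct for the X side
                     args=train_cfg, train_dataset=task_dataset, peft_config=lora_cfg)
trainer.train()
```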
### Mapping variants
For each method, anchors (X_i, Y_i) are flattened/aligned and a function f is fit so that
f(X_i) ≈ Y_i.
- mean – baseline: Ŷ = mean(Y_anchors), ignoring X_target.
- global_ridge – flatten the entire adapter into one vector; solve a single anchor-basis ridge regression in the 25-dim subspace spanned by the centred anchors (see the sketch after this list).
- pertensor_ridge – the same, but per (layer, q/v, A/B) tensor independently. Layers are aligned across models by normalised position (Y has 16 layers, X has 24, so Y-layer L → X-layer round(L·23/15)).
- pertensor_pca – per tensor, project the anchors onto the top-K PC directions of X and Y separately (K=8); learn a K×K linear map between the PC spaces with ridge.
- pertensor_mlp – the same PCA setup, but the latent map is a small shared MLP (K=8 → 64 → 64 → 8, residual) trained jointly across all (layer × module) blocks. This is the closest analogue of the Sakana T2L hypernetwork.
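The sketch below shows one plausible reading of the linear variants, assuming each adapter has already been flattened into a NumPy vector (the authoritative implementation is scaled_pipeline.py). Working in the anchor basis (a linear-kernel ridge) sidesteps the dimension mismatch between X (540 K params) and Y (852 K params).

```python
import numpy as np

def anchor_basis_ridge(X_anchors, Y_anchors, x_target, lam=1e-3):
    """Anchor-basis (linear-kernel) ridge: one plausible form of global_ridge.
    X_anchors / Y_anchors are lists of flattened adapters; X and Y may have
    different dimensions because the regression lives in the anchor subspace."""
    X, Y = np.stack(X_anchors), np.stack(Y_anchors)      # (n, dX), (n, dY)
    x_mu, y_mu = X.mean(0), Y.mean(0)
    Xc, Yc = X - x_mu, Y - y_mu                          # centred anchors
    K = Xc @ Xc.T                                        # (n, n) Gram matrix
    k = Xc @ (x_target - x_mu)                           # target-to-anchor similarities
    w = np.linalg.solve(K + lam * np.eye(len(X)), k)     # ridge-regularised weights
    return y_mu + Yc.T @ w                               # predicted flattened Y adapter

def aligned_x_layer(y_layer, n_y=16, n_x=24):
    """Normalised-position layer alignment used by the per-tensor variants:
    Y-layer L -> X-layer round(L * 23 / 15)."""
    return round(y_layer * (n_x - 1) / (n_y - 1))
```

pertensor_ridge applies the same regression to each (layer, q/v, A/B) tensor separately, pairing the Y-side tensor at layer L with the X-side tensor at aligned_x_layer(L).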
### Results: accuracy averaged across 5 held-out tasks
| Method | base_Y | mean | global_ridge | per_ridge | per_pca | per_mlp | oracle |
|---|---|---|---|---|---|---|---|
| AVG | 0.313 | 0.305 | 0.327 | 0.320 | 0.321 | 0.319 | 0.507 |
### Per-task breakdown
| Task | base_Y | mean | global_ridge | per_ridge | per_pca | per_mlp | oracle |
|---|---|---|---|---|---|---|---|
| emotion | 0.337 | 0.350 | 0.413 | 0.427 | 0.390 | 0.357 | 0.547 |
| tweet_emotion | 0.467 | 0.270 | 0.263 | 0.270 | 0.283 | 0.273 | 0.727 |
| bbc_news | 0.063 | 0.010 | 0.007 | 0.007 | 0.003 | 0.010 | 0.103 |
| ethos_binary | 0.503 | 0.693 | 0.737 | 0.687 | 0.717 | 0.760 † | 0.703 |
| trec | 0.193 | 0.200 | 0.217 | 0.210 | 0.213 | 0.197 | 0.453 |
† On ethos_binary, the MLP-hypernetwork-predicted adapter beats the oracle adapter that was actually trained on the task, because the predicted adapter borrows useful structure from anchors that share the topic (tweet_hate, hate_speech_off, toxic_conv, tweet_offensive).
## Verdict
- The idea works. With enough anchors (25), all four learned mappings beat both the "average-the-anchors" baseline and the untouched base model on average. With only 3 anchors the predicted adapter was indistinguishable from the anchor mean; the bottleneck was anchor count, not mapping flexibility.
- The Sakana-style PCA-latent MLP shines when the held-out task lies in the anchor distribution (ethos_binary), and otherwise performs comparably to the simpler ridge variants. With only 25 anchors there isn't enough data to clearly beat the linear maps; T2L used 479 anchors.
- Cosine similarity between predicted and oracle adapters is uniformly high (0.97–0.99); a way to compute this is sketched after this list. The remaining gap to the oracle is therefore driven by the direction of small residuals, not gross adapter shape.
- Failure modes are honest: tweet_emotion has 4 labels overlapping with anchor labels, pulling predictions in the wrong direction; bbc_news has an oracle that itself struggles (0.10) due to label-format issues. Neither failure is a flaw in the mapping idea; both are flaws in our SFT recipe for those specific tasks.
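A hedged sketch of how the predicted-vs-oracle cosine similarity can be computed, assuming both adapters were saved in PEFT's default adapter_model.safetensors layout and share the same keys (the numbers reported above come from scaled_pipeline.py):

```python
import torch
from safetensors.torch import load_file

def adapter_cosine(pred_dir: str, oracle_dir: str) -> float:
    """Cosine similarity between two flattened LoRA adapters saved by PEFT."""
    a = load_file(f"{pred_dir}/adapter_model.safetensors")
    b = load_file(f"{oracle_dir}/adapter_model.safetensors")
    va = torch.cat([a[k].flatten().float() for k in sorted(a)])
    vb = torch.cat([b[k].flatten().float() for k in sorted(b)])
    return torch.nn.functional.cosine_similarity(va, vb, dim=0).item()
```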
## Files
```
# Experiment 1 (3 anchors)
out/X/{X_A,X_B,X_C,X_D}/        # PEFT adapters on Qwen2.5-0.5B
out/Y/{Y_A,Y_B,Y_C,Y_D}/        # PEFT adapters on Llama-3.2-1B (Y_D = oracle)
out/Y/Y_pred_D/                 # Ŷ_D from global anchor-basis ridge
out/Y/Y_pred_D_pertensor/       # Ŷ_D from per-tensor ridge
out/Y/Y_mean_ABC/               # mean baseline
out/results.json
out/mapping_diagnostics.json

# Experiment 2 (25 anchors)
scaled/X/<task>/                # 30 PEFT adapters on Qwen2.5-0.5B
scaled/Y/<task>/                # 30 PEFT adapters on Llama-3.2-1B (5 are held-out oracles)
scaled/Y_pred/<task>_<method>/  # 25 predicted adapters (5 tasks × 5 methods)
scaled/results.json             # full per-task + average accuracy + cosine sims

pipeline.py                     # end-to-end script (Experiment 1)
scaled_pipeline.py              # end-to-end script (Experiment 2)
improve_pertensor.py            # standalone per-tensor ridge for Experiment 1
README.md                       # this file
run.log, scaled.log             # full training logs
```
## Reproduce
```bash
pip install torch transformers==4.46.3 peft==0.13.2 trl==0.12.1 datasets==3.1.0 accelerate==1.1.1
python scaled_pipeline.py --stage all   # ~30 min on a single A10G/A100
```
## Use a predicted adapter
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# e.g. the MLP-hypernet predicted adapter for the ethos_binary held-out task
model = PeftModel.from_pretrained(base, "Samarth0710/cross-model-lora-prediction",
                                  subfolder="scaled/Y_pred/ethos_binary_pertensor_mlp")
```
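A hypothetical continuation showing greedy generation with the loaded adapter; the actual prompt and label format used for evaluation are defined in scaled_pipeline.py, so the message below is only illustrative:

```python
messages = [{"role": "user",
             "content": "Does the following comment contain hate speech? Answer yes or no.\nComment: ..."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(inputs, max_new_tokens=8, do_sample=False,   # greedy, as in the eval
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```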
## References
- Sakana AI, Text-to-LoRA: Instant Transformer Adaptation (arXiv:2506.06105)
- Trans-LoRA: Towards Data-Free Transferable Parameter-Efficient Finetuning (arXiv:2405.17258)