SANA-Video Tom & Jerry LoRA (rank 16)
LoRA adapter for SANA-Video 2B fine-tuned on a small dataset of Tom & Jerry style clips at 224×224 resolution. The goal is to adapt the base 480p SANA-Video model toward classic 2D slapstick cartoon style while keeping its text-conditioning ability.
- Base model: Efficient-Large-Model/SANA-Video_2B_480p_diffusers
- Task: class-style video generation ("Tom & Jerry" slapstick cartoon)
- Backbone: SANA-Video transformer only (VAE + text encoder frozen)
- Fine-tuning: LoRA on attention / MLP layers
- Resolution: 224×224 crops (clips ~5 seconds @ 16 FPS)
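The clip settings above imply a fixed sample shape. As a rough sketch, the frame count and crop box could be computed as follows. Center cropping is an assumption here (the card does not say how the 224×224 crops were taken), and the 480×832 source frame size is hypothetical.

```python
def frames_per_clip(seconds: float, fps: int) -> int:
    """Number of frames in a clip of the given duration."""
    return round(seconds * fps)


def center_crop_box(height: int, width: int, size: int = 224) -> tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a centered size x size crop.

    Assumes the frame is at least `size` pixels on each side.
    """
    if height < size or width < size:
        raise ValueError("frame is smaller than the crop size")
    top = (height - size) // 2
    left = (width - size) // 2
    return top, left, top + size, left + size


n_frames = frames_per_clip(5, 16)   # ~5 s at 16 FPS
box = center_crop_box(480, 832)     # hypothetical source frame size
```

With these numbers, a clip has about 80 frames, each cropped to 224×224 pixels.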
Checkpoints
Checkpoints were saved periodically during training. Each one was evaluated by generating the same prompt with the same seed.
Each checkpoint has a short GIF preview, so you can see how the style evolves over training.
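To reproduce this kind of step-by-step comparison, the checkpoints need to be visited in training order. The sketch below sorts files by the step number in their name. It assumes every file follows the lora_step_*.pt pattern with a numeric step; the helper names and the example file list are hypothetical.

```python
import re
from pathlib import Path

# Matches names such as lora_step_010000.pt and captures the step digits.
STEP_PATTERN = re.compile(r"lora_step_(\d+)\.pt$")


def checkpoint_step(path: Path) -> int:
    """Extract the training step from a checkpoint file name."""
    match = STEP_PATTERN.search(path.name)
    if match is None:
        raise ValueError(f"not a LoRA checkpoint name: {path.name}")
    return int(match.group(1))


def sorted_checkpoints(paths):
    """Order checkpoint paths by training step, earliest first."""
    return sorted((Path(p) for p in paths), key=checkpoint_step)


# Hypothetical file names following the lora_step_*.pt pattern.
example = ["lora_step_010000.pt", "lora_step_002000.pt", "lora_step_006000.pt"]
ordered = [p.name for p in sorted_checkpoints(example)]
```

Each path in the sorted list can then be loaded in turn with the same prompt and seed.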
Usage (Diffusers + PEFT)
This repo exposes raw LoRA checkpoints (lora_step_*.pt).
Here is a minimal example that:
- Loads the base SANA-Video 2B 480p pipeline.
- Wraps the transformer with a LoRA adapter (same config as training).
- Loads a single LoRA checkpoint and runs one prompt.
from pathlib import Path

import torch
from diffusers import SanaVideoPipeline
from diffusers.utils import export_to_video
from peft import LoraConfig, get_peft_model
from peft.utils import set_peft_model_state_dict

# ---- Paths / config ----
MODEL_ID = "Efficient-Large-Model/SANA-Video_2B_480p_diffusers"
LORA_PATH = Path("lora_step_010000.pt")  # example checkpoint
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.0  # dropout only acts during training (0.1 was used there)
LORA_TARGET_MODULES = ["proj_out", "to_q", "to_v", "to_k", "linear_2", "linear_1", "linear"]

PROMPT = (
    "A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse "
    "in a colorful house, Tom and Jerry style, bold outlines, limited color palette, "
    "exaggerated expressions, smooth character motion."
)

# ---- Load base pipeline ----
pipe = SanaVideoPipeline.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
)
pipe.vae.to(DEVICE, dtype=torch.float32)  # VAE in fp32 is more stable

# ---- Wrap transformer with LoRA ----
lora_cfg = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    target_modules=LORA_TARGET_MODULES,
)
pipe.transformer = get_peft_model(pipe.transformer, lora_cfg)

# ---- Load LoRA weights (handles torch.compile / DataParallel prefixes) ----
state = torch.load(LORA_PATH, map_location="cpu")
fixed_state = {}
for k, v in state.items():
    # torch.compile prepends "_orig_mod."; DataParallel/DDP prepends "module."
    if k.startswith("_orig_mod."):
        k = k[len("_orig_mod."):]
    if k.startswith("module."):
        k = k[len("module."):]
    fixed_state[k] = v
set_peft_model_state_dict(pipe.transformer, fixed_state)

# ---- Move to device ----
pipe.to(DEVICE)
pipe.transformer.to(DEVICE, dtype=DTYPE)
pipe.text_encoder.to(DEVICE, dtype=DTYPE)

# ---- Inference ----
with torch.no_grad():
    out = pipe(
        prompt=[PROMPT],
        num_inference_steps=50,
        guidance_scale=4.0,
        height=224,
        width=224,
        use_resolution_binning=False,
    )

# Output attribute name differs across pipeline versions, so check both.
video_frames = out.frames[0] if hasattr(out, "frames") else out.videos[0]
export_to_video(video_frames, "tomjerry_lora_sample.mp4", fps=16)
print("Saved to tomjerry_lora_sample.mp4")
Training setup
- Objective: Rectified Flow / Flow-Matching on latents.
- Latents: pre-encoded video latents from SANA-Video VAE.
- Frozen modules: VAE + text encoder.
- Trainable: transformer via LoRA.
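For reference, one common way to write a rectified-flow objective on latents is given below. The card does not state the time convention, timestep sampling, or loss weighting used for this LoRA, so the notation is an assumption: x_0 is the clean video latent, \epsilon is Gaussian noise, c is the text conditioning, and v_\theta is the LoRA-adapted transformer.

```latex
x_t = (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{x_0,\,\epsilon,\,t}\,
  \bigl\| v_\theta(x_t, t, c) - (\epsilon - x_0) \bigr\|_2^2
```

In this setup only the LoRA parameters within \theta receive gradients. The base transformer weights, the VAE, and the text encoder stay fixed.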
LoRA config
- r = 16
- alpha = 32
- dropout = 0.1
- Target modules: proj_out, to_q, to_k, to_v, linear_1, linear_2, linear.
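To make these numbers concrete, the sketch below computes the standard LoRA scale factor (alpha / r) and the extra trainable parameters for one adapted linear layer. The 2048-wide layer is a hypothetical example, since the card does not list the real layer sizes.

```python
def lora_scaling(r: int, alpha: int) -> float:
    """Standard LoRA scale factor applied to the low-rank update B @ A."""
    return alpha / r


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Extra trainable parameters for one adapted linear layer.

    A has shape (r, in_features) and B has shape (out_features, r).
    """
    return r * (in_features + out_features)


scale = lora_scaling(r=16, alpha=32)
# Hypothetical 2048 -> 2048 projection; real layer sizes are not given in this card.
extra = lora_param_count(2048, 2048, r=16)
```

With r = 16 and alpha = 32, the low-rank update is scaled by 2.0 before being added to the frozen layer's output.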
Optimization
- Optimizer: AdamW / AdamW8bit (bitsandbytes).
- Learning rate: 2e-4.
- Batch size: 8 videos (5s @ 16 FPS, 224×224).
- LR schedule: linear warmup then cosine decay.
- Precision: bf16 (torch.set_float32_matmul_precision("medium")).
- Training length: up to ~10k effective steps for this checkpoint.
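A linear-warmup-then-cosine schedule like the one listed above can be sketched in plain Python. The warmup length (500 steps) and the zero learning-rate floor are assumptions for illustration; only the 2e-4 peak and the ~10k step budget come from this card.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float = 2e-4, min_lr: float = 0.0) -> float:
    """Learning rate with linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Hypothetical warmup of 500 steps over a 10k-step run.
schedule = [lr_at_step(s, total_steps=10_000, warmup_steps=500)
            for s in (0, 499, 5_250, 10_000)]
```

The sampled values rise to the peak by the end of warmup, reach half the peak midway through the decay phase, and fall to the floor at the final step.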
Limitations & notes
- This LoRA was trained on a small, stylized dataset, so it is strongly biased toward a cat–mouse slapstick setup. Out-of-domain prompts may produce strange results.
- Not an official Tom and Jerry model; this is a research experiment exploring class-style adaptation and residual text control on SANA-Video.
- Use responsibly and respect relevant copyright when using or redistributing generated content.






