SANA-Video Tom & Jerry LoRA (rank 16)

  • GitHub: Sana-Simplified
  • Weights & Biases: Tom & Jerry Runs
  • Project site: Tom & Jerry Research Log

LoRA adapter for SANA-Video 2B fine-tuned on a small dataset of Tom & Jerry style clips at 224×224 resolution. The goal is to adapt the base 480p SANA-Video model toward classic 2D slapstick cartoon style while keeping its text-conditioning ability.

  • Base model: Efficient-Large-Model/SANA-Video_2B_480p_diffusers
  • Task: class-style video generation ("Tom & Jerry" slapstick cartoon)
  • Backbone: SANA-Video transformer only (VAE + text encoder frozen)
  • Fine-tuning: LoRA on attention / MLP layers
  • Resolution: 224×224 crops (clips ~5 seconds @ 16 FPS)

Checkpoints

Checkpoints were saved at several points during training and evaluated by generating the same prompt at each of those steps.

| Checkpoint step | Filename             | Preview    |
|-----------------|----------------------|------------|
| base (0)        | base model           | Base       |
| 100             | lora_step_000100.pt  | Step 100   |
| 1,000           | lora_step_001000.pt  | Step 1000  |
| 2,000           | lora_step_002000.pt  | Step 2000  |
| 5,000           | lora_step_005000.pt  | Step 5000  |
| 7,500           | lora_step_007500.pt  | Step 7500  |
| 10,000          | lora_step_010000.pt  | Step 10000 |

Each of these checkpoints has a short GIF preview generated from the same prompt and seed to visualize how style evolves over training; a minimal loop for reproducing such a sweep is sketched after the Usage example below.

Usage (Diffusers + PEFT)

This repo exposes raw LoRA checkpoints (lora_step_*.pt).
Here is a minimal example that:

  • Loads the base SANA-Video 2B 480p pipeline.
  • Wraps the transformer with a LoRA adapter (same config as training).
  • Loads a single LoRA checkpoint and runs one prompt.
from pathlib import Path
import torch
from diffusers import SanaVideoPipeline
from diffusers.utils import export_to_video
from peft import LoraConfig, get_peft_model
from peft.utils import set_peft_model_state_dict

# ---- Paths / config ----
MODEL_ID  = "Efficient-Large-Model/SANA-Video_2B_480p_diffusers"
LORA_PATH = Path("lora_step_010000.pt")  # example checkpoint

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE  = torch.bfloat16 if torch.cuda.is_available() else torch.float32

LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.0  # dropout is inactive at inference (training used 0.1)
LORA_TARGET_MODULES = ["proj_out", "to_q", "to_v", "to_k", "linear_2", "linear_1", "linear"]

PROMPT = (
    "A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse "
    "in a colorful house, Tom and Jerry style, bold outlines, limited color palette, "
    "exaggerated expressions, smooth character motion."
)

# ---- Load base pipeline ----
pipe = SanaVideoPipeline.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
)

pipe.vae.to(DEVICE, dtype=torch.float32)  # VAE in fp32 is more stable

# ---- Wrap transformer with LoRA ----
lora_cfg = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    target_modules=LORA_TARGET_MODULES,
)
pipe.transformer = get_peft_model(pipe.transformer, lora_cfg)

# ---- Load LoRA weights (handles torch.compile prefixes) ----
state = torch.load(LORA_PATH, map_location="cpu")
fixed_state = {}
for k, v in state.items():
    if k.startswith("_orig_mod."):
        k = k[len("_orig_mod."):]
    if k.startswith("module."):
        k = k[len("module."):]
    fixed_state[k] = v

set_peft_model_state_dict(pipe.transformer, fixed_state)

# ---- Move to device ----
pipe.to(DEVICE)
pipe.transformer.to(DEVICE, dtype=DTYPE)
pipe.text_encoder.to(DEVICE, dtype=DTYPE)

# ---- Inference ----
with torch.no_grad():
    out = pipe(
        prompt=[PROMPT],
        num_inference_steps=50,
        guidance_scale=4.0,
        height=224,
        width=224,
        use_resolution_binning=False,
    )

video_frames = out.frames[0] if hasattr(out, "frames") else out.videos[0]
export_to_video(video_frames, "tomjerry_lora_sample.mp4", fps=16)
print("Saved to tomjerry_lora_sample.mp4")

Training setup

  • Objective: Rectified Flow / Flow-Matching on latents.
  • Latents: pre-encoded video latents from SANA-Video VAE.
  • Frozen modules: VAE + text encoder.
  • Trainable: transformer via LoRA.
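For reference, here is a minimal sketch of a rectified-flow / flow-matching loss on pre-encoded latents. It is illustrative only: model, latents, and text_emb are placeholders, the model call is schematic (the real transformer takes Diffusers-style keyword arguments), and the actual training code may sample or weight timesteps differently.

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, latents, text_emb):
    # latents: pre-encoded video latents from the frozen SANA-Video VAE
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device, dtype=latents.dtype)  # t ~ U[0, 1]
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))
    x_t = (1.0 - t_) * latents + t_ * noise   # linear interpolation between data and noise
    target = noise - latents                  # rectified-flow velocity target d(x_t)/dt
    pred = model(x_t, t, text_emb)            # schematic call to the LoRA-wrapped transformer
    return F.mse_loss(pred, target)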

LoRA config

  • r = 16
  • alpha = 32
  • dropout = 0.1
  • Target modules: proj_out, to_q, to_k, to_v, linear_1, linear_2, linear.

Optimization

  • Optimizer: AdamW / AdamW8bit (bitsandbytes).
  • Learning rate: 2e-4.
  • Batch size: 8 videos (5s @ 16 FPS, 224×224).
  • LR schedule: linear warmup then cosine decay.
  • Precision: bf16 (torch.set_float32_matmul_precision("medium")).
  • Training length: up to ~10k effective steps for the final (step 10,000) checkpoint.
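A sketch of the corresponding optimizer and LR schedule setup is below. It assumes bitsandbytes is installed (plain torch.optim.AdamW is a drop-in alternative), and the warmup length is an assumption; the exact schedule hyperparameters are not published.

import math
import torch
import bitsandbytes as bnb

torch.set_float32_matmul_precision("medium")  # as noted above

def build_optimizer_and_scheduler(peft_transformer, total_steps=10_000, warmup_steps=500):
    # After get_peft_model(), only the LoRA parameters require grad; everything else stays frozen.
    lora_params = [p for p in peft_transformer.parameters() if p.requires_grad]
    optimizer = bnb.optim.AdamW8bit(lora_params, lr=2e-4)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)                           # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))                # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler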

Limitations & notes

  • This LoRA is trained on a small, stylized dataset, so it strongly biases toward a cat–mouse slapstick setup. Out-of-domain prompts may produce strange results.
  • Not an official Tom and Jerry model; this is a research experiment exploring class-style adaptation and residual text control on SANA-Video.
  • Use responsibly and respect relevant copyright when using or redistributing generated content.