MotionVQVAE

A Multi-Group Vector Quantized VAE (MG-VQVAE) trained on the MotionMillion dataset for motion tokenization and reconstruction.

โš ๏ธ License Notice: This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.

📋 Body Model: This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.

Model Description

MotionVQVAE learns to compress human motion sequences into discrete tokens using a Multi-Group Vector Quantization approach. The model can:

  • Tokenize motion sequences into discrete tokens for downstream generation tasks
  • Reconstruct motions from tokens with high fidelity
  • Compress variable-length motions with 4× temporal downsampling

Multi-Group VQ Architecture

Instead of a single codebook, MG-VQVAE uses 64 parallel groups, each with its own 512-code codebook. This provides:

  • Effective codebook size: $512^{64} \approx 2.47 \times 10^{173}$ combinations
  • Fine-grained control over different motion aspects
  • Better reconstruction quality through distributed quantization
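The lookup step of multi-group quantization can be sketched in a few lines of NumPy. This is a minimal illustration of the idea only, not the model's actual implementation (which also uses EMA codebook updates and a straight-through gradient estimator during training); the random codebooks and the `quantize` helper are assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions match this model's config: 64 groups, 512 codes each, 8-dim codes
NUM_GROUPS, CODES, CODE_DIM = 64, 512, 8
codebooks = rng.standard_normal((NUM_GROUPS, CODES, CODE_DIM)).astype(np.float32)

def quantize(latents):
    """Per-group nearest-code lookup.

    latents: (T, NUM_GROUPS * CODE_DIM) encoder output; each 8-dim slice
    is quantized against its own group's codebook.
    """
    T = latents.shape[0]
    groups = latents.reshape(T, NUM_GROUPS, CODE_DIM)                 # (T, 64, 8)
    # squared distance from every latent slice to every code in its group
    dists = ((groups[:, :, None, :] - codebooks[None]) ** 2).sum(-1)  # (T, 64, 512)
    indices = dists.argmin(-1)                                        # (T, 64) token ids
    quantized = codebooks[np.arange(NUM_GROUPS), indices]             # (T, 64, 8)
    return indices, quantized.reshape(T, -1)

indices, quantized = quantize(rng.standard_normal((30, 512)).astype(np.float32))
```

Because each of the 64 groups picks its code independently, one timestep is described by 64 small tokens rather than one token from an astronomically large codebook.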

Usage

Installation

```bash
pip install torch huggingface_hub numpy
```

Download the Model Code

Download `motion_vqvae_hf.py` from this repository and place it in your project.

Quick Start

```python
from motion_vqvae_hf import MotionVQVAE
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionVQVAE.from_pretrained("khania/motion-vqvae")

# Prepare motion data (272-dim absolute root format)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion

# Encode motion to tokens
tokens = model.encode(motion)  # Returns token indices for each group
print(f"Tokens shape: {tokens.shape}")  # (64, 1, 30) - 64 groups, batch=1, T/4 timesteps

# Decode tokens back to motion
motion_recon = model.decode(tokens)
print(f"Reconstructed motion shape: {motion_recon.shape}")  # (1, 120, 272)

# Full forward pass (encode + decode)
motion_recon, tokens = model(motion)
```

Batch Processing

```python
# Process multiple motions
motions = [
    np.random.randn(100, 272).astype(np.float32),
    np.random.randn(150, 272).astype(np.float32),
    np.random.randn(80, 272).astype(np.float32),
]

# Encode batch (will pad to max length)
tokens = model.encode_batch(motions)

# Decode batch
motions_recon = model.decode_batch(tokens)
```

Access Codebook

```python
# Get quantized embeddings for analysis
embeddings = model.get_codebook_embeddings()
print(f"Codebook shape: {embeddings.shape}")  # (64, 512, 8) - 64 groups, 512 codes, 8-dim each
```

Model Architecture

| Component | Details |
|---|---|
| Encoder | 1D CNN with residual blocks |
| Decoder | 1D CNN with residual blocks |
| Width | 1024 |
| Depth | 3 residual blocks per stage |
| Downsampling | 4× (stride 2, 2 stages) |
| Quantizer | Multi-Group VQ with EMA updates |
| Groups | 64 |
| Codebook Size | 512 codes per group |
| Code Dimension | 8 per group (512 total) |
| Total Parameters | ~73M |
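A quick back-of-envelope calculation shows what these numbers mean for compression. Assuming each token is stored at its information-theoretic cost of log₂(512) = 9 bits (an assumption for illustration; on-disk formats may differ):

```python
import math

frames, feat_dim = 120, 272      # a 120-frame motion clip, 272-dim features
groups, codebook_size = 64, 512
temporal_down = 4

raw_bytes = frames * feat_dim * 4                             # float32 features
token_steps = frames // temporal_down                         # 30 token timesteps
token_bits = token_steps * groups * math.log2(codebook_size)  # 9 bits per token
token_bytes = token_bits / 8

print(raw_bytes, token_bytes, raw_bytes / token_bytes)  # 130560 2160.0 ~60x
```

So even with 64 tokens per timestep, the discrete representation is roughly 60× smaller than the raw float32 features.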

Motion Format

The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).

SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:

  • Use the SMPL skeleton with 22 joints
  • Follow the SMPL joint ordering
  • Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

Feature Dimensions

| Dimensions | Description |
|---|---|
| [0:2] | Root XZ velocities |
| [2:8] | Absolute heading rotation (6D representation) |
| [8:74] | Local joint positions (22 joints × 3) |
| [74:140] | Local joint velocities (22 joints × 3) |
| [140:272] | Joint rotations in 6D (22 joints × 6) |

The model automatically normalizes input motions using the bundled mean/std statistics.
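The layout above can be unpacked into per-component arrays with plain slicing; this is a small sketch using a random placeholder clip (the variable names are illustrative, not part of the model's API):

```python
import numpy as np

motion = np.random.randn(120, 272).astype(np.float32)  # placeholder clip

# Slice boundaries follow the feature-layout table above
root_xz_vel  = motion[:, 0:2]                         # (120, 2)
root_heading = motion[:, 2:8]                         # (120, 6)  6D rotation
joint_pos    = motion[:, 8:74].reshape(-1, 22, 3)     # (120, 22, 3)
joint_vel    = motion[:, 74:140].reshape(-1, 22, 3)   # (120, 22, 3)
joint_rot    = motion[:, 140:272].reshape(-1, 22, 6)  # (120, 22, 6)
```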

Training Details

| Parameter | Value |
|---|---|
| Dataset | MotionMillion |
| Batch Size | 128 |
| Training Iterations | 300,000 |
| Learning Rate | 2e-4 |
| LR Schedule | Step decay at 50K, 400K |
| Loss Function | Smooth L1 + commitment |
| Commitment Weight | 0.02 |
| Window Size | 64 frames |

Loss Weights

| Component | Weight |
|---|---|
| Root XZ Velocity | 3.0 |
| Root Rotation | 1.5 |
| Joint Position | 0.1 |
| Joint Velocity | 0.5 |
| Joint Rotation | 5.0 |
| Velocity Temporal | 0.5 |
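A plausible way these weights combine with the smooth L1 objective is sketched below. This is an assumed reconstruction, not the repository's actual training code: the slice boundaries come from the feature table above, and the temporal-velocity term is assumed to penalize frame-to-frame differences.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Huber-style smooth L1, averaged elementwise."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

# Per-component weights from the table above
WEIGHTS = {
    "root_xz_vel": 3.0, "root_rot": 1.5, "joint_pos": 0.1,
    "joint_vel": 0.5, "joint_rot": 5.0, "vel_temporal": 0.5,
}

# Feature slices of the 272-dim representation
SLICES = {
    "root_xz_vel": (0, 2), "root_rot": (2, 8),
    "joint_pos": (8, 74), "joint_vel": (74, 140), "joint_rot": (140, 272),
}

def reconstruction_loss(pred, target):
    """Weighted sum of per-component smooth L1 terms."""
    total = 0.0
    for name, (a, b) in SLICES.items():
        total += WEIGHTS[name] * smooth_l1(pred[:, a:b], target[:, a:b])
    # temporal term: match frame-to-frame deltas
    total += WEIGHTS["vel_temporal"] * smooth_l1(np.diff(pred, axis=0),
                                                 np.diff(target, axis=0))
    return total

x = np.random.randn(120, 272).astype(np.float32)
```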

Performance

Final evaluation metrics at 300K iterations:

| Metric | Value |
|---|---|
| Reconstruction Loss | 0.0095 |
| Commitment Loss | 0.0255 |
| Perplexity | 508.57 |
| Codebook Utilization | 100% |
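For context, a perplexity of 508.57 against a 512-code codebook means token usage is nearly uniform. The standard way to compute these two metrics from observed token indices (assumed here to match the definitions used in the report) is:

```python
import numpy as np

def codebook_stats(tokens, codebook_size=512):
    """Perplexity = exp(entropy of empirical code usage); utilization =
    fraction of codes used at least once. tokens: flat array of indices."""
    counts = np.bincount(tokens, minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    perplexity = np.exp(-(nonzero * np.log(nonzero)).sum())
    utilization = (counts > 0).mean()
    return perplexity, utilization

# Near-uniform usage drives perplexity toward the codebook size
tokens = np.random.default_rng(0).integers(0, 512, size=100_000)
ppl, util = codebook_stats(tokens)
```

A collapsed codebook (many dead codes) would instead show low perplexity and utilization well below 100%.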

Per-Component Reconstruction Loss (Eval)

| Component | Loss |
|---|---|
| Root XZ Velocity | 0.00107 |
| Root Rotation | 0.00029 |
| Joint Position | 0.00851 |
| Joint Velocity | 0.02301 |
| Joint Rotation | 0.00383 |

Files in This Repository

| File | Size | Description |
|---|---|---|
| config.json | ~300 B | Model configuration |
| pytorch_model.bin | ~280 MB | Model weights (~73M parameters) |
| mean.npy | 1.2 KB | Motion normalization mean (272,) |
| std.npy | 1.2 KB | Motion normalization std (272,) |
| motion_vqvae_hf.py | ~20 KB | Model implementation |
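The model normalizes inputs internally, but if you preprocess motions yourself, the bundled `mean.npy` and `std.npy` apply as standard z-score statistics. A minimal sketch (the placeholder arrays stand in for the real files, and the epsilon guard is an assumption, not taken from the repository code):

```python
import numpy as np

# In practice, load the bundled statistics from this repository:
#   mean, std = np.load("mean.npy"), np.load("std.npy")
mean = np.zeros(272, dtype=np.float32)  # placeholder stats for illustration
std = np.ones(272, dtype=np.float32)

def normalize(motion):
    # eps guards against near-zero-variance dimensions (assumed safeguard)
    return (motion - mean) / (std + 1e-8)

def denormalize(motion_norm):
    return motion_norm * (std + 1e-8) + mean

motion = np.random.randn(120, 272).astype(np.float32)
roundtrip = denormalize(normalize(motion))
```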

Use Cases

  • Motion Generation: Tokenize motions for autoregressive or diffusion-based generation
  • Motion Compression: Efficiently store motion data as discrete tokens
  • Motion Editing: Manipulate tokens for motion modification
  • Downstream Tasks: Use tokens as input for text-to-motion models

Limitations

  • Trained on English text descriptions only (for associated metadata)
  • Motion format is specific to HumanML3D-style 272-dim representation
  • 4× temporal downsampling may lose very fine-grained details
  • Best performance on motions similar to training distribution (daily activities, sports, etc.)

Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```

License

CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
