# MotionVQVAE

A Multi-Group Vector Quantized VAE (MG-VQVAE) trained on the MotionMillion dataset for motion tokenization and reconstruction.

⚠️ **License Notice:** This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.

**Body Model:** This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.
## Model Description
MotionVQVAE learns to compress human motion sequences into discrete tokens using a Multi-Group Vector Quantization approach. The model can:
- Tokenize motion sequences into discrete tokens for downstream generation tasks
- Reconstruct motions from tokens with high fidelity
- Compress variable-length motions with 4× temporal downsampling
### Multi-Group VQ Architecture
Instead of a single codebook, MG-VQVAE uses 64 parallel groups, each with its own 512-code codebook. This provides:
- Effective codebook size: $512^{64} \approx 2.47 \times 10^{173}$ combinations
- Fine-grained control over different motion aspects
- Better reconstruction quality through distributed quantization
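To make the grouped quantization concrete, here is a NumPy sketch of nearest-neighbor lookup per group, using toy sizes (4 groups of 16 two-dimensional codes instead of the model's 64 groups of 512 eight-dimensional codes). This is a conceptual illustration, not the repository's implementation, and the group-major latent layout is an assumption:

```python
import numpy as np

# Toy multi-group VQ: each group quantizes its own slice of the latent
# against its own codebook. Real model: 64 groups x 512 codes x 8 dims.
rng = np.random.default_rng(0)
n_groups, n_codes, code_dim = 4, 16, 2
codebooks = rng.normal(size=(n_groups, n_codes, code_dim))

def mg_quantize(z):
    """z: (T, n_groups * code_dim) latent; returns (tokens, z_q)."""
    T = z.shape[0]
    z = z.reshape(T, n_groups, code_dim)          # split latent into groups
    tokens = np.empty((n_groups, T), dtype=np.int64)
    z_q = np.empty_like(z)
    for g in range(n_groups):
        # squared distance from each timestep to every code in group g
        d = ((z[:, g, None, :] - codebooks[g][None]) ** 2).sum(-1)  # (T, n_codes)
        tokens[g] = d.argmin(-1)                   # nearest code index
        z_q[:, g] = codebooks[g][tokens[g]]        # replace with code vector
    return tokens, z_q.reshape(T, n_groups * code_dim)

z = rng.normal(size=(30, n_groups * code_dim))     # e.g. 120 frames / 4 latent steps
tokens, z_q = mg_quantize(z)
print(tokens.shape, z_q.shape)  # (4, 30) (30, 8)
```

Because each group independently picks one of its codes per timestep, the joint vocabulary multiplies across groups, which is where the $512^{64}$ figure above comes from.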
## Usage

### Installation

```bash
pip install torch huggingface_hub numpy
```
### Download the Model Code
Download motion_vqvae_hf.py from this repository or copy it to your project.
### Quick Start

```python
from motion_vqvae_hf import MotionVQVAE
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionVQVAE.from_pretrained("khania/motion-vqvae")

# Prepare motion data (272-dim absolute root format)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion

# Encode motion to tokens
tokens = model.encode(motion)  # Returns token indices for each group
print(f"Tokens shape: {tokens.shape}")  # (64, 1, 30) - 64 groups, batch=1, T/4 timesteps

# Decode tokens back to motion
motion_recon = model.decode(tokens)
print(f"Reconstructed motion shape: {motion_recon.shape}")  # (1, 120, 272)

# Full forward pass (encode + decode)
motion_recon, tokens = model(motion)
```
### Batch Processing

```python
# Process multiple motions
motions = [
    np.random.randn(100, 272).astype(np.float32),
    np.random.randn(150, 272).astype(np.float32),
    np.random.randn(80, 272).astype(np.float32),
]

# Encode batch (will pad to max length)
tokens = model.encode_batch(motions)

# Decode batch
motions_recon = model.decode_batch(tokens)
```
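`encode_batch` is described as padding to the maximum length. A minimal sketch of that padding step (the actual implementation, including any masking or length handling, may differ):

```python
import numpy as np

# Hypothetical pad-to-max batching, zero-filling shorter sequences and
# remembering original lengths so padding can be stripped after decoding.
def pad_to_max(motions):
    max_len = max(m.shape[0] for m in motions)
    batch = np.zeros((len(motions), max_len, motions[0].shape[1]), dtype=np.float32)
    lengths = []
    for i, m in enumerate(motions):
        batch[i, : m.shape[0]] = m
        lengths.append(m.shape[0])
    return batch, lengths

motions = [np.random.randn(100, 272).astype(np.float32),
           np.random.randn(150, 272).astype(np.float32)]
batch, lengths = pad_to_max(motions)
print(batch.shape, lengths)  # (2, 150, 272) [100, 150]
```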
### Access Codebook

```python
# Get quantized embeddings for analysis
embeddings = model.get_codebook_embeddings()
print(f"Codebook shape: {embeddings.shape}")  # (64, 512, 8) - 64 groups, 512 codes, 8-dim each
```
## Model Architecture
| Component | Details |
|---|---|
| Encoder | 1D CNN with residual blocks |
| Decoder | 1D CNN with residual blocks |
| Width | 1024 |
| Depth | 3 residual blocks per stage |
| Downsampling | 4× (stride 2, 2 stages) |
| Quantizer | Multi-Group VQ with EMA updates |
| Groups | 64 |
| Codebook Size | 512 codes per group |
| Code Dimension | 8 per group (512 total) |
| Total Parameters | ~73M |
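The 4× downsampling follows from the two stride-2 stages. A shape-only PyTorch sketch of that path (kernel sizes, activations, and the residual blocks are assumptions, not the repository's exact layers):

```python
import torch
import torch.nn as nn

# Illustrates only the temporal shapes: two stride-2 convolutions halve the
# frame count twice, and a final projection yields 512 channels
# (64 groups x 8-dim codes). Details differ from the real encoder.
encoder = nn.Sequential(
    nn.Conv1d(272, 1024, kernel_size=3, stride=2, padding=1),   # T -> T/2
    nn.ReLU(),
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # T/2 -> T/4
    nn.ReLU(),
    nn.Conv1d(1024, 512, kernel_size=3, padding=1),             # project to 64 x 8
)

x = torch.randn(1, 272, 120)  # (batch, features, frames)
z = encoder(x)
print(z.shape)  # torch.Size([1, 512, 30])
```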
## Motion Format
The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).
### SMPL Body Model Requirement
This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:
- Use the SMPL skeleton with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation
If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.
### Feature Dimensions

| Dimensions | Description |
|---|---|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |
The model automatically normalizes input motions using the bundled mean/std statistics.
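Putting the layout together, the components can be sliced out of a 272-dim feature array, and normalization is a per-dimension z-score using the bundled statistics. Placeholder stats are used below instead of loading `mean.npy`/`std.npy`, and the z-score form is an assumption about what "normalizes" means here:

```python
import numpy as np

motion = np.random.randn(120, 272).astype(np.float32)  # (frames, features)

# Slice per the feature-dimension table
root_vel   = motion[:, 0:2]                             # root XZ velocities
heading    = motion[:, 2:8]                             # heading rotation (6D)
positions  = motion[:, 8:74].reshape(-1, 22, 3)         # local joint positions
velocities = motion[:, 74:140].reshape(-1, 22, 3)       # local joint velocities
rotations  = motion[:, 140:272].reshape(-1, 22, 6)      # joint rotations (6D)

# In practice: mean = np.load("mean.npy"); std = np.load("std.npy")
mean = np.zeros(272, dtype=np.float32)  # placeholder stats for this sketch
std = np.ones(272, dtype=np.float32)
motion_norm = (motion - mean) / std
print(positions.shape, rotations.shape)  # (120, 22, 3) (120, 22, 6)
```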
## Training Details
| Parameter | Value |
|---|---|
| Dataset | MotionMillion |
| Batch Size | 128 |
| Training Iterations | 300,000 |
| Learning Rate | 2e-4 |
| LR Schedule | Step decay at 50K, 400K |
| Loss Function | L1 Smooth + Commitment |
| Commitment Weight | 0.02 |
| Window Size | 64 frames |
### Loss Weights
| Component | Weight |
|---|---|
| Root XZ Velocity | 3.0 |
| Root Rotation | 1.5 |
| Joint Position | 0.1 |
| Joint Velocity | 0.5 |
| Joint Rotation | 5.0 |
| Velocity Temporal | 0.5 |
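As an illustration, these weights combine per-component smooth-L1 terms over the feature slices defined above. The exact reduction used in training is an assumption, and the temporal-velocity term (a smoothness penalty across frames) is omitted here:

```python
import numpy as np

# Per-component weights from the table (temporal-velocity term omitted)
WEIGHTS = {"root_xz": 3.0, "root_rot": 1.5, "pos": 0.1, "vel": 0.5, "rot": 5.0}
SLICES  = {"root_xz": slice(0, 2), "root_rot": slice(2, 8),
           "pos": slice(8, 74), "vel": slice(74, 140), "rot": slice(140, 272)}

def smooth_l1(x, y, beta=1.0):
    # Quadratic near zero, linear beyond beta (Huber-style)
    d = np.abs(x - y)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

def recon_loss(pred, target):
    return sum(w * smooth_l1(pred[:, SLICES[k]], target[:, SLICES[k]])
               for k, w in WEIGHTS.items())

pred = np.random.randn(64, 272).astype(np.float32)
loss = recon_loss(pred, pred)  # identical inputs -> zero loss
print(loss)  # 0.0
```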
## Performance
Final evaluation metrics at 300K iterations:
| Metric | Value |
|---|---|
| Reconstruction Loss | 0.0095 |
| Commitment Loss | 0.0255 |
| Perplexity | 508.57 |
| Codebook Utilization | 100% |
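Perplexity and utilization are standard functions of the token-usage histogram: a uniform distribution over 512 codes gives perplexity 512, so 508.57 with 100% utilization indicates near-uniform codebook usage. A sketch of how these are typically computed (not necessarily the evaluation code used here):

```python
import numpy as np

def codebook_stats(tokens, n_codes=512):
    # Histogram of code usage across all tokens
    counts = np.bincount(tokens.ravel(), minlength=n_codes)
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    perplexity = np.exp(-(nz * np.log(nz)).sum())  # exp of usage entropy
    utilization = (counts > 0).mean()              # fraction of codes ever used
    return perplexity, utilization

tokens = np.random.randint(0, 512, size=(64, 1, 30))
ppl, util = codebook_stats(tokens)
print(round(ppl, 1), util)
```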
### Per-Component Reconstruction Loss (Eval)
| Component | Loss |
|---|---|
| Root XZ Velocity | 0.00107 |
| Root Rotation | 0.00029 |
| Joint Position | 0.00851 |
| Joint Velocity | 0.02301 |
| Joint Rotation | 0.00383 |
## Files in This Repository

| File | Size | Description |
|---|---|---|
| `config.json` | ~300 B | Model configuration |
| `pytorch_model.bin` | ~280 MB | Model weights (~73M parameters) |
| `mean.npy` | 1.2 KB | Motion normalization mean `(272,)` |
| `std.npy` | 1.2 KB | Motion normalization std `(272,)` |
| `motion_vqvae_hf.py` | ~20 KB | Model implementation |
## Use Cases
- Motion Generation: Tokenize motions for autoregressive or diffusion-based generation
- Motion Compression: Efficiently store motion data as discrete tokens
- Motion Editing: Manipulate tokens for motion modification
- Downstream Tasks: Use tokens as input for text-to-motion models
## Limitations
- Trained on English text descriptions only (for associated metadata)
- Motion format is specific to HumanML3D-style 272-dim representation
- 4ร temporal downsampling may lose very fine-grained details
- Best performance on motions similar to training distribution (daily activities, sports, etc.)
## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```
## License
CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)