# MotionVQVAE

A Multi-Group Vector Quantized VAE (MG-VQVAE) trained on the MotionMillion dataset for motion tokenization and reconstruction.

⚠️ **License Notice:** This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.

**Body Model:** This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.
## Model Description
MotionVQVAE learns to compress human motion sequences into discrete tokens using a Multi-Group Vector Quantization approach. The model can:
- Tokenize motion sequences into discrete tokens for downstream generation tasks
- Reconstruct motions from tokens with high fidelity
- Compress variable-length motions with 4× temporal downsampling
### Multi-Group VQ Architecture
Instead of a single codebook, MG-VQVAE uses 64 parallel groups, each with its own 512-code codebook. This provides:
- Effective codebook size: $512^{64} \approx 2.47 \times 10^{173}$ combinations
- Fine-grained control over different motion aspects
- Better reconstruction quality through distributed quantization
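To make the grouped quantization concrete, here is a NumPy sketch of nearest-neighbor lookup per group, using toy sizes (4 groups of 16 two-dimensional codes instead of the model's 64 groups of 512 eight-dimensional codes). This is a conceptual illustration, not the repository's implementation, and the group-major latent layout is an assumption:

```python
import numpy as np

# Toy multi-group VQ: each group quantizes its own slice of the latent
# against its own codebook. Real model: 64 groups x 512 codes x 8 dims.
rng = np.random.default_rng(0)
n_groups, n_codes, code_dim = 4, 16, 2
codebooks = rng.normal(size=(n_groups, n_codes, code_dim))

def mg_quantize(z):
    """z: (T, n_groups * code_dim) latent; returns (tokens, z_q)."""
    T = z.shape[0]
    z = z.reshape(T, n_groups, code_dim)          # split latent into groups
    tokens = np.empty((n_groups, T), dtype=np.int64)
    z_q = np.empty_like(z)
    for g in range(n_groups):
        # squared distance from each timestep to every code in group g
        d = ((z[:, g, None, :] - codebooks[g][None]) ** 2).sum(-1)  # (T, n_codes)
        tokens[g] = d.argmin(-1)                   # nearest code index
        z_q[:, g] = codebooks[g][tokens[g]]        # replace with code vector
    return tokens, z_q.reshape(T, n_groups * code_dim)

z = rng.normal(size=(30, n_groups * code_dim))     # e.g. 120 frames / 4 latent steps
tokens, z_q = mg_quantize(z)
print(tokens.shape, z_q.shape)  # (4, 30) (30, 8)
```

Because each group independently picks one of its codes per timestep, the joint vocabulary multiplies across groups, which is where the $512^{64}$ figure above comes from.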
## Usage

### Installation

```bash
pip install torch huggingface_hub numpy
```
### Download the Model Code
Download motion_vqvae_hf.py from this repository or copy it to your project.
### Quick Start

```python
from motion_vqvae_hf import MotionVQVAE
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionVQVAE.from_pretrained("khania/motion-vqvae")

# Prepare motion data (272-dim absolute root format)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion

# Encode motion to tokens
tokens = model.encode(motion)  # Returns token indices for each group
print(f"Tokens shape: {tokens.shape}")  # (64, 1, 30) - 64 groups, batch=1, T/4 timesteps

# Decode tokens back to motion
motion_recon = model.decode(tokens)
print(f"Reconstructed motion shape: {motion_recon.shape}")  # (1, 120, 272)

# Full forward pass (encode + decode)
motion_recon, tokens = model(motion)
```
### Batch Processing

```python
# Process multiple motions
motions = [
    np.random.randn(100, 272).astype(np.float32),
    np.random.randn(150, 272).astype(np.float32),
    np.random.randn(80, 272).astype(np.float32),
]

# Encode batch (will pad to max length)
tokens = model.encode_batch(motions)

# Decode batch
motions_recon = model.decode_batch(tokens)
```
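`encode_batch` is described as padding to the maximum length. A minimal sketch of that padding step (the actual implementation, including any masking or length handling, may differ):

```python
import numpy as np

# Hypothetical pad-to-max batching, zero-filling shorter sequences and
# remembering original lengths so padding can be stripped after decoding.
def pad_to_max(motions):
    max_len = max(m.shape[0] for m in motions)
    batch = np.zeros((len(motions), max_len, motions[0].shape[1]), dtype=np.float32)
    lengths = []
    for i, m in enumerate(motions):
        batch[i, : m.shape[0]] = m
        lengths.append(m.shape[0])
    return batch, lengths

motions = [np.random.randn(100, 272).astype(np.float32),
           np.random.randn(150, 272).astype(np.float32)]
batch, lengths = pad_to_max(motions)
print(batch.shape, lengths)  # (2, 150, 272) [100, 150]
```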
### Access Codebook

```python
# Get quantized embeddings for analysis
embeddings = model.get_codebook_embeddings()
print(f"Codebook shape: {embeddings.shape}")  # (64, 512, 8) - 64 groups, 512 codes, 8-dim each
```
## Model Architecture
| Component | Details |
|---|---|
| Encoder | 1D CNN with residual blocks |
| Decoder | 1D CNN with residual blocks |
| Width | 1024 |
| Depth | 3 residual blocks per stage |
| Downsampling | 4× (stride 2, 2 stages) |
| Quantizer | Multi-Group VQ with EMA updates |
| Groups | 64 |
| Codebook Size | 512 codes per group |
| Code Dimension | 8 per group (512 total) |
| Total Parameters | ~73M |
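The 4× downsampling follows from the two stride-2 stages. A shape-only PyTorch sketch of that path (kernel sizes, activations, and the residual blocks are assumptions, not the repository's exact layers):

```python
import torch
import torch.nn as nn

# Illustrates only the temporal shapes: two stride-2 convolutions halve the
# frame count twice, and a final projection yields 512 channels
# (64 groups x 8-dim codes). Details differ from the real encoder.
encoder = nn.Sequential(
    nn.Conv1d(272, 1024, kernel_size=3, stride=2, padding=1),   # T -> T/2
    nn.ReLU(),
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # T/2 -> T/4
    nn.ReLU(),
    nn.Conv1d(1024, 512, kernel_size=3, padding=1),             # project to 64 x 8
)

x = torch.randn(1, 272, 120)  # (batch, features, frames)
z = encoder(x)
print(z.shape)  # torch.Size([1, 512, 30])
```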
## Motion Format
The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).
### SMPL Body Model Requirement
This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:
- Use the SMPL skeleton with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation
If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.
### Feature Dimensions

| Dimensions | Description |
|---|---|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |
The model automatically normalizes input motions using the bundled mean/std statistics.
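Putting the layout together, the components can be sliced out of a 272-dim feature array, and normalization is a per-dimension z-score using the bundled statistics. Placeholder stats are used below instead of loading `mean.npy`/`std.npy`, and the z-score form is an assumption about what "normalizes" means here:

```python
import numpy as np

motion = np.random.randn(120, 272).astype(np.float32)  # (frames, features)

# Slice per the feature-dimension table
root_vel   = motion[:, 0:2]                             # root XZ velocities
heading    = motion[:, 2:8]                             # heading rotation (6D)
positions  = motion[:, 8:74].reshape(-1, 22, 3)         # local joint positions
velocities = motion[:, 74:140].reshape(-1, 22, 3)       # local joint velocities
rotations  = motion[:, 140:272].reshape(-1, 22, 6)      # joint rotations (6D)

# In practice: mean = np.load("mean.npy"); std = np.load("std.npy")
mean = np.zeros(272, dtype=np.float32)  # placeholder stats for this sketch
std = np.ones(272, dtype=np.float32)
motion_norm = (motion - mean) / std
print(positions.shape, rotations.shape)  # (120, 22, 3) (120, 22, 6)
```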
## Training Details
| Parameter | Value |
|---|---|
| Dataset | MotionMillion |
| Batch Size | 128 |
| Training Iterations | 300,000 |
| Learning Rate | 2e-4 |
| LR Schedule | Step decay at 50K, 400K |
| Loss Function | L1 Smooth + Commitment |
| Commitment Weight | 0.02 |
| Window Size | 64 frames |
### Loss Weights
| Component | Weight |
|---|---|
| Root XZ Velocity | 3.0 |
| Root Rotation | 1.5 |
| Joint Position | 0.1 |
| Joint Velocity | 0.5 |
| Joint Rotation | 5.0 |
| Velocity Temporal | 0.5 |
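As an illustration, these weights combine per-component smooth-L1 terms over the feature slices defined above. The exact reduction used in training is an assumption, and the temporal-velocity term (a smoothness penalty across frames) is omitted here:

```python
import numpy as np

# Per-component weights from the table (temporal-velocity term omitted)
WEIGHTS = {"root_xz": 3.0, "root_rot": 1.5, "pos": 0.1, "vel": 0.5, "rot": 5.0}
SLICES  = {"root_xz": slice(0, 2), "root_rot": slice(2, 8),
           "pos": slice(8, 74), "vel": slice(74, 140), "rot": slice(140, 272)}

def smooth_l1(x, y, beta=1.0):
    # Quadratic near zero, linear beyond beta (Huber-style)
    d = np.abs(x - y)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta).mean()

def recon_loss(pred, target):
    return sum(w * smooth_l1(pred[:, SLICES[k]], target[:, SLICES[k]])
               for k, w in WEIGHTS.items())

pred = np.random.randn(64, 272).astype(np.float32)
loss = recon_loss(pred, pred)  # identical inputs -> zero loss
print(loss)  # 0.0
```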
## Performance
Final evaluation metrics at 300K iterations:
| Metric | Value |
|---|---|
| Reconstruction Loss | 0.0095 |
| Commitment Loss | 0.0255 |
| Perplexity | 508.57 |
| Codebook Utilization | 100% |
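Perplexity and utilization are standard functions of the token-usage histogram: a uniform distribution over 512 codes gives perplexity 512, so 508.57 with 100% utilization indicates near-uniform codebook usage. A sketch of how these are typically computed (not necessarily the evaluation code used here):

```python
import numpy as np

def codebook_stats(tokens, n_codes=512):
    # Histogram of code usage across all tokens
    counts = np.bincount(tokens.ravel(), minlength=n_codes)
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    perplexity = np.exp(-(nz * np.log(nz)).sum())  # exp of usage entropy
    utilization = (counts > 0).mean()              # fraction of codes ever used
    return perplexity, utilization

tokens = np.random.randint(0, 512, size=(64, 1, 30))
ppl, util = codebook_stats(tokens)
print(round(ppl, 1), util)
```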
### Per-Component Reconstruction Loss (Eval)
| Component | Loss |
|---|---|
| Root XZ Velocity | 0.00107 |
| Root Rotation | 0.00029 |
| Joint Position | 0.00851 |
| Joint Velocity | 0.02301 |
| Joint Rotation | 0.00383 |
## Files in This Repository

| File | Size | Description |
|---|---|---|
| `config.json` | ~300 B | Model configuration |
| `pytorch_model.bin` | ~280 MB | Model weights (~73M parameters) |
| `mean.npy` | 1.2 KB | Motion normalization mean `(272,)` |
| `std.npy` | 1.2 KB | Motion normalization std `(272,)` |
| `motion_vqvae_hf.py` | ~20 KB | Model implementation |
## Use Cases
- Motion Generation: Tokenize motions for autoregressive or diffusion-based generation
- Motion Compression: Efficiently store motion data as discrete tokens
- Motion Editing: Manipulate tokens for motion modification
- Downstream Tasks: Use tokens as input for text-to-motion models
## Limitations
- Trained on English text descriptions only (for associated metadata)
- Motion format is specific to HumanML3D-style 272-dim representation
- 4ร temporal downsampling may lose very fine-grained details
- Best performance on motions similar to training distribution (daily activities, sports, etc.)
## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```
## License
CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)