# MotionCLIP
A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.
⚠️ License Notice: This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.
📋 Body Model: This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.
## Model Description
MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:
- Retrieve the most relevant text for a motion (and vice versa)
- Classify motions in a zero-shot manner using text labels
- Compute similarity between motions and text descriptions
## Usage

### Installation

```bash
pip install torch transformers huggingface_hub numpy
```
### Download the Model Code

Download `motion_clip_hf.py` from this repository and place it in your project.
### Quick Start

```python
from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from the Hugging Face Hub)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity against a set of text labels
actions = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, actions)
print(f"Predicted action: {actions[similarity.argmax()]}")
```
### Text-to-Motion Retrieval

```python
# Find the most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")
```
### Motion-to-Text Retrieval

```python
# Find the most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3,
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")
```
### Zero-Shot Motion Classification

```python
# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving",
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")
```
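Note that the raw similarity score is not a calibrated probability. If you want confidences that sum to 1 across your label set, you can apply a temperature-scaled softmax to the similarity vector yourself. This is a minimal sketch, assuming `compute_similarity` returns a NumPy array of cosine similarities; the temperature value 0.07 is the training-time initialization, not the learned value:

```python
import numpy as np

def softmax_confidences(similarity, temperature=0.07):
    """Convert raw cosine similarities into a probability distribution.

    `temperature` mirrors the model's learnable temperature; 0.07 is only
    its initialization value, so treat the output as a rough confidence.
    """
    logits = np.asarray(similarity, dtype=np.float64) / temperature
    logits -= logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical scores for four candidate actions
sims = np.array([0.31, 0.18, 0.12, 0.05])
probs = softmax_confidences(sims)
print(probs.argmax())  # index of the most likely action
```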
## Model Architecture
| Component | Details |
|---|---|
| Motion Encoder | 8-layer Transformer |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Text Encoder | CLIP ViT-B/32 (fine-tuned) |
| Embedding Dimension | 512 |
| Max Sequence Length | 1024 frames |
## Motion Format
The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).
### SMPL Body Model Requirement
This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:
- Use the SMPL skeleton with 22 joints
- Follow the SMPL joint ordering
- Be converted to the 272-dimensional HumanML3D-style representation
If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.
### Feature Dimensions

| Dimensions | Description |
|---|---|
| `[0:2]` | Root XZ velocities |
| `[2:8]` | Absolute heading rotation (6D representation) |
| `[8:74]` | Local joint positions (22 joints × 3) |
| `[74:140]` | Local joint velocities (22 joints × 3) |
| `[140:272]` | Joint rotations in 6D (22 joints × 6) |
The model automatically normalizes input motions using the bundled mean/std statistics.
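The layout above can be sliced directly from a `(T, 272)` motion array. This sketch shows the slicing and what the internal normalization amounts to; the `mean.npy`/`std.npy` files ship with the repository, but the zero/one arrays below are stand-ins so the snippet runs without downloading them:

```python
import numpy as np

T = 120
motion = np.random.randn(T, 272).astype(np.float32)  # placeholder motion

# Slice the per-frame feature vector following the table above
root_vel   = motion[:, 0:2]                        # root XZ velocities
heading    = motion[:, 2:8]                        # absolute heading rotation (6D)
positions  = motion[:, 8:74].reshape(T, 22, 3)     # local joint positions
velocities = motion[:, 74:140].reshape(T, 22, 3)   # local joint velocities
rotations  = motion[:, 140:272].reshape(T, 22, 6)  # joint rotations (6D)

# encode_motion normalizes internally; done by hand it looks like this
mean = np.zeros(272, dtype=np.float32)  # stand-in for np.load("mean.npy")
std = np.ones(272, dtype=np.float32)    # stand-in for np.load("std.npy")
normalized = (motion - mean) / (std + 1e-8)
```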
## Training Details
| Parameter | Value |
|---|---|
| Dataset | MotionMillion (~884K training motions) |
| Batch Size | 256 |
| Training Iterations | 100,000 |
| Learning Rate (Motion Encoder) | 1e-4 |
| Learning Rate (Text Encoder) | 5e-5 |
| Loss Function | Symmetric InfoNCE |
| Temperature | Learnable (initialized at 0.07) |
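For reference, the symmetric InfoNCE objective named above can be sketched as follows. This is a generic CLIP-style formulation, not the repository's actual training code; the function name and the exact logit-scaling convention are assumptions:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(motion_emb, text_emb, log_temp):
    """Symmetric InfoNCE over a batch of paired (motion, text) embeddings.

    motion_emb, text_emb: (B, D) tensors where row i of each is a matched pair.
    log_temp: learnable scalar parameterizing the temperature.
    """
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) cosine similarities scaled by the learned temperature
    logits = motion_emb @ text_emb.t() / log_temp.exp()
    targets = torch.arange(motion_emb.size(0))
    loss_m2t = F.cross_entropy(logits, targets)      # motion -> text direction
    loss_t2m = F.cross_entropy(logits.t(), targets)  # text -> motion direction
    return 0.5 * (loss_m2t + loss_t2m)

# Temperature initialized at 0.07, matching the table above
log_temp = torch.tensor(0.07).log()
loss = symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512), log_temp)
```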
## Performance
Retrieval performance (R@k) on random test subsets:
| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|---|---|---|---|---|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |
Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.
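R@k is the fraction of queries whose ground-truth match appears among the top-k retrieved candidates. A minimal NumPy sketch of the metric, assuming a paired test set so that query *i*'s true match is candidate *i*:

```python
import numpy as np

def recall_at_k(similarity, k):
    """R@k for retrieval: row i is a query, column i is its true match.

    similarity: (N, N) matrix of query-vs-candidate scores.
    """
    n = similarity.shape[0]
    # Indices of the top-k candidates per query, highest score first
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()

# Toy check: a diagonal-dominant similarity matrix retrieves perfectly
sim = np.eye(4) + 0.01 * np.random.rand(4, 4)
print(recall_at_k(sim, 1))  # 1.0
```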
## Files in This Repository

| File | Size | Description |
|---|---|---|
| `config.json` | 239 B | Model configuration |
| `pytorch_model.bin` | 219 MB | Model weights |
| `mean.npy` | 1.2 KB | Motion normalization mean, shape (272,) |
| `std.npy` | 1.2 KB | Motion normalization std, shape (272,) |
## Limitations
- Trained on English text descriptions only
- Motion format is specific to HumanML3D-style 272-dim representation
- Best performance on motions similar to training distribution (daily activities, sports, etc.)
## Citation

```bibtex
@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}
```
## License
CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)
This model is released for research and non-commercial use only.
### Why Non-Commercial?
The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:
- Some datasets permit commercial use
- Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)
To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.
### What This Means
✅ Allowed:
- Academic research
- Personal projects
- Non-commercial applications
- Sharing and adapting with attribution
❌ Not Allowed:
- Commercial products or services
- Selling access to the model
- Using in revenue-generating applications
For commercial licensing inquiries, please contact the authors.