MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

⚠️ License Notice: This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.

📋 Body Model: This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.

Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

  • Retrieve the most relevant text for a motion (and vice versa)
  • Classify motions in a zero-shot manner using text labels
  • Compute similarity between motions and text descriptions

Usage

Installation

pip install torch transformers huggingface_hub numpy

Download the Model Code

Download motion_clip_hf.py from this repository and place it in your project.

Quick Start

from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity against a set of action labels
actions = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, actions)
predicted = actions[similarity.argmax()]
print(f"Predicted action: {predicted}")

Text-to-Motion Retrieval

# Find most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")

Motion-to-Text Retrieval

# Find most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")

Zero-Shot Motion Classification

# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving", 
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")

Model Architecture

| Component | Details |
|---|---|
| Motion Encoder | 8-layer Transformer |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Text Encoder | CLIP ViT-B/32 (fine-tuned) |
| Embedding Dimension | 512 |
| Max Sequence Length | 1024 frames |
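The motion-encoder side of the table can be sketched in PyTorch. This is an illustrative sketch only: the pooling strategy, positional embeddings, and class names here are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MotionEncoderSketch(nn.Module):
    """Sketch matching the table: 8-layer Transformer, hidden dim 768,
    12 heads, projected to a 512-dim embedding. Mean-pooling and learned
    positional embeddings are assumptions."""
    def __init__(self, motion_dim=272, hidden=768, heads=12,
                 layers=8, embed_dim=512, max_len=1024):
        super().__init__()
        self.input_proj = nn.Linear(motion_dim, hidden)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, hidden))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.output_proj = nn.Linear(hidden, embed_dim)

    def forward(self, motion):            # motion: (B, T, 272)
        T = motion.shape[1]
        x = self.input_proj(motion) + self.pos_emb[:T]
        x = self.encoder(x)
        x = x.mean(dim=1)                 # pool over time (assumption)
        return self.output_proj(x)        # (B, 512)

enc = MotionEncoderSketch()
out = enc(torch.randn(2, 16, 272))
print(out.shape)  # torch.Size([2, 512])
```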

Motion Format

The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).

SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:

  • Use the SMPL skeleton with 22 joints
  • Follow the SMPL joint ordering
  • Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

Feature Dimensions

| Dimensions | Description |
|---|---|
| [0:2] | Root XZ velocities |
| [2:8] | Absolute heading rotation (6D representation) |
| [8:74] | Local joint positions (22 joints × 3) |
| [74:140] | Local joint velocities (22 joints × 3) |
| [140:272] | Joint rotations in 6D (22 joints × 6) |
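The layout above can be unpacked with plain NumPy slicing. A minimal sketch (the function name is illustrative, not part of the released API):

```python
import numpy as np

def split_motion_features(frame):
    """Split one 272-dim feature frame into its components,
    following the layout table (22 SMPL joints)."""
    return {
        "root_xz_vel":  frame[0:2],                    # (2,)
        "heading_6d":   frame[2:8],                    # (6,)
        "joint_pos":    frame[8:74].reshape(22, 3),    # (22, 3)
        "joint_vel":    frame[74:140].reshape(22, 3),  # (22, 3)
        "joint_rot_6d": frame[140:272].reshape(22, 6), # (22, 6)
    }

parts = split_motion_features(np.zeros(272, dtype=np.float32))
print({k: v.shape for k, v in parts.items()})
```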

The model automatically normalizes input motions using the bundled mean/std statistics.
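Concretely, this is per-dimension z-scoring with the bundled mean.npy / std.npy (each of shape (272,)). The model does this internally; the sketch below only illustrates the operation, using placeholder statistics instead of the real files.

```python
import numpy as np

# Placeholder stand-ins for np.load("mean.npy") / np.load("std.npy")
mean = np.zeros(272, dtype=np.float32)
std = np.ones(272, dtype=np.float32)

def normalize(motion, eps=1e-8):
    """Per-dimension z-score; motion is a (T, 272) array,
    broadcasting the (272,) statistics over frames."""
    return (motion - mean) / (std + eps)

motion = np.random.randn(120, 272).astype(np.float32)
normed = normalize(motion)
print(normed.shape)  # (120, 272)
```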

Training Details

| Parameter | Value |
|---|---|
| Dataset | MotionMillion (~884K training motions) |
| Batch Size | 256 |
| Training Iterations | 100,000 |
| Learning Rate (Motion Encoder) | 1e-4 |
| Learning Rate (Text Encoder) | 5e-5 |
| Loss Function | Symmetric InfoNCE |
| Temperature | Learnable (initialized at 0.07) |
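The symmetric InfoNCE objective with a learnable temperature can be sketched as follows. This is a generic CLIP-style formulation under stated assumptions (cosine similarity, temperature stored as a log value), not the project's training code.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(motion_emb, text_emb, log_temp):
    """Symmetric InfoNCE over a batch of paired (motion, text)
    embeddings; diagonal entries are the positive pairs."""
    m = F.normalize(motion_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (m @ t.T) / log_temp.exp()   # (B, B), scaled by 1/temperature
    labels = torch.arange(m.shape[0])     # i-th motion matches i-th text
    loss_m2t = F.cross_entropy(logits, labels)
    loss_t2m = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_m2t + loss_t2m)

# Learnable log-temperature, initialized so exp(log_temp) = 0.07
log_temp = torch.log(torch.tensor(0.07)).requires_grad_()
loss = symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512), log_temp)
print(loss.item())
```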

Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|---|---|---|---|---|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.
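R@k is the fraction of queries whose true pair ranks in the top-k retrieved candidates. A minimal sketch of how it can be computed from a similarity matrix (assuming the i-th query's correct match is the i-th candidate):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: (N, N) query-to-candidate similarity, sim[i, i] being
    the true pair. Returns the fraction of queries whose correct
    match appears in the top-k."""
    ranks = (-sim).argsort(axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy check: identity similarity gives perfect R@1
sim = np.eye(5)
print(recall_at_k(sim, 1))  # 1.0
```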

Files in This Repository

| File | Size | Description |
|---|---|---|
| config.json | 239 B | Model configuration |
| pytorch_model.bin | 219 MB | Model weights |
| mean.npy | 1.2 KB | Motion normalization mean (272,) |
| std.npy | 1.2 KB | Motion normalization std (272,) |

Limitations

  • Trained on English text descriptions only
  • Motion format is specific to HumanML3D-style 272-dim representation
  • Best performance on motions similar to training distribution (daily activities, sports, etc.)

Citation

@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}

License

CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for research and non-commercial use only.

Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

  • Some datasets permit commercial use
  • Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.

What This Means

Allowed:

  • Academic research
  • Personal projects
  • Non-commercial applications
  • Sharing and adapting with attribution

Not Allowed:

  • Commercial products or services
  • Selling access to the model
  • Using in revenue-generating applications

For commercial licensing inquiries, please contact the authors.
