MotionCLIP

A Motion-Text CLIP model trained on the MotionMillion dataset for motion-text retrieval, zero-shot motion classification, and motion understanding.

⚠️ License Notice: This model is released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0). The training data includes datasets with mixed licensing terms, some of which restrict commercial use. This model is for research and non-commercial use only.

📋 Body Model: This model was trained on motion data using the SMPL body model (22 joints). Input motions must be in SMPL skeleton format.

Model Description

MotionCLIP learns a joint embedding space between human motion sequences and natural language descriptions. Given a motion sequence (272-dimensional features per frame) and text descriptions, the model can:

  • Retrieve the most relevant text for a motion (and vice versa)
  • Classify motions in a zero-shot manner using text labels
  • Compute similarity between motions and text descriptions

Usage

Installation

pip install torch transformers huggingface_hub numpy

Download the Model Code

Download motion_clip_hf.py from this repository and place it in your project.

Quick Start

from motion_clip_hf import MotionCLIP
import numpy as np

# Load model (auto-downloads from HuggingFace)
model = MotionCLIP.from_pretrained("khania/motion-clip")

# Encode text
text_emb = model.encode_text(["a person walks forward", "someone is running fast"])
print(f"Text embeddings: {text_emb.shape}")  # (2, 512)

# Encode motion (272-dim absolute root format, variable length)
motion = np.random.randn(120, 272).astype(np.float32)  # Replace with real motion
motion_emb = model.encode_motion(motion)
print(f"Motion embedding: {motion_emb.shape}")  # (512,)

# Compute similarity against a set of action labels
actions = ["walking", "running", "jumping", "sitting"]
similarity = model.compute_similarity(motion, actions)
predicted = actions[similarity.argmax()]
print(f"Predicted action: {predicted}")

Text-to-Motion Retrieval

# Find most similar motions for a text query
results = model.retrieve_motion(
    text="a person waves their hand",
    candidate_motions=[motion1, motion2, motion3],  # List of (T, 272) arrays
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: Motion {r['index']} (score: {r['score']:.4f})")

Motion-to-Text Retrieval

# Find most similar texts for a motion
results = model.retrieve_text(
    motion=my_motion,  # (T, 272) array
    candidate_texts=["walking", "running", "jumping", "waving", "sitting"],
    top_k=3
)
for r in results:
    print(f"#{r['rank']}: {r['text']} (score: {r['score']:.4f})")

Zero-Shot Motion Classification

# Define action categories
actions = ["walking", "running", "jumping", "sitting", "waving", 
           "kicking", "punching", "dancing", "stretching", "bowing"]

# Classify a motion
similarity = model.compute_similarity(motion, actions)
predicted_action = actions[similarity.argmax()]
confidence = similarity.max()
print(f"Predicted: {predicted_action} (confidence: {confidence:.3f})")

Model Architecture

| Component | Details |
|---|---|
| Motion Encoder | 8-layer Transformer |
| Hidden Dimension | 768 |
| Attention Heads | 12 |
| Text Encoder | CLIP ViT-B/32 (fine-tuned) |
| Embedding Dimension | 512 |
| Max Sequence Length | 1024 frames |
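The motion-encoder side of the table can be sketched in PyTorch. This is an illustrative sketch only: the pooling strategy, positional embeddings, and class names here are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MotionEncoderSketch(nn.Module):
    """Sketch matching the table: 8-layer Transformer, hidden dim 768,
    12 heads, projected to a 512-dim embedding. Mean-pooling and learned
    positional embeddings are assumptions."""
    def __init__(self, motion_dim=272, hidden=768, heads=12,
                 layers=8, embed_dim=512, max_len=1024):
        super().__init__()
        self.input_proj = nn.Linear(motion_dim, hidden)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, hidden))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.output_proj = nn.Linear(hidden, embed_dim)

    def forward(self, motion):            # motion: (B, T, 272)
        T = motion.shape[1]
        x = self.input_proj(motion) + self.pos_emb[:T]
        x = self.encoder(x)
        x = x.mean(dim=1)                 # pool over time (assumption)
        return self.output_proj(x)        # (B, 512)

enc = MotionEncoderSketch()
out = enc(torch.randn(2, 16, 272))
print(out.shape)  # torch.Size([2, 512])
```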

Motion Format

The model expects 272-dimensional motion features in absolute root format based on the SMPL body model (22 joints).

SMPL Body Model Requirement

This model was trained exclusively on motion data represented using the SMPL body model. Your input motions must:

  • Use the SMPL skeleton with 22 joints
  • Follow the SMPL joint ordering
  • Be converted to the 272-dimensional HumanML3D-style representation

If your motion data uses a different skeleton (e.g., CMU, Mixamo, custom rigs), you must first retarget it to SMPL before using this model.

Feature Dimensions

| Dimensions | Description |
|---|---|
| [0:2] | Root XZ velocities |
| [2:8] | Absolute heading rotation (6D representation) |
| [8:74] | Local joint positions (22 joints × 3) |
| [74:140] | Local joint velocities (22 joints × 3) |
| [140:272] | Joint rotations in 6D (22 joints × 6) |
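The layout above can be unpacked with plain NumPy slicing. A minimal sketch (the function name is illustrative, not part of the released API):

```python
import numpy as np

def split_motion_features(frame):
    """Split one 272-dim feature frame into its components,
    following the layout table (22 SMPL joints)."""
    return {
        "root_xz_vel":  frame[0:2],                    # (2,)
        "heading_6d":   frame[2:8],                    # (6,)
        "joint_pos":    frame[8:74].reshape(22, 3),    # (22, 3)
        "joint_vel":    frame[74:140].reshape(22, 3),  # (22, 3)
        "joint_rot_6d": frame[140:272].reshape(22, 6), # (22, 6)
    }

parts = split_motion_features(np.zeros(272, dtype=np.float32))
print({k: v.shape for k, v in parts.items()})
```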

The model automatically normalizes input motions using the bundled mean/std statistics.
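Concretely, this is per-dimension z-scoring with the bundled mean.npy / std.npy (each of shape (272,)). The model does this internally; the sketch below only illustrates the operation, using placeholder statistics instead of the real files.

```python
import numpy as np

# Placeholder stand-ins for np.load("mean.npy") / np.load("std.npy")
mean = np.zeros(272, dtype=np.float32)
std = np.ones(272, dtype=np.float32)

def normalize(motion, eps=1e-8):
    """Per-dimension z-score; motion is a (T, 272) array,
    broadcasting the (272,) statistics over frames."""
    return (motion - mean) / (std + eps)

motion = np.random.randn(120, 272).astype(np.float32)
normed = normalize(motion)
print(normed.shape)  # (120, 272)
```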

Training Details

| Parameter | Value |
|---|---|
| Dataset | MotionMillion (~884K training motions) |
| Batch Size | 256 |
| Training Iterations | 100,000 |
| Learning Rate (Motion Encoder) | 1e-4 |
| Learning Rate (Text Encoder) | 5e-5 |
| Loss Function | Symmetric InfoNCE |
| Temperature | Learnable (initialized at 0.07) |
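The symmetric InfoNCE objective with a learnable temperature can be sketched as follows. This is a generic CLIP-style formulation under stated assumptions (cosine similarity, temperature stored as a log value), not the project's training code.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(motion_emb, text_emb, log_temp):
    """Symmetric InfoNCE over a batch of paired (motion, text)
    embeddings; diagonal entries are the positive pairs."""
    m = F.normalize(motion_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = (m @ t.T) / log_temp.exp()   # (B, B), scaled by 1/temperature
    labels = torch.arange(m.shape[0])     # i-th motion matches i-th text
    loss_m2t = F.cross_entropy(logits, labels)
    loss_t2m = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_m2t + loss_t2m)

# Learnable log-temperature, initialized so exp(log_temp) = 0.07
log_temp = torch.log(torch.tensor(0.07)).requires_grad_()
loss = symmetric_infonce(torch.randn(8, 512), torch.randn(8, 512), log_temp)
print(loss.item())
```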

Performance

Retrieval performance (R@k) on random test subsets:

| Subset Size | Motion→Text R@1 | Motion→Text R@5 | Text→Motion R@1 | Text→Motion R@5 |
|---|---|---|---|---|
| 1,000 | 36.2% | 67.8% | 36.4% | 68.1% |
| 5,000 | 17.7% | 42.1% | 17.8% | 42.3% |
| 10,000 | 12.4% | 31.5% | 12.5% | 31.6% |

Note: Lower R@k on larger subsets is expected as the retrieval task becomes harder.
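R@k is the fraction of queries whose true pair ranks in the top-k retrieved candidates. A minimal sketch of how it can be computed from a similarity matrix (assuming the i-th query's correct match is the i-th candidate):

```python
import numpy as np

def recall_at_k(sim, k):
    """sim: (N, N) query-to-candidate similarity, sim[i, i] being
    the true pair. Returns the fraction of queries whose correct
    match appears in the top-k."""
    ranks = (-sim).argsort(axis=1)  # candidate indices, best first
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy check: identity similarity gives perfect R@1
sim = np.eye(5)
print(recall_at_k(sim, 1))  # 1.0
```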

Files in This Repository

| File | Size | Description |
|---|---|---|
| config.json | 239 B | Model configuration |
| pytorch_model.bin | 219 MB | Model weights |
| mean.npy | 1.2 KB | Motion normalization mean (272,) |
| std.npy | 1.2 KB | Motion normalization std (272,) |

Limitations

  • Trained on English text descriptions only
  • Motion format is specific to HumanML3D-style 272-dim representation
  • Best performance on motions similar to training distribution (daily activities, sports, etc.)

Citation

@article{motionmillion2026,
  title={MotionMillion: A Large-Scale Motion-Language Dataset},
  author={...},
  year={2026}
}

License

CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International)

This model is released for research and non-commercial use only.

Why Non-Commercial?

The MotionMillion training dataset aggregates motion data from multiple sources with varying licenses:

  • Some datasets permit commercial use
  • Some datasets restrict commercial use (e.g., AMASS, BABEL, certain MoCap databases)

To comply with the most restrictive terms, this model is released under CC BY-NC 4.0.

What This Means

Allowed:

  • Academic research
  • Personal projects
  • Non-commercial applications
  • Sharing and adapting with attribution

Not Allowed:

  • Commercial products or services
  • Selling access to the model
  • Using in revenue-generating applications

For commercial licensing inquiries, please contact the authors.
