# kimodo-api

A REST API microservice wrapper around NVIDIA Kimodo, a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural-language prompts.
## Installation

```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest
```
## Run

```bash
docker run --rm \
  --gpus '"device=0"' \
  -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```
The first run downloads Llama-3-8B-Instruct (~16 GB) and the Kimodo weights. Subsequent starts are fast because the weights are cached in `/root/.cache/huggingface`.
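To avoid the cold-start download inside the container, the Llama weights can be fetched ahead of time into the mounted cache. A minimal sketch using `huggingface_hub` (the Kimodo weights repo id isn't stated here, so only the LLM is pre-fetched):

```python
from huggingface_hub import snapshot_download

# Downloads into ~/.cache/huggingface by default, which the
# docker run command above mounts into the container.
snapshot_download(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token="hf_...",  # same token as HUGGINGFACE_TOKEN
)
```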
## API Reference
### `GET /health`

Check server status.

**Request**

```http
GET http://localhost:9551/health
```

**Response**

```json
{
  "status": "ok"
}
```
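Since a first start can spend minutes downloading weights, a client may want to poll this endpoint before sending work. A minimal sketch (the `wait_until_healthy` helper is illustrative, not part of the API):

```python
import time
import requests

def wait_until_healthy(base_url: str = "http://localhost:9551", timeout_s: float = 60.0) -> None:
    """Poll GET /health until the server reports {"status": "ok"}."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{base_url}/health", timeout=5)
            if r.ok and r.json().get("status") == "ok":
                return
        except requests.ConnectionError:
            pass  # server may still be loading model weights
        time.sleep(2)
    raise TimeoutError("kimodo-api did not become healthy in time")
```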
### `POST /generate`

Generate a motion clip from a text prompt.

**Request**

```http
POST http://localhost:9551/generate
Content-Type: application/json

{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural-language motion description |
| `num_frames` | int | 120 | Number of frames to generate |
| `fps` | int | 30 | Frames per second (metadata only) |
**Response**

Binary NPZ file (`application/octet-stream`). The NPZ contains:

| Key | Shape | Description |
|---|---|---|
| `poses` | (T, 77, 3) | Joint rotations (axis-angle) per frame |
| `trans` | (T, 3) | Root translation per frame |
| `betas` | (16,) | SMPL body shape parameters |
**Example**

```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```
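To sanity-check the downloaded archive, the arrays can be loaded and their shapes verified against the table above (a sketch, assuming the `output_motion.npz` produced by the curl call):

```python
import numpy as np

motion = np.load("output_motion.npz")
poses, trans, betas = motion["poses"], motion["trans"], motion["betas"]
assert poses.shape[1:] == (77, 3)          # (T, 77, 3) axis-angle per joint
assert trans.shape == (poses.shape[0], 3)  # (T, 3) root translation
assert betas.shape == (16,)                # body shape parameters
print(f"{poses.shape[0]} frames, {poses.shape[1]} joints")
```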
### `POST /generate_bvh`

Generate a motion clip and return it in BVH (Biovision Hierarchy) format.

**Request**

```http
POST http://localhost:9551/generate_bvh
Content-Type: application/json

{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).
**Example**

```bash
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```
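The same request from Python, writing the BVH text straight to disk (a sketch mirroring the curl example):

```python
import requests

resp = requests.post(
    "http://localhost:9551/generate_bvh",
    json={"prompt": "drunk person stumbling and falling", "num_frames": 150},
    timeout=120,
)
resp.raise_for_status()

# The response body is BVH text (text/plain), so it can be saved as-is.
with open("output_motion.bvh", "w") as f:
    f.write(resp.text)
```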
## Hardware Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to a specific GPU (e.g. `"0"`) |
| `PORT` | No | Override the default port 9551 |
## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:
```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```
Full `docker-compose.yml`: github.com/EyalEnav/VisionAI-Flywheel
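Note that inside the compose network, sibling services such as `render-api` reach the API by service name rather than `localhost`. A short sketch (`KIMODO_URL` is a hypothetical variable a caller might define):

```python
import os
import requests

# Compose's internal DNS resolves the service name "kimodo-api".
BASE_URL = os.environ.get("KIMODO_URL", "http://kimodo-api:9551")
print(requests.get(f"{BASE_URL}/health", timeout=5).json())
```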
## Example: Full Python client

```python
import io

import numpy as np
import requests

def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()
    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],  # (T, 77, 3)
        "trans": npz["trans"],  # (T, 3)
        "betas": npz["betas"],  # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```
## License

Apache 2.0.

Kimodo model weights are released under the NVIDIA Open Model License. Weights are downloaded at runtime and are not bundled in this image.