
# kimodo-api

A REST API microservice wrapper around NVIDIA Kimodo, a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural-language prompts.


## Installation

```shell
docker pull ghcr.io/eyalenav/kimodo-api:latest
```

## Run

```shell
docker run --rm \
  --gpus '"device=0"' \
  -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```

The first run downloads Llama-3-8B-Instruct (~16 GB) and the Kimodo weights. Subsequent starts are fast because the weights are cached in `/root/.cache/huggingface`.


## API Reference

### GET /health

Check server status.

**Request**

```
GET http://localhost:9551/health
```

**Response**

```json
{
  "status": "ok"
}
```

### POST /generate

Generate a motion clip from a text prompt.

**Request**

```
POST http://localhost:9551/generate
Content-Type: application/json

{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural-language motion description |
| `num_frames` | int | 120 | Number of frames to generate |
| `fps` | int | 30 | Frames per second (metadata only) |
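
On the client side, the defaults and the required field can be enforced before sending the request. A minimal sketch — `build_generate_payload` is a hypothetical helper name, not part of this API; the defaults come from the table above:

```python
def build_generate_payload(prompt: str, num_frames: int = 120, fps: int = 30) -> dict:
    """Build a /generate request body, applying the documented defaults."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")  # the only mandatory field
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    return {"prompt": prompt, "num_frames": num_frames, "fps": fps}
```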

**Response**

Binary NPZ file (`application/octet-stream`) containing:

| Key | Shape | Description |
|---|---|---|
| `poses` | (T, 77, 3) | Joint rotations (axis-angle) per frame |
| `trans` | (T, 3) | Root translation per frame |
| `betas` | (16,) | SMPL body shape parameters |
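
After loading the NPZ, the arrays can be sanity-checked against these shapes. A sketch using only the shape constants documented above (`check_motion_arrays` is a hypothetical helper, not part of this API):

```python
import numpy as np

def check_motion_arrays(poses, trans, betas, fps=30):
    """Validate NPZ array shapes and return the clip duration in seconds."""
    T = poses.shape[0]
    assert poses.shape == (T, 77, 3), "poses must be (T, 77, 3) axis-angle"
    assert trans.shape == (T, 3), "trans must be (T, 3) root translation"
    assert betas.shape == (16,), "betas must be (16,) shape parameters"
    return T / fps  # e.g. 120 frames at 30 fps -> 4.0 s
```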

**Example**

```shell
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```
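
Because `poses` stores axis-angle rotations, downstream tools that need rotation matrices (e.g. for retargeting) can convert each joint with Rodrigues' formula. A generic NumPy sketch, independent of this API:

```python
import numpy as np

def axis_angle_to_matrix(aa: np.ndarray) -> np.ndarray:
    """Convert one axis-angle vector (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(aa)  # rotation angle is the vector's magnitude
    if theta < 1e-8:
        return np.eye(3)  # near-zero rotation: identity
    k = aa / theta  # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])  # cross-product (skew-symmetric) matrix
    # Rodrigues' rotation formula: R = I + sin(t) K + (1 - cos(t)) K^2
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

Applied frame-by-frame over `poses[:, j]`, this yields per-joint rotation matrices for joint `j`.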

### POST /generate_bvh

Generate motion and return it in BVH (Biovision Hierarchy) format.

**Request**

```
POST http://localhost:9551/generate_bvh
Content-Type: application/json

{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).
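
Since BVH is plain text, basic metadata can be read without a full parser. A sketch that relies only on the standard `Frames:` and `Frame Time:` lines of a BVH MOTION section (`bvh_frame_info` is a hypothetical helper, not part of this API):

```python
def bvh_frame_info(bvh_text: str) -> tuple:
    """Extract (frame_count, frame_time_seconds) from BVH text."""
    frames, frame_time = None, None
    for line in bvh_text.splitlines():
        line = line.strip()
        if line.startswith("Frames:"):
            frames = int(line.split(":")[1])
        elif line.startswith("Frame Time:"):
            frame_time = float(line.split(":")[1])
    if frames is None or frame_time is None:
        raise ValueError("missing Frames / Frame Time in BVH MOTION section")
    return frames, frame_time
```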

**Example**

```shell
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```

## Hardware Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to a specific GPU (e.g. `"0"`) |
| `PORT` | No | Override the default port 9551 |
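
For example, assuming the container honors `PORT` at startup as the table states, the run command from above can be adapted to serve on another port (the host-side `-p` mapping must change to match):

```shell
docker run --rm \
  --gpus '"device=0"' \
  -e PORT=8080 \
  -p 8080:8080 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```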

## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:

```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```

Full docker-compose.yml: github.com/EyalEnav/VisionAI-Flywheel


## Example: Full Python client

```python
import io

import numpy as np
import requests


def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from a text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()

    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],   # (T, 77, 3)
        "trans": npz["trans"],   # (T, 3)
        "betas": npz["betas"],   # (16,)
    }


# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```

## License

Apache 2.0.

Kimodo model weights are released under the NVIDIA Open Model License. Weights are downloaded at runtime and are not bundled in this image.