# kimodo-api

REST API microservice wrapper around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo), a text-to-motion diffusion model that generates 77-joint SOMA skeleton motion from natural-language prompts.

---

## Installation

```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest
```

### Run

```bash
docker run --rm \
  --gpus '"device=0"' \
  -p 9551:9551 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```

> **First run:** downloads Llama-3-8B-Instruct (~16 GB) and Kimodo weights. Subsequent starts are fast (weights cached in `/root/.cache/huggingface`).

---

## API Reference

### `GET /health`

Check server status.

**Request**
```
GET http://localhost:9551/health
```

**Response**
```json
{
  "status": "ok"
}
```
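
**Example**

A one-line check from the host (the expected body is shown above):

```bash
curl http://localhost:9551/health
```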

---

### `POST /generate`

Generate a motion clip from a text prompt.

**Request**
```
POST http://localhost:9551/generate
Content-Type: application/json
```

```json
{
  "prompt": "person pushing through a crowd aggressively",
  "num_frames": 120,
  "fps": 30
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural language motion description |
| `num_frames` | int | `120` | Number of frames to generate |
| `fps` | int | `30` | Frames per second (metadata only) |

**Response**

Binary NPZ file (`application/octet-stream`).

The NPZ contains:
| Key | Shape | Description |
|---|---|---|
| `poses` | `(T, 77, 3)` | Joint rotations (axis-angle) per frame |
| `trans` | `(T, 3)` | Root translation per frame |
| `betas` | `(16,)` | SMPL body shape parameters |

**Example**
```bash
curl -X POST http://localhost:9551/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "person falling to the ground after being pushed"}' \
  --output output_motion.npz
```
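
To confirm the download, the file can be inspected with NumPy. This is a minimal sketch assuming the default `num_frames` of 120; the full Python client below wraps the same call:

```python
import numpy as np

# load the NPZ saved by the curl call above and print each array's shape
data = np.load("output_motion.npz")
print({key: data[key].shape for key in data.files})
# expected with defaults: poses (120, 77, 3), trans (120, 3), betas (16,)
```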

---

### `POST /generate_bvh`

Generate a motion clip and return it in BVH (Biovision Hierarchy) format.

**Request**
```
POST http://localhost:9551/generate_bvh
Content-Type: application/json
```

```json
{
  "prompt": "two people fighting, punches thrown",
  "num_frames": 150
}
```

**Response**

BVH text file (`text/plain`).

**Example**
```bash
curl -X POST http://localhost:9551/generate_bvh \
  -H "Content-Type: application/json" \
  -d '{"prompt": "drunk person stumbling and falling"}' \
  --output output_motion.bvh
```
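
A standard BVH file carries its frame count and frame time in the `MOTION` section, so a quick sanity check is possible without a full parser (a sketch, assuming the usual `Frames:` / `Frame Time:` header lines):

```python
# scan the BVH header for the frame count and frame time
with open("output_motion.bvh") as f:
    for raw in f:
        line = raw.strip()
        if line.startswith("Frames:"):
            print("frames:", int(line.split()[1]))
        elif line.startswith("Frame Time:"):
            print("frame time (s):", float(line.split()[2]))
            break
```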

---

## Hardware Requirements

| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |

---

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to specific GPU (e.g. `"0"`) |
| `PORT` | No | Override default port `9551` |
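
For example, to serve on a different port (a sketch, assuming the container honors `PORT` as described and listens on that port):

```bash
docker run --rm \
  --gpus '"device=0"' \
  -p 8080:8080 \
  -e PORT=8080 \
  -e HUGGINGFACE_TOKEN=hf_... \
  ghcr.io/eyalenav/kimodo-api:latest
```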

---

## Integration with VisionAI-Flywheel

`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:

```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```
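
The excerpt references a named volume, `hf_cache`; lifted into a standalone compose file, it also needs the top-level volume declaration:

```yaml
volumes:
  hf_cache:
```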

Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)

---

## Example: Full Python client

```python
import requests
import numpy as np
import io

def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120
    )
    response.raise_for_status()
    
    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],   # (T, 77, 3)
        "trans": npz["trans"],   # (T, 3)
        "betas": npz["betas"],   # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```
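
A matching helper for the BVH endpoint follows the same conventions (the function name and `out_path` parameter are illustrative, not part of the API):

```python
import requests

def generate_bvh(prompt: str, num_frames: int = 120,
                 out_path: str = "output_motion.bvh") -> str:
    """Generate motion from a text prompt and save it as a BVH file."""
    response = requests.post(
        "http://localhost:9551/generate_bvh",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()
    # /generate_bvh returns BVH text (text/plain), so write it as text
    with open(out_path, "w") as f:
        f.write(response.text)
    return out_path
```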

---

## License

Apache 2.0

> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.