# kimodo-api
REST API microservice wrapper around [NVIDIA Kimodo](https://github.com/nv-tlabs/kimodo) — text-to-motion diffusion model generating 77-joint SOMA skeleton motion from natural language prompts.
---
## Installation
```bash
docker pull ghcr.io/eyalenav/kimodo-api:latest
```
### Run
```bash
docker run --rm \
--gpus '"device=0"' \
-p 9551:9551 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGINGFACE_TOKEN=hf_... \
ghcr.io/eyalenav/kimodo-api:latest
```
> **First run:** downloads Llama-3-8B-Instruct (~16 GB) and Kimodo weights. Subsequent starts are fast (weights cached in `/root/.cache/huggingface`).
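
Because the first start can spend several minutes downloading weights, it helps to poll `/health` before sending generation requests. A minimal readiness-wait sketch in Python (assumes the container is mapped to `localhost:9551` and `requests` is installed):

```python
import time

import requests

HEALTH_URL = "http://localhost:9551/health"

# Poll /health until the server responds; the first start can take a while
# because model weights are downloaded on demand.
for _ in range(120):
    try:
        if requests.get(HEALTH_URL, timeout=5).status_code == 200:
            print("kimodo-api is ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
else:
    raise RuntimeError("kimodo-api did not become ready in time")
```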
---
## API Reference
### `GET /health`
Check server status.
**Request**
```
GET http://localhost:9551/health
```
**Response**
```json
{
"status": "ok"
}
```
---
### `POST /generate`
Generate a motion clip from a text prompt.
**Request**
```
POST http://localhost:9551/generate
Content-Type: application/json
```
```json
{
"prompt": "person pushing through a crowd aggressively",
"num_frames": 120,
"fps": 30
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | required | Natural language motion description |
| `num_frames` | int | `120` | Number of frames to generate |
| `fps` | int | `30` | Frames per second (metadata only) |
**Response**
Binary NPZ file (`application/octet-stream`).
The NPZ contains:
| Key | Shape | Description |
|---|---|---|
| `poses` | `(T, 77, 3)` | Joint rotations (axis-angle) per frame |
| `trans` | `(T, 3)` | Root translation per frame |
| `betas` | `(16,)` | SMPL body shape parameters |
**Example**
```bash
curl -X POST http://localhost:9551/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "person falling to the ground after being pushed"}' \
--output output_motion.npz
```
---
### `POST /generate_bvh`
Generate motion and return as BVH (Biovision Hierarchy) format.
**Request**
```
POST http://localhost:9551/generate_bvh
Content-Type: application/json
```
```json
{
"prompt": "two people fighting, punches thrown",
"num_frames": 150
}
```
**Response**
BVH text file (`text/plain`).
**Example**
```bash
curl -X POST http://localhost:9551/generate_bvh \
-H "Content-Type: application/json" \
-d '{"prompt": "drunk person stumbling and falling"}' \
--output output_motion.bvh
```
---
## Hardware Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB VRAM) | RTX 6000 Ada / A100 |
| VRAM | 24 GB | 48 GB |
| RAM | 32 GB | 64 GB |
| Disk | 50 GB | 100 GB |
| CUDA | 12.1+ | 12.8 |
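
A quick way to check that a host meets the VRAM requirement before pulling the image is to query `nvidia-smi`; a small sketch (shells out to `nvidia-smi`, which reports memory in MiB):

```python
import subprocess

# List each GPU with its total VRAM as reported by nvidia-smi.
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    text=True,
)
for line in out.strip().splitlines():
    name, mem = (part.strip() for part in line.split(",", 1))
    print(f"{name}: {mem}")
```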
---
## Environment Variables
| Variable | Required | Description |
|---|---|---|
| `HUGGINGFACE_TOKEN` | Yes | HF token with access to `meta-llama/Meta-Llama-3-8B-Instruct` |
| `CUDA_VISIBLE_DEVICES` | No | Limit to specific GPU (e.g. `"0"`) |
| `PORT` | No | Override default port `9551` |
---
## Integration with VisionAI-Flywheel
`kimodo-api` is designed to run alongside `render-api` and `cosmos-transfer` as part of the full pipeline:
```yaml
# docker-compose.yml excerpt
services:
  kimodo-api:
    image: ghcr.io/eyalenav/kimodo-api:latest
    ports:
      - "9551:9551"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - hf_cache:/root/.cache/huggingface
    environment:
      - HUGGINGFACE_TOKEN=${HUGGINGFACE_TOKEN}
```
Full `docker-compose.yml`: [github.com/EyalEnav/VisionAI-Flywheel](https://github.com/EyalEnav/VisionAI-Flywheel)
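
Inside the Compose network, sibling services reach the API by service name rather than `localhost`. A minimal sketch of a call from another container in the same network (hypothetical caller, assumes `requests` is available there):

```python
import requests

# Within the Compose network the service is addressed by its name, not localhost.
KIMODO_URL = "http://kimodo-api:9551"

response = requests.post(
    f"{KIMODO_URL}/generate",
    json={"prompt": "person pushing through a crowd aggressively"},
    timeout=300,
)
response.raise_for_status()
with open("motion.npz", "wb") as f:
    f.write(response.content)
```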
---
## Example: Full Python client
```python
import requests
import numpy as np
import io
def generate_motion(prompt: str, num_frames: int = 120) -> dict:
    """Generate motion NPZ from text prompt."""
    response = requests.post(
        "http://localhost:9551/generate",
        json={"prompt": prompt, "num_frames": num_frames},
        timeout=120,
    )
    response.raise_for_status()
    npz = np.load(io.BytesIO(response.content))
    return {
        "poses": npz["poses"],  # (T, 77, 3)
        "trans": npz["trans"],  # (T, 3)
        "betas": npz["betas"],  # (16,)
    }

# Example usage
motion = generate_motion("security guard running toward an incident")
print(f"Generated {motion['poses'].shape[0]} frames")
```
---
## License
Apache 2.0
> Kimodo model weights are released under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Weights are downloaded at runtime and are not bundled in this image.