rob-x-ai 
posted an update 5 days ago
Genesis 1B is now public. 🔥

I’m training a 1.003B parameter model from scratch on 2× RTX 4090s and opened a public playground for early checkpoints.

The real bottleneck wasn’t training.
It was checkpointing:

An FSDP full-state gather over PCIe kept ending in NCCL timeouts.
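
To give a feel for how much data a full-state gather has to move, here is a rough estimate. The parameter count and GPU count come from the post. The precision and optimizer are assumptions, since the post states neither: fp32 weights and AdamW with two fp32 moments. The function names are made up for illustration.

```python
# Back-of-the-envelope size of the training state that a full gather
# collects on one rank, versus what each rank holds when sharded.
# ASSUMPTIONS (not from the post): fp32 weights, AdamW with two fp32 moments.

PARAMS = 1_003_000_000        # 1.003B parameters, as stated in the post
BYTES_PER_PARAM = 4 + 4 + 4   # weight + exp_avg + exp_avg_sq, all fp32 (assumed)
WORLD_SIZE = 2                # 2x RTX 4090, as stated in the post


def full_state_bytes(params: int = PARAMS, per_param: int = BYTES_PER_PARAM) -> int:
    """Bytes that end up on a single rank after a full-state gather."""
    return params * per_param


def per_rank_bytes(world_size: int = WORLD_SIZE) -> float:
    """Bytes each rank writes when the state stays evenly sharded."""
    return full_state_bytes() / world_size


if __name__ == "__main__":
    gib = 1024 ** 3
    print(f"full gather on one rank: {full_state_bytes() / gib:.1f} GiB")
    print(f"sharded, per rank:       {per_rank_bytes() / gib:.1f} GiB")
```

Under these assumptions, roughly half of the full state has to cross PCIe to reach the gathering rank before anything is written to disk. A sharded save skips that transfer.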

Switching to DCP sharded checkpoints changed the trajectory of the run.

- Playground: rob-x-ai/genesis-1b-playground
- Write-up: https://kroonen.ai/blog/distributed-checkpoint-failures-rtx4090/