# FineWeb GPT — trained from scratch

A GPT-style language model built end to end as a learning exercise: the BPE tokenizer, the transformer architecture, and the training loop were all written from scratch.
## Architecture

| Hyperparameter | Value |
|---|---|
| Parameters | 8.4M |
| Layers | 6 |
| d_model | 256 |
| Attention heads | 8 |
| Context length | 512 |
| Vocabulary | 8,192 (BPE ByteLevel) |
| Positional encoding | RoPE |
| Normalization | RMSNorm |
| Activation | SwiGLU |
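The RMSNorm used in place of LayerNorm is simple enough to sketch numerically. Below is a minimal pure-Python version of the formula for illustration; the model's actual implementation is in PyTorch and is not shown here.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: scale each element by the reciprocal root-mean-square of the
    # vector, then apply a learned per-dimension gain. Unlike LayerNorm,
    # there is no mean subtraction and no bias term.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

print(rms_norm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0]))
```

With a unit gain, the output vector always has a root-mean-square of 1, which is what stabilizes activations across layers.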
## Training

| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu sample-10BT (~5M tokens) |
| Steps | 1,800 |
| Optimizer | AdamW, cosine LR schedule with warmup |
| Validation loss | 5.2764 |
| Perplexity | 195.7 |
| Hardware | Apple Silicon (MPS) |
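The perplexity in the table is just the exponential of the validation cross-entropy loss, which can be checked directly:

```python
import math

val_loss = 5.2764              # validation cross-entropy in nats/token
perplexity = math.exp(val_loss)
print(round(perplexity, 1))    # → 195.7
```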
## Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

# Replace REPO_ID with this model's Hugging Face repository ID
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
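For reference, a byte-level BPE tokenizer like the one used here can be trained with the `tokenizers` library. The tiny corpus, vocabulary size, and special token below are illustrative placeholders, not this model's actual training setup:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, matching the tokenizer family named in the model card.
# A small vocab_size is used so this sketch trains instantly in memory;
# the real tokenizer uses vocab_size=8192 and a much larger corpus.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["<|endoftext|>"])
corpus = [
    "The study of mathematics",
    "Language models predict the next token",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

enc = tokenizer.encode("The study of mathematics")
print(enc.tokens)
print(tokenizer.decode(enc.ids))
```

Byte-level pre-tokenization guarantees there are no out-of-vocabulary characters, since any string decomposes into bytes.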
## Limitations

This model is a learning exercise only. It was trained on ~5M tokens and reaches a validation perplexity of ~196, so its outputs are repetitive and often incoherent. Do not use it for any practical task.
## Stack
PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub