FineWeb GPT — trained from scratch

A GPT-style language model built as a learning exercise, with every component written from scratch: BPE tokenizer, transformer architecture, and training loop.

Architecture

Parameters: 8.4M
Layers: 6
d_model: 256
Attention heads: 8
Context length: 512
Vocabulary: 8,192 (byte-level BPE)
Positional encoding: RoPE
Normalization: RMSNorm
Activation: SwiGLU
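The RMSNorm and SwiGLU choices above can be sketched in PyTorch. This is an illustrative implementation under common conventions, not the repository's actual code; the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector so its root-mean-square is ~1, then apply a learned gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

With d_model = 256, each transformer block would pair one RMSNorm + attention and one RMSNorm + SwiGLU sub-layer; the hidden width of the SwiGLU is a free hyperparameter not stated in the card.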

Training

Dataset: FineWeb-Edu sample-10BT (~5M tokens)
Steps: 1,800
Optimizer: AdamW with cosine LR schedule and warmup
Val loss: 5.2764
Perplexity: 195.7
Hardware: Apple Silicon (MPS)
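The cosine-with-warmup schedule can be sketched as a simple function of the step count. Only the 1,800 total steps come from the card; the peak LR, warmup length, and floor LR below are illustrative assumptions.

```python
import math

def lr_at_step(step: int, max_lr: float = 3e-4, warmup_steps: int = 100,
               total_steps: int = 1800, min_lr: float = 3e-5) -> float:
    # Linear warmup to max_lr, then cosine decay to min_lr.
    # max_lr, warmup_steps, and min_lr are assumed values, not from the card.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The reported perplexity is consistent with the validation loss: exp(5.2764) ≈ 195.7.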

Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

# "REPO_ID" is a placeholder for the model's Hub repository ID.
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
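For reference, a byte-level BPE tokenizer with an 8,192-entry vocabulary like this one can be trained with the `tokenizers` library. This is a sketch of the general approach, not the repository's training script; the special-token name and corpus iterator are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_bpe_tokenizer(text_iterator, vocab_size: int = 8192) -> Tokenizer:
    # Byte-level BPE, matching the "8,192 (byte-level BPE)" vocabulary above.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>"],  # assumed special token, not stated in the card
    )
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)
    return tokenizer
```

The byte-level pre-tokenizer and decoder make encode/decode lossless on arbitrary text, which is why no `[UNK]` token is needed.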

Limitations

Learning exercise only — trained on ~5M tokens, perplexity 196. Outputs are repetitive and often incoherent.

Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub
