FineWeb GPT — trained from scratch

A GPT-style language model built as a learning exercise, with every component written from scratch: BPE tokenizer, transformer architecture, and training loop.

Architecture

Parameters: 8.4M
Layers: 6
d_model: 256
Attention heads: 8
Context length: 512
Vocabulary: 8,192 (byte-level BPE)
Positional encoding: RoPE
Normalization: RMSNorm
Activation: SwiGLU
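The RMSNorm and SwiGLU choices above can be sketched in PyTorch. This is an illustrative implementation under common conventions, not the repository's actual code; the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale each vector so its root-mean-square is ~1, then apply a learned gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: (SiLU(x W_gate) * x W_up) W_down."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

With d_model = 256, each transformer block would pair one RMSNorm + attention and one RMSNorm + SwiGLU sub-layer; the hidden width of the SwiGLU is a free hyperparameter not stated in the card.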

Training

Dataset: FineWeb-Edu sample-10BT (~5M tokens)
Steps: 1,800
Optimizer: AdamW with cosine LR schedule and warmup
Val loss: 5.2764
Perplexity: 195.7
Hardware: Apple Silicon (MPS)
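The cosine-with-warmup schedule can be sketched as a simple function of the step count. Only the 1,800 total steps come from the card; the peak LR, warmup length, and floor LR below are illustrative assumptions.

```python
import math

def lr_at_step(step: int, max_lr: float = 3e-4, warmup_steps: int = 100,
               total_steps: int = 1800, min_lr: float = 3e-5) -> float:
    # Linear warmup to max_lr, then cosine decay to min_lr.
    # max_lr, warmup_steps, and min_lr are assumed values, not from the card.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The reported perplexity is consistent with the validation loss: exp(5.2764) ≈ 195.7.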

Load the tokenizer

```python
from transformers import PreTrainedTokenizerFast

# "REPO_ID" is a placeholder for the model's Hub repository ID.
tokenizer = PreTrainedTokenizerFast.from_pretrained("REPO_ID")
print(tokenizer("The study of mathematics").tokens())
```
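For reference, a byte-level BPE tokenizer with an 8,192-entry vocabulary like this one can be trained with the `tokenizers` library. This is a sketch of the general approach, not the repository's training script; the special-token name and corpus iterator are assumptions.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_bpe_tokenizer(text_iterator, vocab_size: int = 8192) -> Tokenizer:
    # Byte-level BPE, matching the "8,192 (byte-level BPE)" vocabulary above.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>"],  # assumed special token, not stated in the card
    )
    tokenizer.train_from_iterator(text_iterator, trainer=trainer)
    return tokenizer
```

The byte-level pre-tokenizer and decoder make encode/decode lossless on arbitrary text, which is why no `[UNK]` token is needed.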

Limitations

Learning exercise only — trained on ~5M tokens, perplexity 196. Outputs are repetitive and often incoherent.

Stack

PyTorch · HuggingFace datasets · tokenizers · wandb · huggingface_hub
