Architecture

  • Decoder-only Transformer (GPT-style)
  • 12 layers
  • Hidden size: 768
  • Attention heads: 12
  • Context length: 512
  • Parameters: ~100M
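
A minimal sketch of how this architecture could be instantiated with the Hugging Face `transformers` library is shown below. The exact configuration (vocabulary size, dropout, weight tying) is not stated in this card, so the values marked as assumptions are illustrative only.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical reconstruction of the architecture described above.
# Vocabulary size is an assumption (GPT-2 BPE); with it, the parameter
# count lands near 124M, i.e. the ~100M order of magnitude quoted here.
config = GPT2Config(
    n_layer=12,        # 12 decoder layers
    n_embd=768,        # hidden size 768
    n_head=12,         # 12 attention heads
    n_positions=512,   # context length 512
    vocab_size=50257,  # assumed; not stated in this card
)

model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```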

Training

  • Dataset: News articles (CNN/DailyMail – articles only)
  • Objective: Causal Language Modeling
  • Hardware: Google Colab GPU
  • Precision: FP16
  • Training steps: 2000
  • Optimizations: Gradient checkpointing, gradient accumulation
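
The optimizations listed above can be combined in a standard `transformers` Trainer setup. The sketch below is a reconstruction under assumptions, not the exact script used for this model: the batch size, accumulation steps, output directory, and preprocessing details are placeholders chosen to fit a single Colab GPU.

```python
from datasets import load_dataset
from transformers import (
    GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling,
)

# Tokenizer and data: CNN/DailyMail articles only (summaries are ignored).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizer has no pad token

raw = load_dataset("cnn_dailymail", "3.0.0", split="train")

def tokenize(batch):
    return tokenizer(batch["article"], truncation=True, max_length=512)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# From-scratch model matching the architecture section (vocabulary size assumed).
model = GPT2LMHeadModel(GPT2Config(n_layer=12, n_embd=768, n_head=12, n_positions=512))

args = TrainingArguments(
    output_dir="gpt-news-100m",      # placeholder output directory
    max_steps=2000,                  # limited-step run, as noted above
    per_device_train_batch_size=2,   # assumed value, sized for a Colab GPU
    gradient_accumulation_steps=8,   # assumed value; effective batch of 16 sequences
    gradient_checkpointing=True,     # trade recompute for activation memory
    fp16=True,                       # mixed-precision training
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()
```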

Training Loss Curve

The training loss decreased steadily from approximately 9.1 to 5.3 over the 2000 training steps, indicating stable optimization while training the ~100M-parameter language model from scratch.

Intended Use

  • Research
  • Educational purposes
  • Text generation experiments
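
For text generation experiments, the model can be loaded like any causal LM checkpoint. The repository id below is a placeholder, and the sampling settings are illustrative defaults rather than recommendations from the model authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/your-model"  # placeholder: substitute the actual repository id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The city council announced on Tuesday that", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    no_repeat_ngram_size=3,  # helps curb the repetitive outputs noted under Limitations
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```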

Limitations

  • Not instruction-tuned
  • Trained for limited steps
  • Outputs may be verbose or repetitive