ChessGPT-2
Model Description
ChessGPT-2 is a series of transformer language models specifically trained on chess game data, demonstrating that language models can learn complex strategic reasoning through chess gameplay. This repository presents large-16 (200M parameters) as our best model.
The large-16 model is a 200-million parameter GPT-2 architecture trained on engine-generated chess games, capable of high-quality move prediction, strategic analysis, and chess reasoning.
Model Details
large-16 (Primary Model)
- Model Type: Autoregressive Transformer Language Model (GPT-2 architecture)
- Parameters: ~200 million
- Architecture:
- Layers: 16
- Attention Heads: 16
- Embedding Dimension: 1024
- Context Length: 1023 tokens
- Vocabulary Size: 32 tokens (chess-specific vocabulary)
- Training Framework: NanoGPT (PyTorch)
- Precision: Mixed precision training (bfloat16/float16)
Training Data
All models were trained on datasets from @adamkarvonen/chess_games:
Primary Dataset: Stockfish Games
- Dataset: stockfish_dataset_blocks.zip
- Description: 4.5GB of games in which White is played by Stockfish at ELO 3200 and Black by Stockfish at ELO levels ranging from 1300 to 3200
- Format: PGN (Portable Game Notation) games converted to 1024-character blocks
- Tokenization: Each block begins with ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
- Data Split: 99% training, 1% validation (random split with seed 2357)
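As an illustration of the block format described above, here is a minimal preprocessing sketch (assumptions: one game per block, long games truncated and short games padded with spaces; the actual preprocessing script may differ):

import random

def game_to_block(move_text: str, block_chars: int = 1024) -> str:
    # Prefix each game with the ';' delimiter and fix the length at 1024 characters.
    block = ";" + move_text                    # e.g. ";1.e4 e5 2.Nf3 ..."
    return block[:block_chars].ljust(block_chars)

def split_blocks(blocks, val_frac=0.01, seed=2357):
    # 99% train / 1% validation, shuffled with a fixed seed for reproducibility.
    rng = random.Random(seed)
    shuffled = list(blocks)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]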
Training Configuration
large-16 Training Settings
- Batch Size: 32 (micro-batch)
- Gradient Accumulation: 4 steps (effective batch size: 128)
- Learning Rate: 3e-4 with cosine decay to 3e-5
- Warmup: 2000 iterations
- Max Iterations: 600,000
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Dropout: 0.0 (no dropout for pretraining)
- Training Hardware: RTX 3090/4090 GPUs with distributed training support
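For reference, the settings above map onto NanoGPT's training configuration roughly as follows. This is a sketch: variable names follow NanoGPT's train.py, the file name is hypothetical, and any value not listed above is an assumption.

# config/train_chessgpt_large16.py -- hypothetical config file name
n_layer = 16
n_head = 16
n_embd = 1024
block_size = 1023
dropout = 0.0
bias = False
batch_size = 32                     # micro-batch size
gradient_accumulation_steps = 4     # effective batch size 128
learning_rate = 3e-4
min_lr = 3e-5                       # cosine decay floor
warmup_iters = 2000
max_iters = 600000
lr_decay_iters = 600000             # assumption: decay over the full run
beta1 = 0.9
beta2 = 0.95
dtype = 'bfloat16'                  # falls back to float16 on GPUs without bf16 support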
Usage
Loading the Model
import torch
from model import GPT, GPTConfig
# Load large-16 configuration
config = GPTConfig(
block_size=1023,
n_layer=16,
n_head=16,
n_embd=1024,
dropout=0.0,
bias=False,
vocab_size=32
)
# Initialize and load model
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
state_dict = checkpoint['model']
# Checkpoints saved from a torch.compile'd model may prefix keys with '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith('_orig_mod.'):
        state_dict[k[len('_orig_mod.'):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
# Move to GPU for practical inference (recommended)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code ...
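The tokenization and generation step above is elided; here is a minimal sketch, assuming the character-level vocabulary is available as a NanoGPT-style meta.pkl (the file name and location are assumptions, and the sampling parameters are arbitrary):

import pickle

# Assumption: a meta.pkl with the character <-> index mappings was saved by the
# data preparation script alongside the training data.
with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']

# Encode the prompt character by character into the 32-token vocabulary
idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long, device=device)

# Sample a continuation with NanoGPT's built-in GPT.generate helper
out = model.generate(idx, max_new_tokens=20, temperature=0.8, top_k=10)
print(''.join(itos[i] for i in out[0].tolist()))  # prompt plus sampled moves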
Input Format
All models expect properly tokenized chess games:
- Must start with ";" delimiter
- Standard PGN algebraic notation
- Prompts formatted like the training data's 1024-character blocks work best (a validation helper sketch follows below)
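A small helper for sanity-checking prompts against these constraints (a sketch; the one-character-per-token assumption follows from the 32-token character-level vocabulary):

def validate_prompt(prompt: str, max_chars: int = 1023) -> str:
    # Enforce the leading ';' game-start delimiter expected by the models.
    if not prompt.startswith(';'):
        prompt = ';' + prompt
    # Keep the prompt within the 1023-token context window (one character per token).
    if len(prompt) > max_chars:
        raise ValueError(f"prompt exceeds the {max_chars}-character context window")
    return prompt

print(validate_prompt("1.e4 e5 2.Nf3 Nc6"))  # -> ";1.e4 e5 2.Nf3 Nc6"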
Performance Characteristics
The large-16 model demonstrates:
- Superior Chess Reasoning: Advanced understanding of tactical and strategic patterns
- High-Quality Planning: Excellent long-term game planning capabilities
- Pattern Recognition: Enhanced recognition across diverse chess positions
- Substantial Scale: 202.5M parameters (2.3GB checkpoint)
- Optimal Architecture: 16 layers, 16 heads, 1024 embedding dimension
- Near-Expert Performance: strongest variant in the series (validation loss 0.2578), with potential for expert-level chess understanding
Model Series & Ablation Studies
This repository represents extensive research into scaling transformer models for chess. Our complete series includes:
Parameter Scaling Ablations
| Model Variant | Parameters | Layers | Heads | Embedding Dim | Model Size | Val Loss | Key Characteristics |
|---|---|---|---|---|---|---|---|
| small-8 | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline |
| small-16 | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study |
| small-24 | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model |
| small-36 | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth |
| medium-12 | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium |
| medium-16 | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium |
| large-16 | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | Primary model |
Dataset Comparison Studies
| Model | Dataset | Source | Size | Characteristics |
|---|---|---|---|---|
| All Stockfish models | Stockfish | Engine games | 4.5GB | Optimal play patterns |
| Lichess model | Lichess | Human games | 6GB | Human decision patterns |
Key Research Findings
- Depth vs Width Trade-offs: Small models (512 embedding dim, 8 heads) scale from 25.7M to 113.8M parameters purely through depth (8 → 36 layers)
- Clear Performance Scaling: Validation loss improves consistently with depth: 0.2944 (8 layers) → 0.2583 (36 layers)
- Architecture Variations: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
- Parameter Efficiency: small-36 (113.8M) and medium-16 (114.1M) reach nearly identical parameter counts through different architectures (depth vs. width)
- No Overfitting: validation loss was still improving at 600k iterations across all models, indicating remaining learning potential
- Dataset Impact: Significant behavioral differences between engine vs. human training data
Evaluation Metrics
Models should be evaluated on:
- Move Legality: Percentage of generated moves that are legal
- Game Continuation: Quality and coherence of extended game sequences
- Tactical Recognition: Ability to identify tactical patterns and combinations
- Strategic Understanding: Long-term positional planning and evaluation
- Opening Knowledge: Familiarity with established opening theory
- Endgame Technique: Performance in simplified positions
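For example, move legality can be scored by replaying a generated transcript with a chess library. Below is a sketch using python-chess (an external dependency, not part of this repository); it stops scoring at the first illegal move.

import re
import chess  # python-chess (pip install chess)

def legal_move_fraction(transcript: str) -> float:
    # Strip the leading ';' and the move numbers ("1.", "2.", ...) to obtain bare SAN moves.
    moves = re.sub(r'\d+\.', ' ', transcript.lstrip(';')).split()
    board = chess.Board()
    legal = 0
    for san in moves:
        try:
            board.push_san(san)          # raises ValueError on illegal or unparsable moves
            legal += 1
        except ValueError:
            break                        # stop at the first illegal move
    return legal / max(1, len(moves))

print(legal_move_fraction(";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"))  # -> 1.0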
Intended Use
Primary Use Cases
- Chess Analysis: High-quality position evaluation and move suggestion
- Research: Studying emergent reasoning in language models
- Education: Chess learning and pattern recognition tools
- AI Development: Baseline for chess AI systems
Limitations
- Specialized for chess gameplay only
- Limited to standard chess rules and notation
- Requires proper tokenization format
- GPU recommended for practical inference
- May not generalize beyond chess domain
Alternative Model Variants
For Different Use Cases:
- Fast Inference: Use small-8 for minimal resource requirements
- Depth vs Width: Compare small-16/24/36 for layer depth ablations
- Balanced Performance: Use medium-12 or medium-16 for mid-range applications
- Maximum Performance: Use large-16 for best overall results
- Human Behavior Studies: Use lichess model for human-like gameplay patterns
Computational Requirements:
- Small Models (8-36 layers): CPU inference possible, GPU recommended
- Medium Models: GPU recommended for practical use
- Large Model: Single high-end GPU required
Technical Implementation
Model Architecture
Based on GPT-2 with chess-specific adaptations:
- Vocabulary: Reduced to 32 chess-specific tokens
- Context: Optimized for 1023-token chess game sequences
- Training: Custom data loading for chess game blocks
- Framework: Built on NanoGPT for simplicity and efficiency
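The 202.5M figure can be sanity-checked with a back-of-the-envelope count of the GPT-2 weights. This is a rough estimate that ignores LayerNorm parameters and assumes weight tying between the input and output embeddings:

n_layer, n_embd, vocab_size, block_size = 16, 1024, 32, 1023
per_layer = 12 * n_embd ** 2                             # attention (4*d^2) + MLP (8*d^2), bias=False
embeddings = vocab_size * n_embd + block_size * n_embd   # token + learned position embeddings
total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")                 # ~202.4M, consistent with the reported 202.5M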
Training Insights
- Convergence: Smooth training curves across all scales
- Memory Efficiency: Optimized for multi-GPU training
- Data Processing: Custom tokenization preserving chess structure
- Evaluation: Chess-specific validation metrics
Ethical Considerations
- Models trained exclusively on chess data pose minimal ethical risks
- No personal data or sensitive information in training datasets
- Intended for educational, research, and recreational purposes
- Computational requirements may limit accessibility
- Models do not generalize beyond chess domain
Citation
If you use ChessGPT-2 in your research, please cite:
@misc{chessgpt,
title={ChessGPT-2},
author={[Your Name]},
year={2024},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/[your-username]/chessgpt-2}
}
@dataset{chess_games_dataset,
title={Chess Games Dataset},
author={Adam Karvonen},
year={2024},
url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}
References
- NanoGPT: karpathy/nanoGPT
- Chess Dataset: @adamkarvonen/chess_games
- GPT-2 Paper: Radford et al., 2019
- Scaling Laws: Kaplan et al., 2020
License
[MIT, Apache 2.0]