ChessGPT-2
Model Description
ChessGPT-2 is a series of transformer language models specifically trained on chess game data, demonstrating that language models can learn complex strategic reasoning through chess gameplay. This repository presents large-16 (200M parameters) as our best model.
The large-16 model is a 200-million parameter GPT-2 architecture trained on engine-generated chess games, capable of high-quality move prediction, strategic analysis, and chess reasoning.
Model Details
large-16 (Primary Model)
- Model Type: Autoregressive Transformer Language Model (GPT-2 architecture)
- Parameters: ~200 million
- Architecture:
- Layers: 16
- Attention Heads: 16
- Embedding Dimension: 1024
- Context Length: 1023 tokens
- Vocabulary Size: 32 tokens (chess-specific vocabulary)
- Training Framework: NanoGPT (PyTorch)
- Precision: Mixed precision training (bfloat16/float16)
Training Data
All models were trained on datasets from @adamkarvonen/chess_games:
Primary Dataset: Stockfish Games
- Dataset: stockfish_dataset_blocks.zip
- Description: 4.5GB of games in which White is played by Stockfish at ELO 3200 and Black by Stockfish at ELO levels ranging from 1300 to 3200
- Format: PGN (Portable Game Notation) games converted to 1024-character blocks
- Tokenization: Each block begins with ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
- Data Split: 99% training, 1% validation (random split with seed 2357)
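As an illustration of the block format described above, here is a minimal preprocessing sketch (assumptions: one game per block, long games truncated and short games padded with spaces; the actual preprocessing script may differ):

import random

def game_to_block(move_text: str, block_chars: int = 1024) -> str:
    # Prefix each game with the ';' delimiter and fix the length at 1024 characters.
    block = ";" + move_text                    # e.g. ";1.e4 e5 2.Nf3 ..."
    return block[:block_chars].ljust(block_chars)

def split_blocks(blocks, val_frac=0.01, seed=2357):
    # 99% train / 1% validation, shuffled with a fixed seed for reproducibility.
    rng = random.Random(seed)
    shuffled = list(blocks)
    rng.shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_frac))
    return shuffled[n_val:], shuffled[:n_val]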
Training Configuration
large-16 Training Settings
- Batch Size: 32 (micro-batch)
- Gradient Accumulation: 4 steps (effective batch size: 128)
- Learning Rate: 3e-4 with cosine decay to 3e-5
- Warmup: 2000 iterations
- Max Iterations: 600,000
- Optimizer: AdamW (β1=0.9, β2=0.95)
- Dropout: 0.0 (no dropout for pretraining)
- Training Hardware: RTX 3090/4090 GPUs with distributed training support
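For reference, the settings above map onto NanoGPT's training configuration roughly as follows. This is a sketch: variable names follow NanoGPT's train.py, the file name is hypothetical, and any value not listed above is an assumption.

# config/train_chessgpt_large16.py -- hypothetical config file name
n_layer = 16
n_head = 16
n_embd = 1024
block_size = 1023
dropout = 0.0
bias = False
batch_size = 32                     # micro-batch size
gradient_accumulation_steps = 4     # effective batch size 128
learning_rate = 3e-4
min_lr = 3e-5                       # cosine decay floor
warmup_iters = 2000
max_iters = 600000
lr_decay_iters = 600000             # assumption: decay over the full run
beta1 = 0.9
beta2 = 0.95
dtype = 'bfloat16'                  # falls back to float16 on GPUs without bf16 support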
Usage
Loading the Model
import torch
from model import GPT, GPTConfig
# Load large-16 configuration
config = GPTConfig(
block_size=1023,
n_layer=16,
n_head=16,
n_embd=1024,
dropout=0.0,
bias=False,
vocab_size=32
)
# Initialize and load model
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
state_dict = checkpoint['model']
# Checkpoints saved from a torch.compile'd model may prefix keys with '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith('_orig_mod.'):
        state_dict[k[len('_orig_mod.'):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
# Move to GPU for practical inference (recommended)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code ...
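The tokenization and generation step above is elided; here is a minimal sketch, assuming the character-level vocabulary is available as a NanoGPT-style meta.pkl (the file name and location are assumptions, and the sampling parameters are arbitrary):

import pickle

# Assumption: a meta.pkl with the character <-> index mappings was saved by the
# data preparation script alongside the training data.
with open('meta.pkl', 'rb') as f:
    meta = pickle.load(f)
stoi, itos = meta['stoi'], meta['itos']

# Encode the prompt character by character into the 32-token vocabulary
idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long, device=device)

# Sample a continuation with NanoGPT's built-in GPT.generate helper
out = model.generate(idx, max_new_tokens=20, temperature=0.8, top_k=10)
print(''.join(itos[i] for i in out[0].tolist()))  # prompt plus sampled moves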
Input Format
All models expect properly tokenized chess games:
- Must start with ";" delimiter
- Standard PGN algebraic notation
- Prompts formatted like the training data's 1024-character blocks work best (a validation helper sketch follows below)
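A small helper for sanity-checking prompts against these constraints (a sketch; the one-character-per-token assumption follows from the 32-token character-level vocabulary):

def validate_prompt(prompt: str, max_chars: int = 1023) -> str:
    # Enforce the leading ';' game-start delimiter expected by the models.
    if not prompt.startswith(';'):
        prompt = ';' + prompt
    # Keep the prompt within the 1023-token context window (one character per token).
    if len(prompt) > max_chars:
        raise ValueError(f"prompt exceeds the {max_chars}-character context window")
    return prompt

print(validate_prompt("1.e4 e5 2.Nf3 Nc6"))  # -> ";1.e4 e5 2.Nf3 Nc6"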
Performance Characteristics
The large-16 model demonstrates:
- Superior Chess Reasoning: Advanced understanding of tactical and strategic patterns
- High-Quality Planning: Excellent long-term game planning capabilities
- Pattern Recognition: Enhanced recognition across diverse chess positions
- Substantial Scale: 202.5M parameters (2.3GB checkpoint)
- Optimal Architecture: 16 layers, 16 heads, 1024 embedding dimension
- Near-Expert Performance: strongest variant in the series (validation loss 0.2578), with potential for expert-level chess understanding
Model Series & Ablation Studies
This repository represents extensive research into scaling transformer models for chess. Our complete series includes:
Parameter Scaling Ablations
| Model Variant | Parameters | Layers | Heads | Embedding Dim | Model Size | Val Loss | Key Characteristics |
|---|---|---|---|---|---|---|---|
| small-8 | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline |
| small-16 | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study |
| small-24 | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model |
| small-36 | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth |
| medium-12 | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium |
| medium-16 | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium |
| large-16 | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | Primary model |
Dataset Comparison Studies
| Model | Dataset | Source | Size | Characteristics |
|---|---|---|---|---|
| All Stockfish models | Stockfish | Engine games | 4.5GB | Optimal play patterns |
| Lichess model | Lichess | Human games | 6GB | Human decision patterns |
Key Research Findings
- Depth vs Width Trade-offs: Small models (512 embedding dim, 8 heads) scale from 25.7M to 113.8M parameters purely through depth (8 → 36 layers)
- Clear Performance Scaling: Validation loss improves consistently with depth: 0.2944 (8 layers) → 0.2583 (36 layers)
- Architecture Variations: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
- Parameter Efficiency: small-36 (113.8M) and medium-16 (114.1M) reach nearly identical parameter counts through different architectures (depth vs. width)
- No Overfitting: validation loss was still improving at 600k iterations across all models, indicating remaining learning potential
- Dataset Impact: Significant behavioral differences between engine vs. human training data
Evaluation Metrics
Models should be evaluated on:
- Move Legality: Percentage of generated moves that are legal
- Game Continuation: Quality and coherence of extended game sequences
- Tactical Recognition: Ability to identify tactical patterns and combinations
- Strategic Understanding: Long-term positional planning and evaluation
- Opening Knowledge: Familiarity with established opening theory
- Endgame Technique: Performance in simplified positions
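For example, move legality can be scored by replaying a generated transcript with a chess library. Below is a sketch using python-chess (an external dependency, not part of this repository); it stops scoring at the first illegal move.

import re
import chess  # python-chess (pip install chess)

def legal_move_fraction(transcript: str) -> float:
    # Strip the leading ';' and the move numbers ("1.", "2.", ...) to obtain bare SAN moves.
    moves = re.sub(r'\d+\.', ' ', transcript.lstrip(';')).split()
    board = chess.Board()
    legal = 0
    for san in moves:
        try:
            board.push_san(san)          # raises ValueError on illegal or unparsable moves
            legal += 1
        except ValueError:
            break                        # stop at the first illegal move
    return legal / max(1, len(moves))

print(legal_move_fraction(";1.e4 e5 2.Nf3 Nc6 3.Bb5 a6"))  # -> 1.0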
Intended Use
Primary Use Cases
- Chess Analysis: High-quality position evaluation and move suggestion
- Research: Studying emergent reasoning in language models
- Education: Chess learning and pattern recognition tools
- AI Development: Baseline for chess AI systems
Limitations
- Specialized for chess gameplay only
- Limited to standard chess rules and notation
- Requires proper tokenization format
- GPU recommended for practical inference
- May not generalize beyond chess domain
Alternative Model Variants
For Different Use Cases:
- Fast Inference: Use small-8 for minimal resource requirements
- Depth vs Width: Compare small-16/24/36 for layer depth ablations
- Balanced Performance: Use medium-12 or medium-16 for mid-range applications
- Maximum Performance: Use large-16 for best overall results
- Human Behavior Studies: Use lichess model for human-like gameplay patterns
Computational Requirements:
- Small Models (8-36 layers): CPU inference possible, GPU recommended
- Medium Models: GPU recommended for practical use
- Large Model: Single high-end GPU required
Technical Implementation
Model Architecture
Based on GPT-2 with chess-specific adaptations:
- Vocabulary: Reduced to 32 chess-specific tokens
- Context: Optimized for 1023-token chess game sequences
- Training: Custom data loading for chess game blocks
- Framework: Built on NanoGPT for simplicity and efficiency
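The 202.5M figure can be sanity-checked with a back-of-the-envelope count of the GPT-2 weights. This is a rough estimate that ignores LayerNorm parameters and assumes weight tying between the input and output embeddings:

n_layer, n_embd, vocab_size, block_size = 16, 1024, 32, 1023
per_layer = 12 * n_embd ** 2                             # attention (4*d^2) + MLP (8*d^2), bias=False
embeddings = vocab_size * n_embd + block_size * n_embd   # token + learned position embeddings
total = n_layer * per_layer + embeddings
print(f"~{total / 1e6:.1f}M parameters")                 # ~202.4M, consistent with the reported 202.5M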
Training Insights
- Convergence: Smooth training curves across all scales
- Memory Efficiency: Optimized for multi-GPU training
- Data Processing: Custom tokenization preserving chess structure
- Evaluation: Chess-specific validation metrics
Ethical Considerations
- Models trained exclusively on chess data pose minimal ethical risks
- No personal data or sensitive information in training datasets
- Intended for educational, research, and recreational purposes
- Computational requirements may limit accessibility
- Models do not generalize beyond chess domain
Citation
If you use ChessGPT-2 in your research, please cite:
@misc{chessgpt,
title={ChessGPT-2},
author={[Your Name]},
year={2024},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/[your-username]/chessgpt-2}
}
@dataset{chess_games_dataset,
title={Chess Games Dataset},
author={Adam Karvonen},
year={2024},
url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}
References
- NanoGPT: karpathy/nanoGPT
- Chess Dataset: @adamkarvonen/chess_games
- GPT-2 Paper: Radford et al., 2019
- Scaling Laws: Kaplan et al., 2020
License
[MIT, Apache 2.0]