ChessGPT-2

Model Description

ChessGPT-2 is a series of transformer language models trained on chess game data to show that language models can learn strategic reasoning from chess gameplay. This repository presents large-16 as the best model in the series.

The large-16 model is a GPT-2 architecture with about 200 million parameters, trained on engine-generated chess games for move prediction, strategic analysis, and chess reasoning.

Model Details

large-16 (Primary Model)

  • Model Type: Autoregressive Transformer Language Model (GPT-2 architecture)
  • Parameters: ~200 million
  • Architecture:
    • Layers: 16
    • Attention Heads: 16
    • Embedding Dimension: 1024
    • Context Length: 1023 tokens
    • Vocabulary Size: 32 tokens (chess-specific vocabulary)
  • Training Framework: NanoGPT (PyTorch)
  • Precision: Mixed precision training (bfloat16/float16)

Training Data

All models were trained on datasets from @adamkarvonen/chess_games:

Primary Dataset: Stockfish Games

  • Dataset: stockfish_dataset_blocks.zip
  • Description: 4.5GB of games in which Stockfish at Elo 3200 plays White against Stockfish opponents ranging from Elo 1300 to 3200 playing Black
  • Format: PGN (Portable Game Notation) games converted to 1024-character blocks
  • Tokenization: Each block begins with a ";" delimiter (e.g., ";1.e4 e5 2.Nf3...")
  • Data Split: 99% training, 1% validation (random split with seed 2357)
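
The block shape and the 99/1 split described above can be sketched as follows. This is a minimal illustration, not the dataset's actual preparation script: `check_block` and `split_blocks` are hypothetical helpers, and the real pipeline may rely on a dataset library's own shuffling, so the exact assignment of blocks to each split will likely differ even with the same seed.

```python
import random

BLOCK_CHARS = 1024  # block length stated in the dataset description


def check_block(block: str) -> bool:
    """Return True if a block has the documented shape: 1024 chars, leading ';'."""
    return len(block) == BLOCK_CHARS and block.startswith(";")


def split_blocks(blocks, val_fraction=0.01, seed=2357):
    """Shuffle blocks with a fixed seed and hold out a validation fraction.

    Returns (train_blocks, val_blocks). Uses Python's own RNG, which is an
    assumption made for this sketch.
    """
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Calling `split_blocks` twice on the same input returns the same split, which is the point of fixing the seed.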

Training Configuration

large-16 Training Settings

  • Batch Size: 32 (micro-batch)
  • Gradient Accumulation: 4 steps (effective batch size: 128)
  • Learning Rate: 3e-4 with cosine decay to 3e-5
  • Warmup: 2000 iterations
  • Max Iterations: 600,000
  • Optimizer: AdamW (β₁=0.9, β₂=0.95)
  • Dropout: 0.0 (no dropout for pretraining)
  • Training Hardware: RTX 3090/4090 GPUs with distributed training support
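
The learning-rate settings above describe a warmup followed by cosine decay. A rough sketch of such a schedule is shown below, modeled on the style of schedule NanoGPT uses. It assumes the warmup is linear and that the decay horizon equals the 600,000 maximum iterations; neither detail is stated in the settings, and `get_lr` is an illustrative name.

```python
import math


def get_lr(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=2000, decay_iters=600_000):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        # Linear ramp from near zero up to max_lr (assumed warmup shape)
        return max_lr * (it + 1) / (warmup_iters + 1)
    if it >= decay_iters:
        return min_lr
    progress = (it - warmup_iters) / (decay_iters - warmup_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


# Sequences contributing to each optimizer step on one device
micro_batch, grad_accum = 32, 4
effective_batch = micro_batch * grad_accum
```

With these defaults the rate peaks at iteration 2000 and reaches the 3e-5 floor at iteration 600,000.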

Usage

Loading the Model

import torch
from model import GPT, GPTConfig

# Load large-16 configuration
config = GPTConfig(
    block_size=1023,
    n_layer=16,
    n_head=16,
    n_embd=1024,
    dropout=0.0,
    bias=False,
    vocab_size=32
)

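# Initialize and load model
model = GPT(config)
checkpoint = torch.load('ckpt.pt', map_location='cpu')
# Checkpoints saved from a torch.compile'd model prefix every key with
# '_orig_mod.'; strip it if present (the same workaround NanoGPT's sample.py uses)
prefix = '_orig_mod.'
state_dict = {
    (k[len(prefix):] if k.startswith(prefix) else k): v
    for k, v in checkpoint['model'].items()
}
model.load_state_dict(state_dict)
model.eval()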

# For GPU inference (recommended)
if torch.cuda.is_available():
    model = model.cuda()

# Generate chess moves (requires proper tokenization)
prompt = ";1.d4 Nf6 2.c4 e6 3.Nc3 Bb4"
# ... tokenization and generation code ...
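
The generation step left open above could look roughly like the loop below. It is a sketch under assumptions, not the repository's own code: `next_logits` is a hypothetical callable standing in for a forward pass of the loaded model (one score per vocabulary entry for the next token), `stop_id` is assumed to be the token id of the space character that ends a move, and decoding is greedy. It also assumes the prompt ends exactly where the next move begins.

```python
from typing import Callable, List, Sequence


def generate_move(
    prompt_ids: Sequence[int],
    next_logits: Callable[[List[int]], Sequence[float]],
    stop_id: int,
    max_new_tokens: int = 10,
) -> List[int]:
    """Greedily extend the prompt one token at a time until stop_id appears."""
    ids = list(prompt_ids)
    move: List[int] = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # Greedy choice: index of the highest-scoring vocabulary entry
        token = max(range(len(logits)), key=lambda i: logits[i])
        if token == stop_id:
            break
        ids.append(token)
        move.append(token)
    return move
```

With the PyTorch model, `next_logits` would wrap a call to the model on the current ids and return the scores for the last position. Sampling with a temperature instead of taking the maximum is a common alternative when more varied play is wanted.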

Input Format

All models expect properly tokenized chess games:

  • Must start with ";" delimiter
  • Standard PGN algebraic notation
  • 1024-character blocks for optimal performance
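
A character-level tokenizer matching the input rules above might look like the sketch below. The alphabet is an assumption: it is one sorted set of 32 characters that covers ordinary PGN move text, chosen to match the stated vocabulary size. The real character-to-id mapping ships with the training data and may differ, so `CHARS`, `encode`, and `decode` are illustrative only.

```python
from typing import Iterable, List

# ASSUMPTION: a sorted 32-character alphabet covering PGN move text.
# The actual mapping used in training may differ in content or order.
CHARS = " #+-.0123456789;=BKNOQRabcdefghx"
STOI = {ch: i for i, ch in enumerate(CHARS)}
ITOS = {i: ch for ch, i in STOI.items()}


def encode(text: str) -> List[int]:
    """Map a game transcript to token ids, enforcing the ';' delimiter."""
    if not text.startswith(";"):
        raise ValueError("input must start with the ';' delimiter")
    unknown = sorted(set(text) - set(STOI))
    if unknown:
        raise ValueError(f"characters outside the vocabulary: {unknown}")
    return [STOI[ch] for ch in text]


def decode(ids: Iterable[int]) -> str:
    """Map token ids back to text."""
    return "".join(ITOS[i] for i in ids)
```

Rejecting unknown characters up front avoids silently feeding the model ids it never saw during training, such as annotation marks like "!" or "?".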

Performance Characteristics

The large-16 model reaches the lowest validation loss in the series (0.2578) and is characterized by:

  • Chess Reasoning: Understanding of tactical and strategic patterns
  • Planning: Long-term game planning capabilities
  • Pattern Recognition: Recognition across diverse chess positions
  • Scale: 202.5M parameters in a 2.3GB model file
  • Architecture: 16 layers, 16 heads, 1024 embedding dimension
  • Playing Strength: Potential for expert-level chess understanding

Model Series & Ablation Studies

This repository represents extensive research into scaling transformer models for chess. Our complete series includes:

Parameter Scaling Ablations

| Model Variant | Parameters | Layers | Heads | Embedding | Model Size | Val Loss | Key Characteristics |
|---|---|---|---|---|---|---|---|
| small-8 | 25.7M | 8 | 8 | 512 | 294MB | 0.2944 | Compact baseline |
| small-16 | 50.9M | 16 | 8 | 512 | 582MB | 0.2725 | Depth scaling study |
| small-24 | 76.1M | 24 | 8 | 512 | 871MB | 0.2628 | Deep narrow model |
| small-36 | 113.8M | 36 | 8 | 512 | 1.3GB | 0.2583 | Maximum depth |
| medium-12 | 85.8M | 12 | 12 | 768 | 982MB | 0.2652 | Balanced medium |
| medium-16 | 114.1M | 16 | 12 | 768 | 1.3GB | 0.2608 | Deeper medium |
| large-16 | 202.5M | 16 | 16 | 1024 | 2.3GB | 0.2578 | Primary model |
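
The parameter counts in this table can be approximated from layers and embedding width alone. The sketch below assumes a bias-free GPT-2 (matching `bias=False` in the loading snippet), tied input and output embeddings, learned position embeddings counted in the total, a 32-token vocabulary, and a 1023-token context. Under those assumptions the estimates land within about 0.1M of every reported figure; `gpt_param_count` is an illustrative helper, not part of the repository.

```python
def gpt_param_count(n_layer, n_embd, vocab_size=32, block_size=1023):
    """Approximate parameter count for a bias-free GPT-2 with tied embeddings."""
    attn = 4 * n_embd * n_embd   # QKV projection (3 d^2) plus output projection (d^2)
    mlp = 8 * n_embd * n_embd    # d -> 4d and 4d -> d linear layers
    norms = 2 * n_embd           # two LayerNorm weight vectors per block
    per_layer = attn + mlp + norms
    embeddings = (vocab_size + block_size) * n_embd  # token and position tables
    final_norm = n_embd
    return n_layer * per_layer + embeddings + final_norm


# (layers, embedding width) for each variant in the table
VARIANTS = {
    "small-8": (8, 512),
    "small-16": (16, 512),
    "small-24": (24, 512),
    "small-36": (36, 512),
    "medium-12": (12, 768),
    "medium-16": (16, 768),
    "large-16": (16, 1024),
}

for name, (n_layer, n_embd) in VARIANTS.items():
    print(f"{name}: {gpt_param_count(n_layer, n_embd) / 1e6:.1f}M")
```

The dominant term is 12 × layers × width², which is identical for small-36 (36 × 512²) and medium-16 (16 × 768²). That is why those two variants end up with nearly the same size.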

Dataset Comparison Studies

| Model | Dataset | Source | Size | Characteristics |
|---|---|---|---|---|
| All Stockfish Models | Stockfish | Engine games | 4.5GB | Optimal play patterns |
| Lichess Model | Lichess | Human games | 6GB | Human decision patterns |

Key Research Findings

  1. Depth vs Width Trade-offs: Small models (512 emb, 8 heads) scale from 25.7M→113.8M parameters purely through depth (8→36 layers)
  2. Clear Performance Scaling: Validation loss improves consistently with depth: 0.2944 (8-layer) → 0.2583 (36-layer)
  3. Architecture Variations: Medium models explore width scaling (768 emb, 12 heads) vs small models' depth scaling
  4. Parameter Efficiency: small-36 (113.8M) achieves similar parameter count to medium-16 (114.1M) via different architectures
  5. No Overfitting: All models trained to 600k iterations show continued learning potential
  6. Dataset Impact: Significant behavioral differences between engine vs. human training data
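
To put the validation losses in these findings on a more intuitive scale, the sketch below converts them to per-token perplexity and to relative improvement over the small-8 baseline. It assumes the reported loss is a mean cross-entropy in nats per token, which is PyTorch's default for cross-entropy but is not stated explicitly here. The helper names are illustrative.

```python
import math

# Validation losses as reported in the scaling table
VAL_LOSS = {
    "small-8": 0.2944,
    "small-16": 0.2725,
    "small-24": 0.2628,
    "small-36": 0.2583,
    "medium-12": 0.2652,
    "medium-16": 0.2608,
    "large-16": 0.2578,
}


def perplexity(loss_nats: float) -> float:
    """Per-token perplexity for a mean cross-entropy given in nats."""
    return math.exp(loss_nats)


def relative_improvement(baseline: float, loss: float) -> float:
    """Fractional reduction in loss relative to a baseline."""
    return (baseline - loss) / baseline


for name, loss in VAL_LOSS.items():
    gain = relative_improvement(VAL_LOSS["small-8"], loss)
    print(f"{name}: perplexity {perplexity(loss):.4f}, {gain:.1%} below small-8")
```

Read this way, going from 8 to 36 layers at width 512 cuts the loss by roughly 12%, while small-36 and medium-16 differ by only 0.0025 at nearly the same parameter count.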

Evaluation Metrics

Models should be evaluated on:

  • Move Legality: Percentage of generated moves that are legal
  • Game Continuation: Quality and coherence of extended game sequences
  • Tactical Recognition: Ability to identify tactical patterns and combinations
  • Strategic Understanding: Long-term positional planning and evaluation
  • Opening Knowledge: Familiarity with established opening theory
  • Endgame Technique: Performance in simplified positions
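
As a cheap first step toward the move-legality metric, generated text can be checked for well-formed move notation before any board logic runs. The sketch below only tests syntax: a move such as "Nf6" passes whether or not a knight can actually reach f6. True legality requires replaying the game on a board, for example with a chess library. The pattern and helper names are assumptions made for illustration.

```python
import re

# Syntactic shape of a Standard Algebraic Notation move (no legality check)
SAN_PATTERN = re.compile(
    r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$"
)


def extract_moves(transcript: str):
    """Split ';1.e4 e5 2.Nf3 ...' into ['e4', 'e5', 'Nf3', ...]."""
    moves = []
    for token in transcript.lstrip(";").split():
        move = re.sub(r"^\d+\.", "", token)  # drop a leading move number
        if move:
            moves.append(move)
    return moves


def well_formed_rate(moves) -> float:
    """Fraction of moves that match the SAN pattern (0.0 for an empty list)."""
    if not moves:
        return 0.0
    return sum(bool(SAN_PATTERN.match(m)) for m in moves) / len(moves)
```

A low well-formed rate signals a tokenization or prompting problem early, before the more expensive legality check is worth running.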

Intended Use

Primary Use Cases

  • Chess Analysis: High-quality position evaluation and move suggestion
  • Research: Studying emergent reasoning in language models
  • Education: Chess learning and pattern recognition tools
  • AI Development: Baseline for chess AI systems

Limitations

  • Specialized for chess gameplay only
  • Limited to standard chess rules and notation
  • Requires proper tokenization format
  • GPU recommended for practical inference
  • May not generalize beyond chess domain

Alternative Model Variants

For Different Use Cases:

  • Fast Inference: Use small-8 for minimal resource requirements
  • Depth vs Width: Compare small-16/24/36 for layer depth ablations
  • Balanced Performance: Use medium-12 or medium-16 for mid-range applications
  • Maximum Performance: Use large-16 for best overall results
  • Human Behavior Studies: Use lichess model for human-like gameplay patterns

Computational Requirements:

  • Small Models (8-36 layers): CPU inference possible, GPU recommended
  • Medium Models: GPU recommended for practical use
  • Large Model: Single high-end GPU required
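
When sizing hardware, it helps to separate checkpoint size from the memory the weights alone need. The "Model Size" figures in the scaling table are consistent with roughly 12 bytes per parameter, which is what a training checkpoint holds if it stores float32 weights together with AdamW's two moment buffers. That reading is an inference from the numbers, not something stated in this card, and the helpers below are illustrative.

```python
def checkpoint_bytes(n_params, bytes_per_value=4, optimizer_copies=2):
    """Weights plus optimizer state, each stored at bytes_per_value per parameter.

    ASSUMPTION: two optimizer buffers per parameter, as AdamW keeps.
    """
    return n_params * bytes_per_value * (1 + optimizer_copies)


def weights_bytes(n_params, bytes_per_value=4):
    """Memory for the weights alone (use bytes_per_value=2 for half precision)."""
    return n_params * bytes_per_value


n_large = 202.5e6  # large-16 parameter count from the table
print(f"checkpoint: {checkpoint_bytes(n_large) / 2**30:.2f} GiB")
print(f"fp32 weights: {weights_bytes(n_large) / 2**20:.0f} MiB")
print(f"fp16 weights: {weights_bytes(n_large, 2) / 2**20:.0f} MiB")
```

If the assumption holds, inference needs only a fraction of the file size in memory, since the optimizer state can be dropped after loading.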

Technical Implementation

Model Architecture

Based on GPT-2 with chess-specific adaptations:

  • Vocabulary: Reduced to 32 chess-specific tokens
  • Context: Optimized for 1023-token chess game sequences
  • Training: Custom data loading for chess game blocks
  • Framework: Built on NanoGPT for simplicity and efficiency

Training Insights

  • Convergence: Smooth training curves across all scales
  • Memory Efficiency: Optimized for multi-GPU training
  • Data Processing: Custom tokenization preserving chess structure
  • Evaluation: Chess-specific validation metrics

Ethical Considerations

  • Models trained exclusively on chess data pose minimal ethical risks
  • No personal data or sensitive information in training datasets
  • Intended for educational, research, and recreational purposes
  • Computational requirements may limit accessibility
  • Models do not generalize beyond chess domain

Citation

If you use ChessGPT-2 in your research, please cite:

@misc{chessgpt,
  title={ChessGPT-2},
  author={[Your Name]},
  year={2024},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/[your-username]/chessgpt-2}
}

@dataset{chess_games_dataset,
  title={Chess Games Dataset},
  author={Adam Karvonen},
  year={2024},
  url={https://huggingface.co/datasets/adamkarvonen/chess_games}
}

License

[MIT, Apache 2.0]

