ChessFormer-RL

ChessFormer-RL is an experimental checkpoint from an attempt to train chess models with reinforcement learning. Note: this model is actually the 8th supervised learning checkpoint (49,152 steps), which was intended as the initialization for RL training; the full RL training run encountered instabilities.

Model Description

  • Model type: Transformer for chess (RL training initialization)
  • Language(s): Chess (FEN notation)
  • License: MIT
  • Parameters: 100.7M

Important Notice

⚠️ This model represents a research checkpoint rather than a completed RL-trained model. The actual reinforcement learning training encountered:

  • Gradient norm explosion
  • Noisy reward signals
  • Performance degradation from this initialization point

This checkpoint is provided for researchers interested in:

  • RL training initialization strategies
  • Comparative analysis with the final SL model
  • Continuing RL experiments with improved methods

Architecture

Identical to ChessFormer-SL:

  • Blocks: 20 transformer layers
  • Hidden size: 640
  • Attention heads: 8
  • Intermediate size: 1728
  • Features: RMSNorm, SwiGLU activation, custom FEN tokenizer
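As a sanity check, the 100.7M figure can be roughly reproduced from these hyperparameters. The per-layer formulas below (attention = 4·d², SwiGLU MLP = 3·d·d_ff) are the standard ones for this architecture, but the exact embedding/head breakdown is an assumption of mine, not taken from the card:

```python
d_model, d_ff, n_layers = 640, 1728, 20

attn_params = 4 * d_model * d_model   # Q, K, V, and output projections
mlp_params = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
blocks = n_layers * (attn_params + mlp_params)

policy_head = 1969 * d_model          # 1,969-move action space (see Research Context)
estimate = blocks + policy_head

print(f"estimate: {estimate / 1e6:.1f}M")  # -> estimate: 100.4M
```

The small remainder up to the reported 100.7M is plausibly embeddings, norm parameters, and the value head.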

Training Details

Phase 1: Supervised Learning (This Checkpoint)

  • Dataset: kaupane/lichess-2023-01-stockfish-annotated (depth18 split)
  • Training: 49,152 steps of supervised learning on Stockfish evaluations
  • Purpose: Initialization for subsequent RL training

Phase 2: Reinforcement Learning (Attempted)

  • Method: Self-play with Proximal Policy Optimization (PPO)
  • Environment: Batch chess environment with sparse terminal rewards
  • Outcome: Training instabilities led to performance degradation
  • Current Status: Requires further research and improved RL methodology
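A minimal sketch of the sparse terminal reward scheme described above, using python-chess (already listed in the installation step). The exact +1 / -1 / 0 values are an assumption; the card only states that rewards were sparse and terminal-only:

```python
import chess

def terminal_reward(board: chess.Board, agent_color: chess.Color) -> float:
    """Sparse terminal reward: zero until the game ends, then +1 / -1 / 0.
    The exact values are an assumption; the card only states that rewards
    were sparse and terminal-only."""
    if not board.is_game_over():
        return 0.0
    winner = board.outcome().winner    # None on a draw
    if winner is None:
        return 0.0
    return 1.0 if winner == agent_color else -1.0

# Example: fool's mate, so Black wins on move two
board = chess.Board()
for san in ["f3", "e5", "g4", "Qh4#"]:
    board.push_san(san)
print(terminal_reward(board, chess.BLACK))  # -> 1.0
```

Because every non-terminal position returns 0.0, a full game contributes exactly one informative scalar, which is why the card describes the resulting learning signal as noisy.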

Training Metrics (This Checkpoint)

  • Action Loss: 1.8329
  • Value Loss: 0.0501
  • Invalid Loss: 0.0484

Performance

As an intermediate SL checkpoint, this model exhibits:

  • Capabilities comparable to early-stage ChessFormer-SL training
  • Less refined play than the final SL model
  • Suitability as an initialization point for RL experiments

Comparison with ChessFormer-SL

| Metric       | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------------|---------------------------|----------------------------|
| Action Loss  | 1.8329                    | 1.6985                     |
| Value Loss   | 0.0501                    | 0.0407                     |
| Invalid Loss | 0.0484                    | 0.0303                     |
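For scale, the relative gaps between the two checkpoints follow directly from the reported numbers (pure arithmetic, no further assumptions):

```python
ckpt8 = {"action": 1.8329, "value": 0.0501, "invalid": 0.0484}
ckpt20 = {"action": 1.6985, "value": 0.0407, "invalid": 0.0303}

for name in ckpt8:
    drop = 100 * (ckpt8[name] - ckpt20[name]) / ckpt8[name]
    print(f"{name} loss: {drop:.1f}% lower at the final SL checkpoint")
# -> action loss: 7.3% lower at the final SL checkpoint
# -> value loss: 18.8% lower at the final SL checkpoint
# -> invalid loss: 37.4% lower at the final SL checkpoint
```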

Research Context

RL Training Challenges Encountered

  1. Gradient Instability: Explosive gradient norms during PPO updates
  2. Sparse Rewards: Terminal-only rewards created noisy learning signals
  3. Action Space Complexity: 1,969 possible moves created exploration challenges
  4. Self-Play Dynamics: Unstable opponent strength during training
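Of these, the gradient instability at least has a standard mitigation: clipping the global gradient norm before each PPO update, as `torch.nn.utils.clip_grad_norm_` does. A dependency-free sketch of the idea, with an illustrative `max_norm` (the card does not say which, if any, clipping threshold was used):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads in place so their global L2 norm is at most max_norm,
    mirroring what torch.nn.utils.clip_grad_norm_ does. Toy sketch on a
    flat list; a real trainer clips full parameter tensors."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads[:] = [g * scale for g in grads]
    return total_norm

grads = [3.0, 4.0]                           # global norm 5.0: "exploding"
pre_clip = clip_grad_norm(grads, max_norm=1.0)
print(pre_clip)                              # -> 5.0
print(math.sqrt(sum(g * g for g in grads)))  # ~1.0 after clipping
```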

Usage

Installation

pip install torch transformers huggingface_hub chess
# Download model.py from this repository

Loading the Model

import torch
from model import ChessFormerModel

# Load model
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()

# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL
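The card does not document the inference API beyond `from_pretrained`. Assuming the model emits logits over the 1,969-move action space mentioned under Research Context, a common pattern is to mask illegal moves before taking the argmax; `mask_and_pick` below is a hypothetical helper illustrating only that masking step:

```python
import math

def mask_and_pick(logits, legal_indices):
    """Greedy move selection over legal actions only: illegal moves are
    masked to -inf before the argmax."""
    legal = set(legal_indices)
    masked = [x if i in legal else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy 4-move "action space"; a real call would use 1,969 logits and the
# legal-move indices derived from the current FEN via python-chess.
logits = [0.1, 2.5, -0.3, 1.7]
print(mask_and_pick(logits, legal_indices=[0, 3]))  # -> 3
```

Masking matters here because the checkpoint still carries a nonzero Invalid Loss, i.e. it occasionally assigns probability mass to illegal moves.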

For RL Research

# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer

# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)

Limitations

Technical Limitations

  • Incomplete Training: Represents intermediate rather than final model
  • RL Instabilities: Subsequent RL training was unsuccessful
  • Performance: Lower quality than ChessFormer-SL final checkpoint

Research Limitations

  • Demonstrates challenges rather than solutions for chess RL
  • Requires significant additional work for competitive performance
  • Not suitable for production use

Intended Use

This model is specifically intended for:

  • βœ… RL research and experimentation
  • βœ… Studying initialization strategies for chess RL
  • βœ… Comparative analysis of SL vs RL training trajectories
  • βœ… Educational purposes in understanding RL challenges

Not intended for:

  • ❌ Practical chess playing applications
  • ❌ Production chess engines
  • ❌ Competitive chess analysis

Additional Information

This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future research directions.
