ChessFormer-RL

ChessFormer-RL is an experimental checkpoint from an attempt to train chess models with reinforcement learning. Note: this model is actually the 8th supervised learning checkpoint (49,152 steps), which was intended as the initialization for RL training; the full RL training run encountered instabilities.

Model Description

  • Model type: Transformer for chess (RL training initialization)
  • Language(s): Chess (FEN notation)
  • License: MIT
  • Parameters: 100.7M

Important Notice

⚠️ This model represents a research checkpoint rather than a completed RL-trained model. The actual reinforcement learning training encountered:

  • Gradient norm explosion
  • Noisy reward signals
  • Performance degradation from this initialization point

This checkpoint is provided for researchers interested in:

  • RL training initialization strategies
  • Comparative analysis with the final SL model
  • Continuing RL experiments with improved methods

Architecture

Identical to ChessFormer-SL:

  • Blocks: 20 transformer layers
  • Hidden size: 640
  • Attention heads: 8
  • Intermediate size: 1728
  • Features: RMSNorm, SwiGLU activation, custom FEN tokenizer
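As a sanity check, the 100.7M figure can be roughly reproduced from these hyperparameters. The per-layer formulas below (attention = 4·d², SwiGLU MLP = 3·d·d_ff) are the standard ones for this architecture, but the exact embedding/head breakdown is an assumption of mine, not taken from the card:

```python
d_model, d_ff, n_layers = 640, 1728, 20

attn_params = 4 * d_model * d_model   # Q, K, V, and output projections
mlp_params = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
blocks = n_layers * (attn_params + mlp_params)

policy_head = 1969 * d_model          # 1,969-move action space (see Research Context)
estimate = blocks + policy_head

print(f"estimate: {estimate / 1e6:.1f}M")  # -> estimate: 100.4M
```

The small remainder up to the reported 100.7M is plausibly embeddings, norm parameters, and the value head.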

Training Details

Phase 1: Supervised Learning (This Checkpoint)

  • Dataset: kaupane/lichess-2023-01-stockfish-annotated (depth18 split)
  • Training: 49,152 steps of supervised learning on Stockfish evaluations
  • Purpose: Initialization for subsequent RL training

Phase 2: Reinforcement Learning (Attempted)

  • Method: Self-play with Proximal Policy Optimization (PPO)
  • Environment: Batch chess environment with sparse terminal rewards
  • Outcome: Training instabilities led to performance degradation
  • Current Status: Requires further research and improved RL methodology
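A minimal sketch of the sparse terminal reward scheme described above, using python-chess (already listed in the installation step). The exact +1 / -1 / 0 values are an assumption; the card only states that rewards were sparse and terminal-only:

```python
import chess

def terminal_reward(board: chess.Board, agent_color: chess.Color) -> float:
    """Sparse terminal reward: zero until the game ends, then +1 / -1 / 0.
    The exact values are an assumption; the card only states that rewards
    were sparse and terminal-only."""
    if not board.is_game_over():
        return 0.0
    winner = board.outcome().winner    # None on a draw
    if winner is None:
        return 0.0
    return 1.0 if winner == agent_color else -1.0

# Example: fool's mate, so Black wins on move two
board = chess.Board()
for san in ["f3", "e5", "g4", "Qh4#"]:
    board.push_san(san)
print(terminal_reward(board, chess.BLACK))  # -> 1.0
```

Because every non-terminal position returns 0.0, a full game contributes exactly one informative scalar, which is why the card describes the resulting learning signal as noisy.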

Training Metrics (This Checkpoint)

  • Action Loss: 1.8329
  • Value Loss: 0.0501
  • Invalid Loss: 0.0484

Performance

As an intermediate SL checkpoint, this model exhibits:

  • Capabilities comparable to early-stage ChessFormer-SL training
  • Less refined play than the final SL model
  • Suitability as an initialization point for RL experiments

Comparison with ChessFormer-SL

| Metric       | ChessFormer-RL (8th ckpt) | ChessFormer-SL (20th ckpt) |
|--------------|---------------------------|----------------------------|
| Action Loss  | 1.8329                    | 1.6985                     |
| Value Loss   | 0.0501                    | 0.0407                     |
| Invalid Loss | 0.0484                    | 0.0303                     |
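For scale, the relative gaps between the two checkpoints follow directly from the reported numbers (pure arithmetic, no further assumptions):

```python
ckpt8 = {"action": 1.8329, "value": 0.0501, "invalid": 0.0484}
ckpt20 = {"action": 1.6985, "value": 0.0407, "invalid": 0.0303}

for name in ckpt8:
    drop = 100 * (ckpt8[name] - ckpt20[name]) / ckpt8[name]
    print(f"{name} loss: {drop:.1f}% lower at the final SL checkpoint")
# -> action loss: 7.3% lower at the final SL checkpoint
# -> value loss: 18.8% lower at the final SL checkpoint
# -> invalid loss: 37.4% lower at the final SL checkpoint
```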

Research Context

RL Training Challenges Encountered

  1. Gradient Instability: Explosive gradient norms during PPO updates
  2. Sparse Rewards: Terminal-only rewards created noisy learning signals
  3. Action Space Complexity: 1,969 possible moves created exploration challenges
  4. Self-Play Dynamics: Unstable opponent strength during training
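Of these, the gradient instability at least has a standard mitigation: clipping the global gradient norm before each PPO update, as `torch.nn.utils.clip_grad_norm_` does. A dependency-free sketch of the idea, with an illustrative `max_norm` (the card does not say which, if any, clipping threshold was used):

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads in place so their global L2 norm is at most max_norm,
    mirroring what torch.nn.utils.clip_grad_norm_ does. Toy sketch on a
    flat list; a real trainer clips full parameter tensors."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads[:] = [g * scale for g in grads]
    return total_norm

grads = [3.0, 4.0]                           # global norm 5.0: "exploding"
pre_clip = clip_grad_norm(grads, max_norm=1.0)
print(pre_clip)                              # -> 5.0
print(math.sqrt(sum(g * g for g in grads)))  # ~1.0 after clipping
```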

Usage

Installation

pip install torch transformers huggingface_hub chess
# Download model.py from this repository

Loading the Model

import torch
from model import ChessFormerModel

# Load model
model = ChessFormerModel.from_pretrained("kaupane/ChessFormer-RL")
model.eval()

# This is an intermediate checkpoint - performance will be lower than ChessFormer-SL
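The card does not document the inference API beyond `from_pretrained`. Assuming the model emits logits over the 1,969-move action space mentioned under Research Context, a common pattern is to mask illegal moves before taking the argmax; `mask_and_pick` below is a hypothetical helper illustrating only that masking step:

```python
import math

def mask_and_pick(logits, legal_indices):
    """Greedy move selection over legal actions only: illegal moves are
    masked to -inf before the argmax."""
    legal = set(legal_indices)
    masked = [x if i in legal else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy 4-move "action space"; a real call would use 1,969 logits and the
# legal-move indices derived from the current FEN via python-chess.
logits = [0.1, 2.5, -0.3, 1.7]
print(mask_and_pick(logits, legal_indices=[0, 3]))  # -> 3
```

Masking matters here because the checkpoint still carries a nonzero Invalid Loss, i.e. it occasionally assigns probability mass to illegal moves.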

For RL Research

# This checkpoint can serve as initialization for RL experiments
from train_rl import RLTrainer

# Load checkpoint for RL training continuation
trainer = RLTrainer(
    model=model,
    # ... other hyperparameters
)
trainer.resume("path/to/checkpoint", from_sl_checkpoint=True)

Limitations

Technical Limitations

  • Incomplete Training: Represents intermediate rather than final model
  • RL Instabilities: Subsequent RL training was unsuccessful
  • Performance: Lower quality than ChessFormer-SL final checkpoint

Research Limitations

  • Demonstrates challenges rather than solutions for chess RL
  • Requires significant additional work for competitive performance
  • Not suitable for production use

Intended Use

This model is specifically intended for:

  • βœ… RL research and experimentation
  • βœ… Studying initialization strategies for chess RL
  • βœ… Comparative analysis of SL vs RL training trajectories
  • βœ… Educational purposes in understanding RL challenges

Not intended for:

  • ❌ Practical chess playing applications
  • ❌ Production chess engines
  • ❌ Competitive chess analysis

Additional Information

This model represents ongoing research into chess RL training. While the full RL training was unsuccessful, this checkpoint may serve as a starting point for future research directions.
