ismail - "Is My AI Lame?"

ismail is a from-scratch Turkish language model implementation designed for low-end hardware, built and trained on a single RTX 5070 (12GB). This is my first LLM project, heavily inspired by DeepSeek-V3 and built with guidance from LLMs-from-scratch.

Language Focus: ismail is trained exclusively on Turkish datasets using a custom morphology-aware tokenizer optimized for Turkish's agglutinative structure.

Status: Pretraining is currently ongoing on Turkish text with a single 5070 GPU. This will take a while!

Architecture Highlights

ismail implements several advanced techniques optimized for memory-constrained environments:

  • Multi-Head Latent Attention (MLA): DeepSeek-inspired attention mechanism with LoRA-style compression

    • KV cache compression via low-rank projection (kv_lora_rank: 256 in the current 512-dim config, 512 at full scale)
    • Separate RoPE and non-RoPE attention heads
    • Reduced memory footprint for longer sequences
  • Mixture of Experts (MoE): Efficient sparse expert routing

    • Routed experts: 4-6 experts with top-2 activation
    • Shared experts for common knowledge
    • Sequential expert training for limited VRAM
    • Configurable expert rotation during training
  • YaRN RoPE: Extended context length support

    • Dynamic frequency scaling based on sequence length
    • Smooth interpolation for position embeddings
    • Support for sequences beyond training length
  • Custom Kernels: Triton-based GPU kernels for FP8 quantization (see the quantization sketch after this list)

    • Optimized matrix multiplication
    • Activation and weight quantization
    • Memory-efficient inference
  • Turkish Morphological Tokenizer: Custom hybrid tokenizer designed for Turkish

    • Combines rule-based morphological analysis with BPE
    • Preserves linguistic structure (roots, suffixes, phonological rules)
    • Based on research: "Tokens with Meaning" (arXiv:2508.14292)
    • 32,768 vocabulary size optimized for Turkish
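
To make the quantization idea concrete, here is a minimal per-tensor FP8 (E4M3) quantize/dequantize sketch in plain PyTorch. It is illustrative only, not the repository's kernel.py, and the float8 dtypes require PyTorch 2.1+:

import torch

def fp8_quantize(x: torch.Tensor):
    """Per-tensor symmetric quantization to FP8 E4M3 (max representable ~448)."""
    scale = x.abs().max().clamp(min=1e-12) / 448.0
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) * scale

w = torch.randn(512, 512)
w_fp8, scale = fp8_quantize(w)
print((w - fp8_dequantize(w_fp8, scale)).abs().max())  # quantization error

In practice such scaling is fused into the matmul kernel itself so the dequantized tensor is never materialized.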

Model Configuration

Current Training Config (512-dim model for 12GB GPU):

{
  "vocab_size": 32768,
  "dim": 512,
  "n_layers": 16,
  "n_heads": 12,
  "n_routed_experts": 4,
  "n_activated_experts": 2,
  "max_seq_len": 512,
  "kv_lora_rank": 256
}

Full-Scale Config (1024-dim model):

  • 1024 hidden dimensions
  • 20 layers (3 dense + 17 MoE)
  • 6 routed experts per MoE layer
  • Support for 2048+ token sequences
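
Both configs are plain JSON, so wiring them into a model constructor is straightforward. A minimal sketch, run from Model_Architecture/, assuming an illustrative ModelArgs dataclass (the repository's actual class names may differ):

import json
from dataclasses import dataclass

@dataclass
class ModelArgs:  # illustrative subset of config.json fields
    vocab_size: int = 32768
    dim: int = 512
    n_layers: int = 16
    n_heads: int = 12
    n_routed_experts: int = 4
    n_activated_experts: int = 2
    max_seq_len: int = 512
    kv_lora_rank: int = 256

with open("config.json") as f:
    raw = json.load(f)

# Keep only the architecture fields; ignore training/logging keys.
args = ModelArgs(**{k: v for k, v in raw.items()
                    if k in ModelArgs.__dataclass_fields__})
print(args)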

Project Structure

ismail/
├── Model_Architecture/
│   ├── model.py              # Core model implementation
│   ├── train.py              # Training loop with expert rotation
│   ├── generation.py         # Text generation and sampling
│   ├── data.py               # Dataset and data loading
│   ├── kernel.py             # Custom Triton kernels for FP8
│   ├── config.json           # Model and training configuration
│   └── requirements.txt      # Dependencies
├── LiteratureReview/
│   ├── Deepseek-V3/          # DeepSeek architecture analysis
│   ├── GPT-2/                # GPT-2 baseline implementations
│   ├── Llama/                # Llama 3 architecture study
│   ├── Mistral/              # Mistral architecture analysis
│   └── Qwen3/                # Qwen 3 architecture study
└── turkish_tiktokenizer/     # Custom Turkish morphological tokenizer
    ├── app.py                # Gradio demo interface
    └── README.md             # Tokenizer documentation

Installation

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • CUDA-capable GPU (tested on RTX 5070 12GB)
  • 16GB+ system RAM recommended

Setup

# Clone the repository
git clone https://github.com/yourusername/ismail.git
cd ismail

# Install dependencies
cd Model_Architecture
pip install -r requirements.txt

# Optional: Install W&B for experiment tracking
pip install wandb

# Optional: Install bitsandbytes for 8-bit Adam optimizer
pip install bitsandbytes

Usage

Training

cd Model_Architecture

# Train with default config
python train.py

# Train with custom config
python train.py --config config.json

# Resume from checkpoint
python train.py --resume checkpoints/step_10000.pt
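
The exact checkpoint contents are defined by train.py; the usual PyTorch pattern behind --resume looks like this (the key names are assumptions, and nn.Linear stands in for the real model):

import torch
from torch import nn

model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters())

# Saving: bundle everything needed to resume mid-run.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 10_000}, "step_10000.pt")

# Resuming: restore weights, optimizer state, and the step counter.
ckpt = torch.load("step_10000.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"]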

Training Features:

  • Gradient accumulation for larger effective batch sizes (sketched after this list)
  • Expert rotation for memory-efficient MoE training
  • Mixed precision training (FP32/BF16/FP8)
  • Automatic checkpointing
  • W&B integration for tracking
  • Validation during training
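
The gradient-accumulation loop in generic form (a sketch of the idea, not train.py verbatim):

import torch
from torch import nn

model = nn.Linear(512, 512)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
grad_accum = 8                              # micro-batches per optimizer step

for step in range(32):
    x = torch.randn(16, 512)                # micro-batch of 16 sequences
    loss = model(x).pow(2).mean()           # stand-in loss
    (loss / grad_accum).backward()          # scale so gradients average correctly
    if (step + 1) % grad_accum == 0:        # effective batch = 16 * 8 = 128
        opt.step()
        opt.zero_grad(set_to_none=True)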

Generation

# Generate text
python generation.py --checkpoint checkpoints/latest.pt --prompt "Your prompt here"
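
generation.py implements sampling end to end; the core temperature/top-k step looks roughly like this (a generic sketch, not necessarily the flags generation.py exposes):

import torch

def sample_next(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> int:
    """Pick the next token id from a (vocab_size,) logits vector."""
    logits = logits / max(temperature, 1e-5)
    kth = torch.topk(logits, top_k).values[-1]           # k-th largest logit
    logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

print(sample_next(torch.randn(32768)))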

Configuration

Edit config.json to customize:

  • Model architecture (dimensions, layers, experts)
  • Training hyperparameters (learning rate, batch size)
  • Data paths and tokenizer
  • Logging and checkpointing

Turkish Language Support

ismail uses a custom hybrid tokenizer specifically designed for Turkish:

  • Morphological Awareness: Understands Turkish word structure (roots + suffixes)
  • Efficient Encoding: 32K vocabulary with ~3.5x compression ratio
  • Linguistic Preservation: Maintains grammatical information in token boundaries
  • Research-Based: Implements hybrid approach from arXiv:2508.14292

The tokenizer handles Turkish's rich morphology better than standard BPE, preserving linguistic meaning while maintaining vocabulary efficiency. See turkish_tiktokenizer/README.md for details.
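
For example, "evlerimizde" ("in our houses") decomposes as ev (house) + -ler (plural) + -imiz (our) + -de (locative). The token strings below are illustrative, not the tokenizer's exact vocabulary:

word = "evlerimizde"
morph_tokens = ["ev", "+ler", "+imiz", "+de"]  # root and suffixes preserved
bpe_like = ["ev", "ler", "imi", "zde"]         # a generic BPE may cut across morphemes
print(word, "->", morph_tokens)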

Key Features for Low-End Hardware

  1. Sequential Expert Training: Train one expert at a time to fit in 12GB VRAM
  2. Gradient Checkpointing: Trade compute for memory
  3. 8-bit Optimizer: bitsandbytes Adam optimizer reduces memory by ~40% (see the sketch after this list)
  4. Small Batch Training: Gradient accumulation enables large effective batch sizes
  5. FP8 Inference: Custom kernels for efficient inference
  6. Flexible Configuration: Easy to scale down for smaller GPUs
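
As an example of item 3, swapping in the bitsandbytes 8-bit AdamW is a one-line change (a sketch assuming a CUDA device and bitsandbytes installed):

import torch
from torch import nn
import bitsandbytes as bnb

model = nn.Linear(512, 512).cuda()  # stand-in for the real model

# 8-bit optimizer states instead of torch.optim.AdamW's 32-bit ones
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4)

loss = model(torch.randn(16, 512, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)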

Inspiration & References

This project draws heavily from:

  • DeepSeek-V3: MLA and MoE architecture
  • LLMs-from-scratch: Educational foundation and best practices
  • GPT-2/3: Transformer baseline architecture
  • Llama 3: RoPE and normalization techniques

Technical Details

Multi-Head Latent Attention (MLA)

The MLA mechanism compresses the KV cache using low-rank projections (a minimal sketch follows the list):

  • Query: Standard multi-head projection
  • Key/Value: Compressed via LoRA-style down/up projection
  • Split heads: RoPE-enabled (64d) + Non-RoPE (128d)
  • Memory savings: ~4x reduction in KV cache size
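
A minimal sketch of the compression path, omitting the RoPE/non-RoPE head split (dimensions follow the 512-dim config; illustrative, not model.py):

import torch
from torch import nn

dim, kv_lora_rank, n_heads, head_dim = 512, 256, 12, 128

# Down-project hidden states to a small latent; only this latent is cached.
kv_down = nn.Linear(dim, kv_lora_rank, bias=False)
# Up-project the cached latent back to per-head keys/values at attention time.
k_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
v_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)

x = torch.randn(1, 512, dim)        # (batch, seq, dim)
latent = kv_down(x)                 # cache this: (1, 512, 256) per layer
k = k_up(latent).view(1, 512, n_heads, head_dim)
v = v_up(latent).view(1, 512, n_heads, head_dim)
print(latent.shape, k.shape, v.shape)

Only the small latent is stored per token; full keys and values are re-expanded on the fly at attention time.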

Mixture of Experts (MoE)

  • Top-K routing (K=2) with a learned router (sketched after this list)
  • Shared experts for common features
  • Load balancing loss to prevent expert collapse
  • Sequential training mode for VRAM constraints
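
The routing step in schematic form, without the shared experts or the balancing loss (an illustrative sketch):

import torch
from torch import nn

dim, n_experts, top_k = 512, 4, 2
router = nn.Linear(dim, n_experts, bias=False)
experts = nn.ModuleList([
    nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
    for _ in range(n_experts)
])

x = torch.randn(8, dim)                            # (tokens, dim)
scores = router(x).softmax(dim=-1)                 # learned router
weights, idx = scores.topk(top_k, dim=-1)          # top-2 experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)

out = torch.zeros_like(x)
for e in range(n_experts):
    tok, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
    if tok.numel():
        out[tok] += weights[tok, slot].unsqueeze(-1) * experts[e](x[tok])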

YaRN Positional Encoding

  • Extends context beyond training length
  • Smooth frequency interpolation (sketched after this list)
  • Maintains performance on short sequences
  • Configurable extrapolation factors
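
A loose sketch of the idea: high-frequency (short-wavelength) RoPE bands keep their original rotation speed, low-frequency bands are slowed by the scale factor, and a ramp blends between the two regimes. The parameterization below is illustrative, not YaRN's exact formulation:

import torch

def yarn_inv_freq(head_dim=64, base=10000.0, scale=4.0, orig_len=512,
                  beta_fast=32.0, beta_slow=1.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    wavelen = 2 * torch.pi / inv_freq          # wavelength of each band, in tokens
    low, high = orig_len / beta_fast, orig_len / beta_slow
    ramp = ((wavelen - low) / (high - low)).clamp(0.0, 1.0)
    # ramp = 0: keep original frequency (extrapolate); ramp = 1: interpolate by 1/scale.
    return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp

print(yarn_inv_freq()[:4])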

Current Status & Roadmap

Current:

  • ✅ Core architecture implemented
  • ✅ Training pipeline functional
  • ✅ Custom Turkish morphological tokenizer
  • ✅ Turkish dataset preparation
  • 🔄 Pretraining on Turkish text with single 5070 (ongoing)

Planned:

  • Complete initial pretraining run
  • Evaluation on Turkish benchmarks (TurkishBench, etc.)
  • Fine-tuning pipeline for instruction following
  • Model release (if not too lame!)
  • Multi-GPU training support
  • Inference optimization and quantization

Performance

Training on RTX 5070 (12GB):

  • 512-dim model: ~3.5 tokens/sec with batch_size=16, grad_accum=8 (an effective batch of 128 sequences per optimizer step)
  • Memory usage: ~11.5GB during training
  • Estimated pretraining: Several weeks for 100K steps

Performance will improve significantly with better hardware!

Acknowledgments

Special thanks to:

  • DeepSeek AI for the innovative MLA and MoE architectures
  • Sebastian Raschka for the excellent LLMs-from-scratch educational resource
  • The broader open-source LLM community for making this possible

Contributing

This is primarily a learning project, but suggestions and feedback are welcome! Feel free to open issues or PRs.

Contact

For questions or discussions, please open an issue on GitHub.


Built with determination and limited VRAM 🚀
