theprint-12B-MoE-3A
An experimental Mixture of Experts (MoE) model combining four specialized Llama 3.2 3B fine-tunes into a single 12B parameter model with ~3B active parameters per token.
Overview
This is my first attempt at building an MoE model. Rather than merging random models, I combined four of my own fine-tunes, each with distinct areas of expertise, to create a model that routes queries to the most appropriate specialist.
- Total Parameters: ~12B
- Active Parameters: ~3B per token
- Architecture: Mixtral-style MoE with hidden gate routing
- Base Model: Llama 3.2 3B Instruct
Experts
The model contains four specialized experts:
ReasonableMath - Mathematical reasoning and equation solving
- Handles algebraic problems, equation solving, structured math tasks
- Note: Better with formal equations than raw arithmetic
ReWiz - Logical reasoning and analysis
- Evaluates arguments, analyzes problems step-by-step
- Focuses on logical reasoning rather than computation
VanRossum - Code generation and debugging
- Python programming, algorithm implementation, error debugging
- Named after Guido van Rossum
Empathetic - Emotional intelligence and relationship guidance
- Supportive, empathetic responses for personal/relationship concerns
- Uses validating language and practical advice
How It Works
The model uses a hidden-state router to select which expert handles each query. The router's gate weights were derived from positive prompt patterns for each expert during the merge process, rather than through additional training.
Expected routing behavior:
- Math equations → ReasonableMath
- Logic puzzles → ReWiz
- Code requests → VanRossum
- Personal concerns → Empathetic
- General queries → routes to the closest specialist
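
To see the routing in action, the sketch below peeks at per-token router decisions. It assumes the merged checkpoint loads as a standard transformers Mixtral-style model whose forward pass accepts `output_router_logits` (mergekit's mixtral method normally produces that architecture); treat it as a diagnostic sketch, not part of this model's official API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Diagnostic sketch (assumes a Mixtral-style checkpoint): count which experts
# win the top-1 routing vote for a prompt, across all MoE layers and tokens.
model = AutoModelForCausalLM.from_pretrained(
    "theprint/theprint-12B-MoE-3A",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("theprint/theprint-12B-MoE-3A")

prompt = "Solve this equation: 3x - 7 = 14"  # should lean toward ReasonableMath
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

num_experts = 4  # ReasonableMath, ReWiz, VanRossum, Empathetic (index order follows the merge config)
votes = torch.zeros(num_experts, dtype=torch.long)
for layer_logits in out.router_logits:        # one (tokens, num_experts) tensor per MoE layer
    top_expert = layer_logits.argmax(dim=-1)  # top-1 expert index per token
    votes += torch.bincount(top_expert.flatten().cpu(), minlength=num_experts)

print(votes.tolist())  # rough picture of which experts this prompt activates
```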
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint in bfloat16 and spread it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "theprint/theprint-12B-MoE-3A",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("theprint/theprint-12B-MoE-3A")

# A math-flavored prompt, which should route to the ReasonableMath expert.
prompt = "Solve this equation: 3x - 7 = 14"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
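
Because the experts are built on Llama 3.2 3B Instruct, you will generally get better results by prompting through the chat template. Here is a minimal follow-on sketch that reuses `model` and `tokenizer` from above and assumes the merged tokenizer keeps the Llama 3.2 chat template:

```python
# Reuses `model` and `tokenizer` from the snippet above.
messages = [
    {"role": "user", "content": "Debug this Python line: print(sum([1, 2, '3']))"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (skip the prompt portion).
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```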
GGUF Compatibility
This model uses 4 experts (a power of 2), so it is compatible with llama.cpp for GGUF conversion and local inference. Pre-converted GGUF files are available at theprint/theprint-12B-MoE-3A-GGUF.
Recommended quantizations:
- Q4_K_M - Best balance of quality and size
- Q8_0 - Higher quality, larger files
- F16 - Full precision (if you have VRAM to spare)
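
For local inference, a minimal sketch using the llama-cpp-python bindings is shown below; the GGUF filename is illustrative, so substitute whichever quantization you downloaded from the GGUF repo.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Illustrative filename: point this at whichever quantization you downloaded
# from theprint/theprint-12B-MoE-3A-GGUF.
llm = Llama(
    model_path="theprint-12B-MoE-3A.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available; use 0 for CPU-only
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve this equation: 3x - 7 = 14"}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```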
Known Limitations
This is an experimental first build. Expect quirks:
- Routing isn't perfect - The model sometimes picks unexpected experts for edge cases
- Simple arithmetic may not route optimally - ReasonableMath prefers structured problems
- No explicit general knowledge expert - General queries route to whichever specialist seems closest
- Multi-turn consistency - Expert switching mid-conversation hasn't been extensively tested
Technical Details
- Merge Tool: mergekit (mixtral method)
- Gate Mode: hidden (router trained on hidden states)
- Data Type: bfloat16
- Created: November 2025
Training Details
The individual experts were fine-tuned separately on domain-specific datasets before being combined into this MoE architecture. No additional training was performed on the combined model; routing behavior comes entirely from the mergekit merge, which derives the gates from positive prompt patterns for each expert.
License
Released under Meta's Llama 3.2 Community License.
Acknowledgments
Built using mergekit by Arcee AI.