theprint-12B-MoE-3A
An experimental Mixture of Experts (MoE) model combining four specialized Llama 3.2 3B fine-tunes into a single 12B parameter model with ~3B active parameters per token.
Overview
This is my first attempt at building an MoE model. Rather than merging random models, I combined four of my own fine-tunes, each with distinct areas of expertise, to create a model that routes queries to the most appropriate specialist.
- Total Parameters: ~12B
- Active Parameters: ~3B per token
- Architecture: Mixtral-style MoE with hidden gate routing
- Base Model: Llama 3.2 3B Instruct
Experts
The model contains four specialized experts:
ReasonableMath - Mathematical reasoning and equation solving
- Handles algebraic problems, equation solving, structured math tasks
- Note: Better with formal equations than raw arithmetic
ReWiz - Logical reasoning and analysis
- Evaluates arguments, analyzes problems step-by-step
- Focuses on logical reasoning rather than computation
VanRossum - Code generation and debugging
- Python programming, algorithm implementation, error debugging
- Named after Guido van Rossum
Empathetic - Emotional intelligence and relationship guidance
- Supportive, empathetic responses for personal/relationship concerns
- Uses validating language and practical advice
How It Works
The model uses a hidden-state router to select which expert handles each query. The router's gate weights were derived from positive prompt patterns for each expert during the merge process, rather than through additional training.
Expected routing behavior:
- Math equations → ReasonableMath
- Logic puzzles → ReWiz
- Code requests → VanRossum
- Personal concerns → Empathetic
- General queries → routes to the closest specialist
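
To see the routing in action, the sketch below peeks at per-token router decisions. It assumes the merged checkpoint loads as a standard transformers Mixtral-style model whose forward pass accepts `output_router_logits` (mergekit's mixtral method normally produces that architecture); treat it as a diagnostic sketch, not part of this model's official API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Diagnostic sketch (assumes a Mixtral-style checkpoint): count which experts
# win the top-1 routing vote for a prompt, across all MoE layers and tokens.
model = AutoModelForCausalLM.from_pretrained(
    "theprint/theprint-12B-MoE-3A",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("theprint/theprint-12B-MoE-3A")

prompt = "Solve this equation: 3x - 7 = 14"  # should lean toward ReasonableMath
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

num_experts = 4  # ReasonableMath, ReWiz, VanRossum, Empathetic (index order follows the merge config)
votes = torch.zeros(num_experts, dtype=torch.long)
for layer_logits in out.router_logits:        # one (tokens, num_experts) tensor per MoE layer
    top_expert = layer_logits.argmax(dim=-1)  # top-1 expert index per token
    votes += torch.bincount(top_expert.flatten().cpu(), minlength=num_experts)

print(votes.tolist())  # rough picture of which experts this prompt activates
```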
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint in bfloat16 and spread it across available devices.
model = AutoModelForCausalLM.from_pretrained(
    "theprint/theprint-12B-MoE-3A",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("theprint/theprint-12B-MoE-3A")

# A math-flavored prompt, which should route to the ReasonableMath expert.
prompt = "Solve this equation: 3x - 7 = 14"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
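
Because the experts are built on Llama 3.2 3B Instruct, you will generally get better results by prompting through the chat template. Here is a minimal follow-on sketch that reuses `model` and `tokenizer` from above and assumes the merged tokenizer keeps the Llama 3.2 chat template:

```python
# Reuses `model` and `tokenizer` from the snippet above.
messages = [
    {"role": "user", "content": "Debug this Python line: print(sum([1, 2, '3']))"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (skip the prompt portion).
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```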
GGUF Compatibility
This model uses 4 experts (a power of 2), so it is compatible with llama.cpp for GGUF conversion and local inference. Pre-converted GGUF files are available at theprint/theprint-12B-MoE-3A-GGUF.
Recommended quantizations:
- Q4_K_M - Best balance of quality and size
- Q8_0 - Higher quality, larger files
- F16 - Full precision (if you have VRAM to spare)
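
For local inference, a minimal sketch using the llama-cpp-python bindings is shown below; the GGUF filename is illustrative, so substitute whichever quantization you downloaded from the GGUF repo.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Illustrative filename: point this at whichever quantization you downloaded
# from theprint/theprint-12B-MoE-3A-GGUF.
llm = Llama(
    model_path="theprint-12B-MoE-3A.Q4_K_M.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available; use 0 for CPU-only
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve this equation: 3x - 7 = 14"}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])
```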
Known Limitations
This is an experimental first build. Expect quirks:
- Routing isn't perfect - The model sometimes picks unexpected experts for edge cases
- Simple arithmetic may not route optimally - ReasonableMath prefers structured problems
- No explicit general knowledge expert - General queries route to whichever specialist seems closest
- Multi-turn consistency - Expert switching mid-conversation hasn't been extensively tested
Technical Details
- Merge Tool: mergekit (mixtral method)
- Gate Mode: hidden (router trained on hidden states)
- Data Type: bfloat16
- Created: November 2025
Training Details
The individual experts were fine-tuned separately on domain-specific datasets before being combined into this MoE architecture. No additional training was performed on the combined model; routing behavior comes entirely from the mergekit merge, which derives the gates from positive prompt patterns for each expert.
License
Released under Meta's Llama 3.2 Community License.
Acknowledgments
Built using mergekit by Arcee AI.