theprint-12B-MoE-3A

An experimental Mixture of Experts (MoE) model combining four specialized Llama 3.2 3B fine-tunes into a single 12B parameter model with ~3B active parameters per token.

Overview

This is my first attempt at building an MoE model. Rather than merging random models, I combined four of my own fine-tunes, each with distinct areas of expertise, to create a model that routes queries to the most appropriate specialist.

  • Total Parameters: ~12B
  • Active Parameters: ~3B per token
  • Architecture: Mixtral-style MoE with hidden gate routing
  • Base Model: Llama 3.2 3B Instruct

Experts

The model contains four specialized experts:

  1. ReasonableMath - Mathematical reasoning and equation solving

    • Handles algebraic problems, equation solving, structured math tasks
    • Note: Better with formal equations than raw arithmetic
  2. ReWiz - Logical reasoning and analysis

    • Evaluates arguments, analyzes problems step-by-step
    • Focuses on logical reasoning rather than computation
  3. VanRossum - Code generation and debugging

    • Python programming, algorithm implementation, error debugging
    • Named after Guido van Rossum
  4. Empathetic - Emotional intelligence and relationship guidance

    • Supportive, empathetic responses for personal/relationship concerns
    • Uses validating language and practical advice

How It Works

The model uses a hidden-state router to select which expert handles each token. The router weights were derived during the merge from hidden-state representations of positive prompts for each expert (mergekit's hidden gate mode), rather than through additional gradient training.

Expected routing behavior:

  • Math equations → ReasonableMath
  • Logic puzzles → ReWiz
  • Code requests → VanRossum
  • Personal concerns → Empathetic
  • General queries → Routes to closest specialist
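
To sanity-check routing on your own prompts, you can ask transformers for the per-layer router logits during a forward pass. The sketch below is an illustration, not part of the official card; it assumes the checkpoint loads as a Mixtral-style model whose outputs expose router_logits (as MixtralForCausalLM does).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "theprint/theprint-12B-MoE-3A"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Mixtral-style models return one router-logit tensor per layer,
    # shaped (num_tokens, num_experts), when output_router_logits=True.
    out = model(**inputs, output_router_logits=True)

# Count how often each expert wins the top-1 routing decision,
# aggregated over all layers and prompt tokens.
wins = torch.zeros(model.config.num_local_experts, dtype=torch.long)
for layer_logits in out.router_logits:
    top1 = layer_logits.argmax(dim=-1).flatten().cpu()
    wins += torch.bincount(top1, minlength=wins.numel())
print(wins)  # rough picture of which expert dominates for this prompt

For a code prompt like this one you would expect the VanRossum expert's index to dominate, though per-token, per-layer routing means the split is rarely clean.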

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the full-precision checkpoint; device_map="auto" spreads it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "theprint/theprint-12B-MoE-3A",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("theprint/theprint-12B-MoE-3A")

prompt = "Solve this equation: 3x - 7 = 14"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
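
Because the experts are built on Llama 3.2 3B Instruct, prompting through the tokenizer's chat template may give more reliable routing than raw text. A small variant of the example above, assuming the merged tokenizer keeps the base model's chat template:

# Continues from the snippet above (model and tokenizer already loaded).
messages = [
    {"role": "user", "content": "My partner and I keep arguing about chores. Any advice?"}
]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant header so the model replies
    return_tensors="pt",
).to(model.device)

outputs = model.generate(chat_inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding the reply.
print(tokenizer.decode(outputs[0][chat_inputs.shape[-1]:], skip_special_tokens=True))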

GGUF Compatibility

This model uses 4 experts (a power of 2) and is compatible with llama.cpp for GGUF conversion and local inference. You can find prepared GGUF files at theprint/theprint-12B-MoE-3A-GGUF; a local-inference sketch follows the quantization list below.

Recommended quantizations:

  • Q4_K_M - Best balance of quality and size
  • Q8_0 - Higher quality, larger files
  • F16 - Full precision (if you have VRAM to spare)
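
For local inference on the GGUF files, a minimal llama-cpp-python sketch like the one below should work. The filename glob is an assumption; check the actual file names in theprint/theprint-12B-MoE-3A-GGUF before running.

from llama_cpp import Llama  # pip install llama-cpp-python

# Downloads the matching quant from the GGUF repo on first use.
llm = Llama.from_pretrained(
    repo_id="theprint/theprint-12B-MoE-3A-GGUF",
    filename="*Q4_K_M.gguf",   # glob for the Q4_K_M quantization (assumed naming)
    n_ctx=4096,                # context window
    n_gpu_layers=-1,           # offload all layers to GPU if one is available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve this equation: 3x - 7 = 14"}],
    max_tokens=256,
)
print(result["choices"][0]["message"]["content"])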

Known Limitations

This is an experimental first build. Expect quirks:

  • Routing isn't perfect - The model sometimes picks unexpected experts for edge cases
  • Simple arithmetic may not route optimally - ReasonableMath prefers structured problems
  • No explicit general knowledge expert - General queries route to whichever specialist seems closest
  • Multi-turn consistency - Expert switching mid-conversation hasn't been extensively tested

Technical Details

  • Merge Tool: mergekit (mixtral method)
  • Gate Mode: hidden (router trained on hidden states)
  • Data Type: bfloat16
  • Created: November 2025

Training Details

The individual experts were fine-tuned separately on domain-specific datasets before being combined into this MoE architecture. No additional training was performed on the combined model - routing behavior is determined by the mergekit process based on positive prompt patterns.
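
For reference, a mergekit-moe run of this kind is driven by a YAML config naming the base model, the gate mode, and per-expert positive prompts. The Python sketch below writes such a config; the expert paths and prompts are illustrative placeholders, not the actual values used for this merge.

import yaml  # pip install pyyaml

# Placeholder expert paths and prompts for illustration only; the real merge
# used the author's four fine-tunes and their own positive prompt sets.
config = {
    "base_model": "meta-llama/Llama-3.2-3B-Instruct",
    "gate_mode": "hidden",   # router weights derived from hidden states of the positive prompts
    "dtype": "bfloat16",
    "experts": [
        {"source_model": "path/to/ReasonableMath",
         "positive_prompts": ["Solve this equation", "algebra word problem"]},
        {"source_model": "path/to/ReWiz",
         "positive_prompts": ["Analyze this argument step by step"]},
        {"source_model": "path/to/VanRossum",
         "positive_prompts": ["Write a Python function", "debug this code"]},
        {"source_model": "path/to/Empathetic",
         "positive_prompts": ["I'm feeling overwhelmed", "relationship advice"]},
    ],
}

with open("moe_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then build the merged model with:
#   mergekit-moe moe_config.yaml ./theprint-12B-MoE-3A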

License

This model is released under Meta's Llama 3.2 Community License.

Acknowledgments

Built using mergekit by Arcee AI.
