---
language:
- en
license: apache-2.0
library_name: transformers
base_model:
- mistralai/Mistral-Nemo-Base-2407 # lightweight student
- Qwen/Qwen3-235B-A22B # thinking + non-thinking teacher
tags:
- distillation
- /think
- /nothink
- reasoning-transfer
- arcee-ai
---

# Arcee **Homunculus-12B**
**Homunculus** is a 12-billion-parameter instruction model distilled from **Qwen3-235B** onto the **Mistral-Nemo** backbone.
It was purpose-built to preserve Qwen’s two-mode interaction style, `/think` (deliberate chain-of-thought) and `/nothink` (concise answers), while running on a single consumer GPU.

---
## ✨ What’s special?
| Feature | Detail |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Reasoning-trace transfer** | Instead of copying just final probabilities, we align *full* logit trajectories, yielding more faithful reasoning. |
| **Total-Variation-Distance loss** | Used to match the teacher’s confidence distribution more closely and to smooth the loss landscape. |
| **Tokenizer replacement** | The original Mistral tokenizer was swapped for Qwen3's tokenizer. |
| **Dual interaction modes** | Use `/think` when you want transparent step-by-step reasoning (good for analysis & debugging). Use `/nothink` for terse, production-ready answers. Both switches are most reliable when placed in the system message. |
---
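
To make the Total-Variation-Distance row above concrete, here is a minimal, dependency-free sketch of the textbook quantity: half the L1 distance between the teacher's and student's next-token distributions, averaged over sequence positions. This card does not specify how the training loss was actually computed (temperature, masking, weighting), so the function names and toy logits below are illustrative assumptions, not the training code.

```python
import math

def softmax(logits):
    """Turn one position's logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tvd(p, q):
    """Total variation distance: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def sequence_tvd_loss(teacher_logits, student_logits):
    """Mean per-position TVD over a sequence of logit vectors (hypothetical helper)."""
    per_position = [
        tvd(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)

# Toy data: 2 positions, vocabulary of 3 tokens (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]
student = [[1.5, 0.7, -0.5], [0.2, -0.1, 2.0]]
loss = sequence_tvd_loss(teacher, student)
print(f"mean TVD: {loss:.4f}")
```

TVD is bounded in [0, 1] at every position: 0 means the two distributions are identical, 1 means they put their mass on disjoint tokens.
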
## Benchmark results
| Benchmark | Score |
| --------- | ----- |
| GPQA Diamond (average of 3) | 57.1% |
| MMLU | 67.5% |
## 🔧 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# /think mode - chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
# add_generation_prompt=True appends the assistant turn header so the model
# starts a reply; .to(model.device) puts the inputs on the model's device.
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
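
The two requests above differ only in the suffix appended to the system message. A small helper can make that switch explicit and reject typos in the mode name. `build_messages` and its parameters are hypothetical names introduced for illustration; they are not part of `transformers` or of this model's API.

```python
def build_messages(user_prompt, mode="think",
                   system_prompt="You are a helpful assistant."):
    """Build a chat message list with the /think or /nothink switch
    appended to the system message (illustrative helper)."""
    if mode not in ("think", "nothink"):
        raise ValueError(f"mode must be 'think' or 'nothink', got {mode!r}")
    return [
        {"role": "system", "content": f"{system_prompt} /{mode}"},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Why is the sky blue?", mode="nothink")
print(messages[0]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as in the Quick Start snippet.
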
## 💡 Intended Use & Limitations
Homunculus is designed for:
* **Research** on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
* **Lightweight production** deployments that need strong reasoning at <12 GB VRAM.
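
As a rough guide to the VRAM figure above, the memory needed for the weights alone is roughly parameter count times bytes per parameter. The sketch below uses the nominal 12 billion parameters from the model name and a few common precisions. This card does not state which precision or quantization the <12 GB figure assumes, and the estimate excludes the KV cache and activations, so treat it as a back-of-the-envelope calculation.

```python
PARAMS = 12e9  # nominal count from the model name; the exact count may differ

# Bytes per parameter for common weight formats (assumed, not from this card).
BYTES_PER_PARAM = {"bf16/fp16": 2.0, "int8": 1.0, "4-bit": 0.5}

def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:>9}: ~{weight_memory_gb(PARAMS, nbytes):.0f} GB for weights")
```
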
### Known limitations
* May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
* Long-context use (>32k tokens) is experimental; expect higher latency and memory use.
---