---
language:
- en
license: apache-2.0
library_name: transformers
base_model:
  - mistralai/Mistral-Nemo-Base-2407   # lightweight student
  - Qwen/Qwen3-235B-A22B              # thinking + non-thinking teacher
tags:
- distillation
- /think
- /nothink
- reasoning-transfer
- arcee-ai
---

![Homunculus Logo](https://huggingface.co/arcee-ai/Homunculus/resolve/main/logo.jpg)

# Arcee **Homunculus-12B**

**Homunculus** is a 12-billion-parameter instruction model distilled from **Qwen3-235B** onto the **Mistral-Nemo** backbone.
It was purpose-built to preserve Qwen’s two-mode interaction style—`/think` (deliberate chain-of-thought) and `/nothink` (concise answers)—while running on a single consumer GPU.

---

## ✨ What’s special?

| Feature                           | Detail                                                                                                                                               |
| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Reasoning-trace transfer**      | Instead of copying only final output probabilities, we align the *full* logit trajectories of teacher and student, yielding more faithful reasoning.        |
| **Total-Variation-Distance loss** | Distillation minimizes the total variation distance between student and teacher token distributions, better matching the teacher's confidence profile and smoothing the loss landscape (see the sketch after this table). |
| **Tokenizer replacement**         | The original Mistral tokenizer was swapped for Qwen3's tokenizer.                          |
| **Dual interaction modes**        | Use `/think` for transparent step-by-step reasoning (good for analysis and debugging); use `/nothink` for terse, production-ready answers. The mode tag is most reliable when placed in the system message.   |
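
As a rough illustration of the two training points above, the sketch below computes a total-variation-distance loss over full logit trajectories, i.e. at every token position rather than only on final answers. This is a minimal, hypothetical sketch, not Arcee's actual training code: the function name is made up for this example, and it assumes student and teacher share a vocabulary (which the tokenizer swap makes possible).

```python
import torch
import torch.nn.functional as F

def tvd_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """Total variation distance between student and teacher next-token
    distributions, averaged over every position in the sequence.

    Both tensors have shape (batch, seq_len, vocab_size) and must come
    from models sharing one vocabulary.  TVD(p, q) = 0.5 * sum |p - q|.
    """
    p = F.softmax(student_logits, dim=-1)           # student distribution
    q = F.softmax(teacher_logits.detach(), dim=-1)  # teacher is frozen
    tvd = 0.5 * (p - q).abs().sum(dim=-1)           # shape (batch, seq_len)
    return tvd.mean()
```

Because the penalty is applied at every step of the teacher's reasoning trace, the student is pushed to imitate *how* the teacher reaches an answer, not just the answer itself.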

---

## Benchmark results

| Benchmark | Score |
| --------- | ----- |
| GPQA Diamond (average of 3 runs) | 57.1% |
| MMLU | 67.5% |

## 🔧 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "arcee-ai/Homunculus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)

# /think mode - Chain-of-thought reasoning
messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "Why is the sky blue?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)  # move prompt ids to the model's device
output = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,  # sampling must be on for temperature to take effect
    temperature=0.7
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# /nothink mode - Direct answers
messages = [
    {"role": "system", "content": "You are a helpful assistant. /nothink"},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## 💡 Intended Use & Limitations

Homunculus is designed for:

* **Research** on reasoning-trace distillation, logit imitation, and mode-switchable assistants.
* **Lightweight production** deployments that need strong reasoning in under 12 GB of VRAM (see the quantized-loading sketch below).
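
The sketch below shows one way to meet that VRAM budget: loading the model with 4-bit NF4 quantization. This is an illustrative example, not an official recipe; it requires the optional `bitsandbytes` package, and the memory figure in the comment is an approximation rather than a number from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "arcee-ai/Homunculus"

# 4-bit NF4 quantization: a 12B model typically fits in roughly
# 7-8 GB of VRAM this way (approximate, not an official figure).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```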

### Known limitations

* May inherit biases from the Qwen3 teacher and internet-scale pretraining data.
* Long-context use (>32k tokens) is experimental; expect additional latency and memory overhead.

---