---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-a
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---

# Palmyra Mini Thinking A - GGUF

## Model Description

This repository contains GGUF quantized versions of the [palmyra-mini-thinking-a model](https://huggingface.co/Writer/palmyra-mini-thinking-a), based on the Qwen2 architecture. This model is specifically designed for reasoning tasks with explicit thinking capabilities through special `<think>` and `</think>` tokens. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

## Available Quantizations

### BF16 (Brain Float 16)
- **File**: `palmyra-mini-thinking-a-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest quality reasoning, requires more memory

### Q8_0 (8-bit Quantization)
- **File**: `palmyra-mini-thinking-a-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency
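
To fetch one of these files directly from this repository, the Hugging Face CLI works well. A minimal sketch, assuming the filenames listed above:

```bash
# Download the Q8_0 quantization (requires: pip install huggingface_hub)
huggingface-cli download Writer/palmyra-mini-thinking-a-GGUF \
  palmyra-mini-thinking-a-Q8_0.gguf --local-dir .
```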

## Quick Start

### Installation

```bash
# Build llama.cpp from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Note: newer llama.cpp releases build with CMake instead:
#   cmake -B build && cmake --build build --config Release

# Or download a pre-built release binary
```

### Usage

```bash
# Run with a thinking prompt
# (newer llama.cpp builds name this binary llama-cli)
./main -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 512

# Interactive mode
./main -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i
```
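
llama.cpp also ships an HTTP server that exposes these models over a REST API. The following is a minimal sketch, assuming a recent build (where the binary is named `llama-server`; older builds call it `./server`) and the default `/completion` endpoint:

```bash
# Start the server on port 8080
./llama-server -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -c 4096 --port 8080

# Query it; ending the prompt with <|Assistant|><think> triggers reasoning mode
curl http://localhost:8080/completion -d '{
  "prompt": "What is 15% of 240?<|Assistant|><think>",
  "n_predict": 256
}'
```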

### LM Studio Use
Steps to download a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).

### Ollama Use
Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-a-GGUF/resolve/main/ollama-README-A.md?download=true) for steps on how to load this model into Ollama.
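
For quick orientation (the linked guide is the authoritative reference), a minimal Ollama setup typically looks like the sketch below, assuming the Q8_0 file has been downloaded locally; the full guide also covers the chat template the model expects:

```bash
# Create a minimal Modelfile pointing at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./palmyra-mini-thinking-a-Q8_0.gguf
EOF

# Register and run the model
ollama create palmyra-mini-thinking-a -f Modelfile
ollama run palmyra-mini-thinking-a
```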


## Technical Specifications

### Model Architecture
- **Model Type**: `qwen2` (Qwen2 Architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion parameters
- **Base Precision**: bfloat16
- **Specialization**: Reasoning and thinking tasks

### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,665 |
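
As a consistency check, 12 attention heads × 128 head dimension = 1,536, matching the hidden size, and the 2 key-value heads imply grouped-query attention with 6 query heads sharing each KV head.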

### Attention Mechanism
- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used

### Thinking Capabilities
- **Thinking Tokens**: `<think>` (151648) and `</think>` (151649)
- **Reasoning Mode**: Explicit step-by-step reasoning
- **Special Features**: Designed for chain-of-thought reasoning
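
For illustration, responses open with a reasoning trace wrapped in the thinking tokens, followed by the final answer. The sketch below is hand-written to show the shape, not verbatim model output:

```
<think>
Area = length × width = 12 × 8 = 96 cm².
Perimeter = 2 × (length + width) = 2 × 20 = 40 cm.
</think>
The area is 96 cm² and the perimeter is 40 cm.
```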

### Quantization Comparison
| Format | Size | Precision | Reasoning Quality | Speed | Memory |
|--------|------|-----------|-------------------|-------|--------|
| BF16   | 3.3GB| 16-bit    | Highest          | Slower| High   |
| Q8_0   | 1.8GB| 8-bit     | High             | Faster| Medium |

### File Structure
```
palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf    # BF16 quantization
└── palmyra-mini-thinking-a-Q8_0.gguf    # Q8_0 quantization
```

## Performance Characteristics

### Hardware Requirements
- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**: 
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)

### Inference Performance
- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller size, faster inference with preserved reasoning capabilities

## Training Details

### Tokenizer
- **Type**: LlamaTokenizerFast with a 151,665-token vocabulary
- **Special Tokens**:
  - BOS Token ID: 151646 (`<|begin▁of▁sentence|>`)
  - EOS Token ID: 151643 (`<|end▁of▁sentence|>`)
  - Pad Token ID: 151643 (`<|end▁of▁sentence|>`)
  - Think Start: 151648 (`<think>`)
  - Think End: 151649 (`</think>`)

### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
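
For reference, RMSNorm rescales each hidden vector $x \in \mathbb{R}^d$ as

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma,$$

where the ε above (1e-06) guards against division by zero and γ is a learned per-dimension scale.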

### Chat Template
The model uses a specialized chat template for reasoning:
- User messages: prefixed with `<|User|>`
- Assistant messages: prefixed with `<|Assistant|>`
- Thinking mode: automatically initiated with the `<think>` token
- Tool calling is supported
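
Putting these together, a fully formatted single-turn prompt looks like the line below (assuming the DeepSeek-style role markers above; the trailing `<think>` drops the model straight into reasoning mode):

```
<|User|>What is 17 × 24?<|Assistant|><think>
```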

## Usage Examples

### Reasoning Task
```bash
./main -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 300 \
  --temp 0.7
```

### Step-by-Step Explanation
```bash
./main -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|><think>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9
```


## Known Limitations

1. **Context Length**: Default context is 4,096 tokens, though the model supports up to 131,072; it can be raised with llama.cpp's `-c` flag, as shown below
2. **Thinking Overhead**: Explicit thinking increases response length and generation time
3. **Quantization Trade-offs**: Lower bit quantizations may affect reasoning quality
4. **Platform Optimization**: Performance varies across different hardware configurations
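
A minimal sketch of raising the context window (memory use grows with context size):

```bash
# Run with a 16,384-token context instead of the 4,096-token default
./main -m palmyra-mini-thinking-a-Q8_0.gguf -c 16384 \
  -p "Summarize the following document ...<|Assistant|><think>" \
  -n 512
```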

## Compatibility

- **llama.cpp**: Compatible with recent versions
- **Frameworks**: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Special Features**: Requires framework support for thinking tokens

## License

Apache 2.0

---

# Original model card: palmyra-mini-thinking-a

## Model Details

**Model Name:** palmyra-mini-thinking-a

**Version:** 1.0

**Type:** Generative AI Language Model


## Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by an outstanding score of 0.886 on the 'MATH500' benchmark, showcasing a robust ability to solve complex mathematical problems. The model's strength in quantitative challenges is further confirmed by its score of 0.8287 on 'gsm8k (strict-match)', which demonstrates proficiency in multi-step arithmetic reasoning, and its aptitude for high-level, olympiad-style problem solving shows in its scores of 0.8 on 'AMC23' and 0.5481 on 'Olympiadbench (extractive_match)'. The model also shows strong potential in the coding domain, achieving 0.5631 on 'Codeforces (pass_rate)', indicating competence in generating correct solutions for programming challenges.

## Benchmark Performance

This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.

| Benchmark                                                        |    Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match)                                             | 0.8287   |
| minerva_math(exact_match)                                        | 0.3842   |
| mmlu_pro(exact_match)                                            | 0.2748   |
| hendrycks_math                                                   | 0.0054   |
| ifeval (inst_level_loose_acc)                                    | 0.3657   |
| mathqa (acc)                                                     | 0.4171   |
| humaneval (pass@1)                                               | 0.2378   |
| BBH (get-answer)(exact_match)                                    | 0.462    |
| mbpp                                                             | 0.304    |
| leaderboard_musr (acc_norm)                                      | 0.3413   |
| gpqa (lighteval gpqa diamond_pass@1:8_samples)                   | 0.3826   |
| AIME24(pass@1)(avg-of-1)                                         | 0.4333   |
| AIME25(pass@1)(avg-of-1)                                         | 0.3667   |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1784   |
| AMC23                                                            | 0.8      |
| MATH500                                                          | 0.886    |
| Minerva                                                          | 0.3493   |
| Olympiadbench (extractive_match)                                 | 0.5481   |
| Codecontests (pass_rate)                                         | 0.1778   |
| Codeforces (pass_rate)                                           | 0.5631   |
| Taco (pass_rate)                                                 | 0.3083   |
| APPS (all_levels)                                                | 0.0447   |
| HMMT23 (extractive_match)                                        | 0.1      |
| Average                                                          | 0.380839 |

## Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.

## Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.