---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-a
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---

# Palmyra Mini Thinking A - GGUF

## Model Description

This repository contains GGUF quantized versions of the [palmyra-mini-thinking-a model](https://huggingface.co/Writer/palmyra-mini-thinking-a), based on the Qwen2 architecture. The model is designed for reasoning tasks, producing explicit step-by-step "thinking" between the special `<think>` and `</think>` tokens. The GGUF files are optimized for efficient inference across a range of hardware platforms using llama.cpp and compatible frameworks.

## Available Quantizations

### BF16 (Brain Float 16)

- **File**: `palmyra-mini-thinking-a-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest quality reasoning, requires more memory

### Q8_0 (8-bit Quantization)

- **File**: `palmyra-mini-thinking-a-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency

## Quick Start

### Installation

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Or use a pre-built binary
```

### Usage

```bash
# Run with a thinking prompt
./main -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
  -n 512

# Interactive mode
./main -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i
```

## LM Studio Use

Steps for downloading a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).

## Ollama Use

Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-a-GGUF/resolve/main/ollama-README-A.md?download=true) for steps on how to load this model into Ollama.

## Technical Specifications

### Model Architecture

- **Model Type**: `qwen2` (Qwen2 Architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion
- **Base Precision**: bfloat16
- **Specialization**: Reasoning and thinking tasks

### Core Parameters

| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,665 |

### Attention Mechanism

- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used

### Thinking Capabilities

- **Thinking Tokens**: `<think>` (151648) and `</think>` (151649)
- **Reasoning Mode**: Explicit step-by-step reasoning
- **Special Features**: Designed for chain-of-thought reasoning

### Quantization Comparison

| Format | Size | Precision | Reasoning Quality | Speed | Memory |
|--------|------|-----------|-------------------|-------|--------|
| BF16 | 3.3GB | 16-bit | Highest | Slower | High |
| Q8_0 | 1.8GB | 8-bit | High | Faster | Medium |

### File Structure

```
palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf   # BF16 weights
└── palmyra-mini-thinking-a-Q8_0.gguf   # Q8_0 quantization
```

## Performance Characteristics

### Hardware Requirements

- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**:
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)

### Inference Performance

- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller, faster inference with reasoning capabilities largely preserved

## Training Details

### Tokenizer

- **Type**: LlamaTokenizerFast with a 151,665-token vocabulary
- **Special Tokens**:
  - BOS Token ID: 151646
  - EOS Token ID: 151643
  - Pad Token ID: 151643
  - Think Start: 151648 (`<think>`)
  - Think End: 151649 (`</think>`)

### Model Configuration

- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0

### Chat Template

The model uses a specialized chat template for reasoning:

- User messages: introduced with the `<|User|>` role token
- Assistant messages: introduced with the `<|Assistant|>` role token (as in the prompts above)
- Thinking mode: automatically initiated with the `<think>` token
- Tool calling support

## Usage Examples

### Reasoning Task

```bash
./main -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
  -n 300 \
  --temp 0.7
```

### Problem Solving

```bash
./main -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9
```

## Known Limitations

1. **Context Length**: The default context is 4,096 tokens, though the model supports up to 131,072
2. **Thinking Overhead**: Explicit thinking increases response length and generation time
3. **Quantization Trade-offs**: Lower-bit quantizations may affect reasoning quality
4. **Platform Optimization**: Performance varies across hardware configurations

## Compatibility

- **llama.cpp**: Compatible with recent versions
- **Frameworks**: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Special Features**: Requires framework support for thinking tokens

## License

Apache 2.0

---

# Original model card: palmyra-mini-thinking-a

## Model Details

**Model Name:** palmyra-mini-thinking-a

**Version:** 1.0

**Type:** Generative AI Language Model

## Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by an outstanding score of 0.886 on the MATH500 benchmark, showcasing a robust ability to solve complex mathematical problems. The model's strength in quantitative challenges is further confirmed by its score of 0.8287 on gsm8k (strict-match), which demonstrates proficiency in multi-step arithmetic reasoning, and its aptitude for high-level problem solving is reflected in a score of 0.8 on AMC23. The model also shows strong potential in the coding domain, achieving 0.5631 on Codeforces (pass_rate), indicating competence in generating correct solutions to programming challenges, along with 0.5481 on Olympiadbench (extractive_match), a measure of olympiad-level mathematical reasoning.

## Benchmark Performance

This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.
| Benchmark | Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.8287 |
| minerva_math (exact_match) | 0.3842 |
| mmlu_pro (exact_match) | 0.2748 |
| hendrycks_math | 0.0054 |
| ifeval (inst_level_loose_acc) | 0.3657 |
| mathqa (acc) | 0.4171 |
| humaneval (pass@1) | 0.2378 |
| BBH (get-answer) (exact_match) | 0.462 |
| mbpp | 0.304 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa lighteval gpqa diamond_pass@1:8_samples | 0.3826 |
| AIME24 (pass@1) (avg-of-1) | 0.4333 |
| AIME25 (pass@1) (avg-of-1) | 0.3667 |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1784 |
| AMC23 | 0.8 |
| MATH500 | 0.886 |
| Minerva | 0.3493 |
| Olympiadbench (extractive_match) | 0.5481 |
| Codecontests (pass_rate) | 0.1778 |
| Codeforces (pass_rate) | 0.5631 |
| Taco (pass_rate) | 0.3083 |
| APPS (all_levels) | 0.0447 |
| HMMT23 (extractive_match) | 0.1 |
| Average | 0.380839 |

## Intended Use

This model is intended for research and development in generative AI, particularly for tasks requiring mathematical and logical reasoning.

## Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
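The Average row in the Benchmark Performance table above is the unweighted mean of the 23 per-benchmark scores; a minimal sketch reproducing it:

```python
# Per-benchmark scores copied from the table above, in the original order.
scores = [
    0.8287, 0.3842, 0.2748, 0.0054, 0.3657, 0.4171, 0.2378, 0.462,
    0.304, 0.3413, 0.3826, 0.4333, 0.3667, 0.1784, 0.8, 0.886,
    0.3493, 0.5481, 0.1778, 0.5631, 0.3083, 0.0447, 0.1,
]

# Unweighted mean across all 23 benchmarks.
average = sum(scores) / len(scores)
print(f"{average:.6f}")  # 0.380839
```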