---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-a
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---
# Palmyra Mini Thinking A - GGUF
## Model Description
This repository contains GGUF quantized versions of the [palmyra-mini-thinking-a model](https://huggingface.co/Writer/palmyra-mini-thinking-a), which is based on the Qwen2 architecture. The model is designed for reasoning tasks and produces explicit thinking traces delimited by the special `<think>` and `</think>` tokens. The GGUF quantizations are optimized for efficient inference across a range of hardware using llama.cpp and compatible frameworks.
## Available Quantizations
### BF16 (Brain Float 16)
- **File**: `palmyra-mini-thinking-a-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest quality reasoning, requires more memory
### Q8_0 (8-bit Quantization)
- **File**: `palmyra-mini-thinking-a-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency
## Quick Start
### Installation
```bash
# Build llama.cpp (recent versions build with CMake; older versions used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or download a pre-built release binary
```
### Usage
```bash
# Run with a reasoning prompt (the binary is `llama-cli` in recent
# llama.cpp builds; older builds name it `main`)
./llama-cli -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
  -n 512

# Interactive mode
./llama-cli -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i
```
## LM Studio Use
Steps for downloading a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).
## Ollama Use
Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-a-GGUF/resolve/main/ollama-README-A.md?download=true) for steps on how to load this model into Ollama.
## Technical Specifications
### Model Architecture
- **Model Type**: `qwen2` (Qwen2 Architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion
- **Base Precision**: bfloat16
- **Specialization**: Reasoning and thinking tasks
### Core Parameters
| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,665 |
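With only 2 key-value heads (grouped-query attention) against 12 attention heads, the per-token KV cache stays small. A quick sketch of the arithmetic from the table values, assuming 16-bit cache entries:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_entry
layers, kv_heads, head_dim = 28, 2, 128
bytes_per_entry = 2  # fp16/bf16 cache entries (assumption)

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry
print(kv_bytes_per_token)                 # 28672 bytes, i.e. 28 KiB per token

# Cache for the full 4,096-token default context
print(kv_bytes_per_token * 4096 / 2**20)  # 112.0 MiB
```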
### Attention Mechanism
- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used
### Thinking Capabilities
- **Thinking Tokens**: `<think>` (151648) and `</think>` (151649)
- **Reasoning Mode**: Explicit step-by-step reasoning
- **Special Features**: Designed for chain-of-thought reasoning
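Frameworks that surface the raw completion will include the thinking trace inline. A minimal sketch of separating the reasoning from the final answer (the helper name and sample text are illustrative, not part of the model's API):

```python
def split_thinking(completion: str) -> tuple[str, str]:
    """Split a raw completion into (thinking_trace, final_answer).

    Assumes the model emits `<think>...</think>` before the answer; if no
    closing tag is present, the whole text is treated as the answer.
    """
    start, end = "<think>", "</think>"
    if start in completion and end in completion:
        head, _, rest = completion.partition(start)
        trace, _, answer = rest.partition(end)
        return trace.strip(), (head + answer).strip()
    return "", completion.strip()

thinking, answer = split_thinking(
    "<think>Area = 12 * 8 = 96; perimeter = 2 * (12 + 8) = 40.</think>"
    "The area is 96 cm² and the perimeter is 40 cm."
)
print(answer)  # The area is 96 cm² and the perimeter is 40 cm.
```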
### Quantization Comparison
| Format | Size | Precision | Reasoning Quality | Speed | Memory |
|--------|------|-----------|-------------------|-------|--------|
| BF16 | 3.3GB| 16-bit | Highest | Slower| High |
| Q8_0 | 1.8GB| 8-bit | High | Faster| Medium |
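The file sizes in the table follow from per-weight storage cost: BF16 stores 2 bytes per parameter, while Q8_0 stores blocks of 32 int8 weights plus one fp16 scale (34 bytes per 32 weights, about 1.06 bytes per parameter). A rough check under these assumptions:

```python
params = 1.7e9  # ~1.7 billion parameters

bf16_gb = params * 2 / 1e9          # 2 bytes per weight
q8_0_gb = params * (34 / 32) / 1e9  # 34 bytes per 32-weight block

print(round(bf16_gb, 1))  # 3.4 -> close to the 3.3 GB file
print(round(q8_0_gb, 1))  # 1.8 -> matches the 1.8 GB file
```

The small gap for BF16 comes from metadata and a handful of tensors kept at other precisions.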
### File Structure
```
palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf # BF16 quantization
└── palmyra-mini-thinking-a-Q8_0.gguf # Q8_0 quantization
```
## Performance Characteristics
### Hardware Requirements
- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**:
- BF16: 4GB+ RAM recommended
- Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)
### Inference Performance
- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller size, faster inference with preserved reasoning capabilities
## Training Details
### Tokenizer
- **Type**: LlamaTokenizerFast with a 151,665-token vocabulary
- **Special Tokens**:
  - BOS Token ID: 151646 (`<|begin▁of▁sentence|>`)
  - EOS Token ID: 151643 (`<|end▁of▁sentence|>`)
  - Pad Token ID: 151643 (`<|end▁of▁sentence|>`)
  - Think Start: 151648 (`<think>`)
  - Think End: 151649 (`</think>`)
### Model Configuration
- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0
### Chat Template
The model uses a specialized chat template for reasoning:
- User messages are introduced with the `<|User|>` token
- Assistant messages are introduced with the `<|Assistant|>` token
- Thinking mode: the assistant automatically begins its response with a `<think>` block
- Tool calling support
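Assuming the model uses DeepSeek-style role tokens (`<|User|>` / `<|Assistant|>`, the same markers the usage examples above append to the prompt), a raw prompt for llama.cpp can be assembled like this. The function is a sketch; the authoritative template ships in the GGUF file's `tokenizer.chat_template` metadata:

```python
def build_prompt(messages: list[dict]) -> str:
    """Flatten a chat into a raw prompt string.

    Assumes DeepSeek-style role markers (an assumption for this sketch);
    check the GGUF metadata for the canonical chat template.
    """
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"<|User|>{msg['content']}")
        elif msg["role"] == "assistant":
            parts.append(f"<|Assistant|>{msg['content']}")
    parts.append("<|Assistant|>")  # cue the model to respond (and think)
    return "".join(parts)

prompt = build_prompt([{"role": "user", "content": "What is 12 * 8?"}])
print(prompt)  # <|User|>What is 12 * 8?<|Assistant|>
```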
## Usage Examples
### Reasoning Task
```bash
./llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf \
-p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
-n 300 \
--temp 0.7
```
### Problem Solving
```bash
./llama-cli -m palmyra-mini-thinking-a-BF16.gguf \
-p "Explain the water cycle step by step.<|Assistant|>" \
-n 400 \
--temp 0.8 \
--top-p 0.9
```
## Known Limitations
1. **Context Length**: Default context is 4,096 tokens, though the model supports up to 131,072
2. **Thinking Overhead**: Explicit thinking increases response length and generation time
3. **Quantization Trade-offs**: Lower bit quantizations may affect reasoning quality
4. **Platform Optimization**: Performance varies across different hardware configurations
## Compatibility
- **llama.cpp**: Compatible with recent versions
- **Frameworks**: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Special Features**: Requires framework support for thinking tokens
## License
Apache 2.0
---
# Original model card: palmyra-mini-thinking-a
## Model Details
**Model Name:** palmyra-mini-thinking-a
**Version:** 1.0
**Type:** Generative AI Language Model
## Model Description
The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by an outstanding score of 0.886 on the 'MATH500' benchmark, showcasing a robust ability to solve complex mathematical problems. The strength of the model in quantitative challenges is further confirmed by its score of 0.8287 on 'gsm8k (strict-match)', which demonstrates proficiency in multi-step arithmetic reasoning. Additionally, the model proves its aptitude for high-level problem-solving with a score of 0.8 on 'AMC23'. The model also shows strong potential in the coding domain, achieving a score of 0.5631 on 'Codeforces (pass_rate)' and 0.5481 on 'Olympiadbench (extractive_match)', indicating competence in generating correct solutions for programming challenges.
## Benchmark Performance
This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.
| Benchmark | Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.8287 |
| minerva_math(exact_match) | 0.3842 |
| mmlu_pro(exact_match) | 0.2748 |
| hendrycks_math | 0.0054 |
| ifeval (inst_level_loose_acc) | 0.3657 |
| mathqa (acc) | 0.4171 |
| humaneval (pass@1) | 0.2378 |
| BBH (get-answer)(exact_match) | 0.462 |
| mbpp | 0.304 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa lighteval gpqa diamond_pass@1:8_samples | 0.3826 |
| AIME24(pass@1)(avg-of-1) | 0.4333 |
| AIME25(pass@1)(avg-of-1) | 0.3667 |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1784 |
| AMC23 | 0.8 |
| MATH500 | 0.886 |
| Minerva | 0.3493 |
| Olympiadbench (extractive_match) | 0.5481 |
| Codecontests (pass_rate) | 0.1778 |
| Codeforces (pass_rate) | 0.5631 |
| Taco (pass_rate) | 0.3083 |
| APPS (all_levels) | 0.0447 |
| HMMT23 (extractive_match) | 0.1 |
| Average | 0.380839 |
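The reported average is the unweighted mean of the 23 individual scores above, which can be checked directly:

```python
# Benchmark scores in table order
scores = [
    0.8287, 0.3842, 0.2748, 0.0054, 0.3657, 0.4171, 0.2378, 0.462,
    0.304, 0.3413, 0.3826, 0.4333, 0.3667, 0.1784, 0.8, 0.886,
    0.3493, 0.5481, 0.1778, 0.5631, 0.3083, 0.0447, 0.1,
]
print(round(sum(scores) / len(scores), 6))  # 0.380839
```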
## Intended Use
This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
## Limitations
The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.
## Ethical Considerations
As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.