Palmyra Mini Thinking A - GGUF

Model Description

This repository contains GGUF quantized versions of the palmyra-mini-thinking-a model, based on the Qwen2 architecture. This model is specifically designed for reasoning tasks with explicit thinking capabilities through special <think> and </think> tokens. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

Available Quantizations

BF16 (Brain Float 16)

  • File: palmyra-mini-thinking-a-BF16.gguf
  • Size: 3.3GB
  • Precision: 16-bit brain float
  • Use Case: Highest quality reasoning, requires more memory

Q8_0 (8-bit Quantization)

  • File: palmyra-mini-thinking-a-Q8_0.gguf
  • Size: 1.8GB
  • Precision: 8-bit integer
  • Use Case: Good balance of reasoning quality and efficiency

Quick Start

Installation

# Install llama.cpp (recent versions build with CMake; the old Makefile build has been removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or download a pre-built binary from the llama.cpp releases page

Usage

# Run with a thinking prompt (recent llama.cpp builds name the binary llama-cli; older builds used main)
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 512

# Interactive mode
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i

LM Studio Use

Steps for downloading a model through LM Studio's Discover tab can be found in the LM Studio documentation.

Ollama Use

Please see the guide in this repo for steps on loading this model into Ollama; a minimal Modelfile sketch follows below.
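
As a minimal sketch (the file path and the model name palmyra-mini-thinking-a below are illustrative choices, not fixed by this repo), a Modelfile can point Ollama directly at the downloaded GGUF file:

# Modelfile -- FROM assumes the quantized file sits in the current directory
FROM ./palmyra-mini-thinking-a-Q8_0.gguf

# Register the model under an arbitrary name, then run it
ollama create palmyra-mini-thinking-a -f Modelfile
ollama run palmyra-mini-thinking-a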

Technical Specifications

Model Architecture

  • Model Type: qwen2 (Qwen2 Architecture)
  • Architecture: Qwen2ForCausalLM
  • Parameters: ~1.7 billion parameters
  • Base Precision: bfloat16
  • Specialization: Reasoning and thinking tasks

Core Parameters

Parameter            Value
Hidden Size          1,536
Intermediate Size    8,960
Number of Layers     28
Attention Heads      12
Key-Value Heads      2
Head Dimension       128
Vocabulary Size      151,665

Attention Mechanism

  • Attention Type: Full attention across all 28 layers
  • Max Position Embeddings: 131,072 tokens
  • Context Length: 4,096 tokens by default (see the sketch below for raising it)
  • Sliding Window: Not used
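
Memory planning follows directly from these numbers. A back-of-the-envelope shell sketch, assuming a 16-bit KV cache (llama.cpp's default):

# KV cache per token: 2 (K and V) x 28 layers x 2 KV heads x 128 head dim x 2 bytes (f16)
echo $(( 2 * 28 * 2 * 128 * 2 ))                          # 28672 bytes, i.e. 28 KiB per token
# At the default 4,096-token context this is ~112 MiB:
echo $(( 2 * 28 * 2 * 128 * 2 * 4096 / 1024 / 1024 ))     # 112 (MiB)

# The context window can be raised toward the 131,072-token maximum with -c:
./build/bin/llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf -c 16384 -i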

Thinking Capabilities

  • Thinking Tokens: <think> (151648) and </think> (151649)
  • Reasoning Mode: Explicit step-by-step reasoning
  • Special Features: Designed for chain-of-thought reasoning (see below for post-processing the thinking block)
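
Generated output places the chain of thought between <think> and </think>, followed by the final answer. Where a downstream application only needs the answer, the thinking block can be stripped after generation; a minimal sketch, assuming the generation was captured in output.txt:

# Remove the <think>...</think> span, keeping only the final answer
perl -0777 -pe 's/<think>.*?<\/think>\s*//s' output.txt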

Quantization Comparison

Format   Size    Precision   Reasoning Quality   Speed    Memory
BF16     3.3GB   16-bit      Highest             Slower   High
Q8_0     1.8GB   8-bit       High                Faster   Medium

File Structure

palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf    # BF16 quantization
└── palmyra-mini-thinking-a-Q8_0.gguf    # Q8_0 quantization
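
To fetch a single quantization without cloning the whole repository, the Hugging Face CLI can download one file at a time; a sketch assuming the repository id shown on the model page (Writer/palmyra-mini-thinking-a-GGUF):

pip install -U huggingface_hub
huggingface-cli download Writer/palmyra-mini-thinking-a-GGUF \
  palmyra-mini-thinking-a-Q8_0.gguf --local-dir .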

Performance Characteristics

Hardware Requirements

  • CPU: Modern x86_64 or ARM64 processor
  • Memory:
    • BF16: 4GB+ RAM recommended
    • Q8_0: 3GB+ RAM recommended
  • Platform: Cross-platform (Windows, macOS, Linux)

Inference Performance

  • BF16: Highest reasoning quality, slower inference
  • Q8_0: ~45% smaller, faster inference with reasoning quality largely preserved (see the benchmark sketch below)
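
These trade-offs can be measured directly on your own hardware with llama.cpp's bundled llama-bench tool, which accepts multiple -m flags; a minimal sketch, assuming both files sit in the current directory:

# Reports prompt-processing and token-generation throughput for each model
./build/bin/llama-bench \
  -m palmyra-mini-thinking-a-BF16.gguf \
  -m palmyra-mini-thinking-a-Q8_0.gguf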

Tokenizer and Model Configuration

Tokenizer

  • Type: LlamaTokenizerFast with a 151,665-token vocabulary
  • Special Tokens:
    • BOS Token ID: 151646
    • EOS Token ID: 151643
    • Pad Token ID: 151643 (same as EOS)
    • Think Start: 151648 (<think>)
    • Think End: 151649 (</think>)

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-06); see below for the standard definitions of SiLU and RMSNorm
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
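
For reference, these are the standard definitions of the two functions (nothing model-specific):

\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},
\qquad
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}\,\gamma_i,
\quad \epsilon = 10^{-6}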

Chat Template

The model uses a specialized chat template for reasoning:

  • User messages: wrapped in the template's corresponding user tag
  • Assistant messages: introduced with the <|Assistant|> tag, as seen in the usage examples
  • Thinking mode: automatically initiated with the <think> token
  • Tool calling: supported by the template

Usage Examples

Reasoning Task

./build/bin/llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 300 \
  --temp 0.7

Problem Solving

./build/bin/llama-cli -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|><think>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9

Known Limitations

  1. Context Length: Default context is 4,096 tokens, though the model supports up to 131,072
  2. Thinking Overhead: Explicit thinking increases response length and generation time
  3. Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
  4. Platform Optimization: Performance varies across different hardware configurations

Compatibility

  • llama.cpp: Compatible with recent versions
  • Frameworks: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
  • Platforms: Windows, macOS, Linux (x86_64, ARM64)
  • Special Features: Requires framework support for thinking tokens

License

Apache 2.0


Original model card: palmyra-mini-thinking-a

Model Details

Model Name: palmyra-mini-thinking-a

Version: 1.0

Type: Generative AI Language Model

Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by a score of 0.886 on the MATH500 benchmark, showing a robust ability to solve complex mathematical problems. Its strength in quantitative challenges is confirmed by a score of 0.8287 on gsm8k (strict-match), which demonstrates proficiency in multi-step arithmetic reasoning, while scores of 0.8 on AMC23 and 0.5481 on OlympiadBench (extractive_match) point to aptitude for olympiad-level problem solving. The model also shows strong potential in the coding domain, achieving 0.5631 on Codeforces (pass_rate), indicating competence in generating correct solutions for programming challenges.

Benchmark Performance

This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.

Benchmark                                                          Score
gsm8k (strict-match)                                               0.8287
minerva_math (exact_match)                                         0.3842
mmlu_pro (exact_match)                                             0.2748
hendrycks_math                                                     0.0054
ifeval (inst_level_loose_acc)                                      0.3657
mathqa (acc)                                                       0.4171
humaneval (pass@1)                                                 0.2378
BBH (get-answer) (exact_match)                                     0.462
mbpp                                                               0.304
leaderboard_musr (acc_norm)                                        0.3413
gpqa (lighteval gpqa diamond, pass@1, 8 samples)                   0.3826
AIME24 (pass@1) (avg-of-1)                                         0.4333
AIME25 (pass@1) (avg-of-1)                                         0.3667
LiveCodeBench-codegen (livecodebench/code_generation_lite v4_v5)   0.1784
AMC23                                                              0.8
MATH500                                                            0.886
Minerva                                                            0.3493
OlympiadBench (extractive_match)                                   0.5481
CodeContests (pass_rate)                                           0.1778
Codeforces (pass_rate)                                             0.5631
TACO (pass_rate)                                                   0.3083
APPS (all_levels)                                                  0.0447
HMMT23 (extractive_match)                                          0.1
Average                                                            0.380839

Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.

Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
