Palmyra Mini Thinking A - GGUF

Model Description

This repository contains GGUF quantized versions of the palmyra-mini-thinking-a model, based on the Qwen2 architecture. This model is specifically designed for reasoning tasks with explicit thinking capabilities through special <think> and </think> tokens. GGUF quantizations are optimized for efficient inference across various hardware platforms using llama.cpp and compatible frameworks.

Available Quantizations

BF16 (Brain Float 16)

  • File: palmyra-mini-thinking-a-BF16.gguf
  • Size: 3.3GB
  • Precision: 16-bit brain float
  • Use Case: Highest quality reasoning, requires more memory

Q8_0 (8-bit Quantization)

  • File: palmyra-mini-thinking-a-Q8_0.gguf
  • Size: 1.8GB
  • Precision: 8-bit integer
  • Use Case: Good balance of reasoning quality and efficiency

Quick Start

Installation

# Install llama.cpp (recent versions build with CMake; the old Makefile build has been removed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Or download a pre-built binary from the llama.cpp releases page

Usage

# Run with a thinking prompt (recent llama.cpp builds name the binary llama-cli; older builds used main)
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 512

# Interactive mode
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i

LM Studio Use

Steps for downloading a model through LM Studio's Discover tab can be found in the LM Studio documentation.

Ollama Use

Please see the guide in this repo for steps on loading this model into Ollama; a minimal Modelfile sketch follows below.
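
As a minimal sketch (the file path and the model name palmyra-mini-thinking-a below are illustrative choices, not fixed by this repo), a Modelfile can point Ollama directly at the downloaded GGUF file:

# Modelfile -- FROM assumes the quantized file sits in the current directory
FROM ./palmyra-mini-thinking-a-Q8_0.gguf

# Register the model under an arbitrary name, then run it
ollama create palmyra-mini-thinking-a -f Modelfile
ollama run palmyra-mini-thinking-a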

Technical Specifications

Model Architecture

  • Model Type: qwen2 (Qwen2 Architecture)
  • Architecture: Qwen2ForCausalLM
  • Parameters: ~1.7 billion parameters
  • Base Precision: bfloat16
  • Specialization: Reasoning and thinking tasks

Core Parameters

Parameter            Value
Hidden Size          1,536
Intermediate Size    8,960
Number of Layers     28
Attention Heads      12
Key-Value Heads      2
Head Dimension       128
Vocabulary Size      151,665

Attention Mechanism

  • Attention Type: Full attention across all 28 layers
  • Max Position Embeddings: 131,072 tokens
  • Context Length: 4,096 tokens by default (see the sketch below for raising it)
  • Sliding Window: Not used
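
Memory planning follows directly from these numbers. A back-of-the-envelope shell sketch, assuming a 16-bit KV cache (llama.cpp's default):

# KV cache per token: 2 (K and V) x 28 layers x 2 KV heads x 128 head dim x 2 bytes (f16)
echo $(( 2 * 28 * 2 * 128 * 2 ))                          # 28672 bytes, i.e. 28 KiB per token
# At the default 4,096-token context this is ~112 MiB:
echo $(( 2 * 28 * 2 * 128 * 2 * 4096 / 1024 / 1024 ))     # 112 (MiB)

# The context window can be raised toward the 131,072-token maximum with -c:
./build/bin/llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf -c 16384 -i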

Thinking Capabilities

  • Thinking Tokens: <think> (151648) and </think> (151649)
  • Reasoning Mode: Explicit step-by-step reasoning
  • Special Features: Designed for chain-of-thought reasoning (see below for post-processing the thinking block)
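
Generated output places the chain of thought between <think> and </think>, followed by the final answer. Where a downstream application only needs the answer, the thinking block can be stripped after generation; a minimal sketch, assuming the generation was captured in output.txt:

# Remove the <think>...</think> span, keeping only the final answer
perl -0777 -pe 's/<think>.*?<\/think>\s*//s' output.txt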

Quantization Comparison

Format   Size    Precision   Reasoning Quality   Speed    Memory
BF16     3.3GB   16-bit      Highest             Slower   High
Q8_0     1.8GB   8-bit       High                Faster   Medium

File Structure

palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf    # BF16 quantization
└── palmyra-mini-thinking-a-Q8_0.gguf    # Q8_0 quantization
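
To fetch a single quantization without cloning the whole repository, the Hugging Face CLI can download one file at a time; a sketch assuming the repository id shown on the model page (Writer/palmyra-mini-thinking-a-GGUF):

pip install -U huggingface_hub
huggingface-cli download Writer/palmyra-mini-thinking-a-GGUF \
  palmyra-mini-thinking-a-Q8_0.gguf --local-dir .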

Performance Characteristics

Hardware Requirements

  • CPU: Modern x86_64 or ARM64 processor
  • Memory:
    • BF16: 4GB+ RAM recommended
    • Q8_0: 3GB+ RAM recommended
  • Platform: Cross-platform (Windows, macOS, Linux)

Inference Performance

  • BF16: Highest reasoning quality, slower inference
  • Q8_0: ~45% smaller, faster inference with reasoning quality largely preserved (see the benchmark sketch below)
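
These trade-offs can be measured directly on your own hardware with llama.cpp's bundled llama-bench tool, which accepts multiple -m flags; a minimal sketch, assuming both files sit in the current directory:

# Reports prompt-processing and token-generation throughput for each model
./build/bin/llama-bench \
  -m palmyra-mini-thinking-a-BF16.gguf \
  -m palmyra-mini-thinking-a-Q8_0.gguf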

Tokenizer and Model Configuration

Tokenizer

  • Type: LlamaTokenizerFast with a 151,665-token vocabulary
  • Special Tokens:
    • BOS Token ID: 151646
    • EOS Token ID: 151643
    • Pad Token ID: 151643 (same as EOS)
    • Think Start: 151648 (<think>)
    • Think End: 151649 (</think>)

Model Configuration

  • Hidden Activation: SiLU (Swish)
  • Normalization: RMSNorm (ε = 1e-06); see below for the standard definitions of SiLU and RMSNorm
  • Initializer Range: 0.02
  • Attention Dropout: 0.0
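
For reference, these are the standard definitions of the two functions (nothing model-specific):

\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},
\qquad
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}\,\gamma_i,
\quad \epsilon = 10^{-6}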

Chat Template

The model uses a specialized chat template for reasoning:

  • User messages: wrapped in the template's corresponding user tag
  • Assistant messages: introduced with the <|Assistant|> tag, as seen in the usage examples
  • Thinking mode: automatically initiated with the <think> token
  • Tool calling: supported by the template

Usage Examples

Reasoning Task

./build/bin/llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 300 \
  --temp 0.7

Problem Solving

./build/bin/llama-cli -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|><think>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9

Known Limitations

  1. Context Length: Default context is 4,096 tokens, though the model supports up to 131,072
  2. Thinking Overhead: Explicit thinking increases response length and generation time
  3. Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
  4. Platform Optimization: Performance varies across different hardware configurations

Compatibility

  • llama.cpp: Compatible with recent versions
  • Frameworks: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
  • Platforms: Windows, macOS, Linux (x86_64, ARM64)
  • Special Features: Requires framework support for thinking tokens

License

Apache 2.0


Original model card: palmyra-mini-thinking-a

Model Details

Model Name: palmyra-mini-thinking-a

Version: 1.0

Type: Generative AI Language Model

Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by a score of 0.886 on the MATH500 benchmark, showing a robust ability to solve complex mathematical problems. Its strength in quantitative challenges is confirmed by a score of 0.8287 on gsm8k (strict-match), which demonstrates proficiency in multi-step arithmetic reasoning, while scores of 0.8 on AMC23 and 0.5481 on OlympiadBench (extractive_match) point to aptitude for olympiad-level problem solving. The model also shows strong potential in the coding domain, achieving 0.5631 on Codeforces (pass_rate), indicating competence in generating correct solutions for programming challenges.

Benchmark Performance

This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.

Benchmark                                                          Score
gsm8k (strict-match)                                               0.8287
minerva_math (exact_match)                                         0.3842
mmlu_pro (exact_match)                                             0.2748
hendrycks_math                                                     0.0054
ifeval (inst_level_loose_acc)                                      0.3657
mathqa (acc)                                                       0.4171
humaneval (pass@1)                                                 0.2378
BBH (get-answer) (exact_match)                                     0.462
mbpp                                                               0.304
leaderboard_musr (acc_norm)                                        0.3413
gpqa (lighteval gpqa diamond, pass@1, 8 samples)                   0.3826
AIME24 (pass@1) (avg-of-1)                                         0.4333
AIME25 (pass@1) (avg-of-1)                                         0.3667
LiveCodeBench-codegen (livecodebench/code_generation_lite v4_v5)   0.1784
AMC23                                                              0.8
MATH500                                                            0.886
Minerva                                                            0.3493
OlympiadBench (extractive_match)                                   0.5481
CodeContests (pass_rate)                                           0.1778
Codeforces (pass_rate)                                             0.5631
TACO (pass_rate)                                                   0.3083
APPS (all_levels)                                                  0.0447
HMMT23 (extractive_match)                                          0.1
Average                                                            0.380839

Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.

Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
