# Palmyra Mini Thinking A - GGUF
## Model Description
This repository contains GGUF quantized versions of the palmyra-mini-thinking-a model, based on the Qwen2 architecture. The model is specifically designed for reasoning tasks, with explicit thinking capabilities expressed through the special `<think>` and `</think>` tokens. GGUF quantizations are optimized for efficient inference across a range of hardware platforms using llama.cpp and compatible frameworks.
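To illustrate the protocol: prompts are typically terminated with `<|Assistant|><think>` so that generation begins inside the reasoning block; the model emits its chain of thought, closes it with `</think>`, and then produces the final answer. The transcript below is an illustrative sketch of that shape, not verbatim model output:

```
Prompt:  What is 12 * 8?<|Assistant|><think>
Output:  I need to multiply 12 by 8. 12 * 8 = 96.
         </think>
         12 * 8 = 96.
```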
## Available Quantizations

### BF16 (Brain Float 16)

- File: `palmyra-mini-thinking-a-BF16.gguf`
- Size: 3.3GB
- Precision: 16-bit brain float
- Use Case: Highest quality reasoning, requires more memory

### Q8_0 (8-bit Quantization)

- File: `palmyra-mini-thinking-a-Q8_0.gguf`
- Size: 1.8GB
- Precision: 8-bit integer
- Use Case: Good balance of reasoning quality and efficiency
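Either file can be fetched directly from the Hub with the `huggingface-cli` tool; the repository id `Writer/palmyra-mini-thinking-a-GGUF` is assumed here:

```bash
pip install -U "huggingface_hub[cli]"

# Download only the Q8_0 file (swap in the BF16 filename for full precision)
huggingface-cli download Writer/palmyra-mini-thinking-a-GGUF \
  palmyra-mini-thinking-a-Q8_0.gguf --local-dir .
```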
## Quick Start

### Installation

```bash
# Build llama.cpp from source (older releases used `make` and named the binary ./main)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# The CLI is produced at ./build/bin/llama-cli; pre-built binaries are also available
```
### Usage

```bash
# Run with a thinking prompt
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 512

# Interactive mode
./build/bin/llama-cli -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i
```
## LM Studio Use

Steps for downloading a model through LM Studio's Discover tab are covered in the LM Studio documentation.
## Ollama Use

Please see the guide in this repo for steps on how to load this model into Ollama; a minimal Modelfile sketch is shown below.
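As a starting point, the following is a minimal sketch, assuming the Q8_0 file sits in the current directory; the parameter choice is illustrative, and the repo's guide remains authoritative:

```bash
# Write a minimal Modelfile pointing at the local GGUF file
cat > Modelfile <<'EOF'
FROM ./palmyra-mini-thinking-a-Q8_0.gguf
PARAMETER temperature 0.7
EOF

# Register and run the model under a local name
ollama create palmyra-mini-thinking-a -f Modelfile
ollama run palmyra-mini-thinking-a
```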
## Technical Specifications

### Model Architecture

- Model Type: `qwen2` (Qwen2 architecture)
- Architecture: `Qwen2ForCausalLM`
- Parameters: ~1.7 billion
- Base Precision: bfloat16
- Specialization: Reasoning and thinking tasks
### Core Parameters

| Parameter | Value |
|---|---|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,665 |
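The 12:2 ratio of attention heads to key-value heads means the model uses grouped-query attention, which keeps the KV cache small. A back-of-the-envelope estimate from the table above, assuming llama.cpp's default f16 KV cache:

```bash
# Per-token KV cache: 2 tensors (K and V) x 2 KV heads x 128 head dim
#                     x 28 layers x 2 bytes (f16)
echo $(( 2 * 2 * 128 * 28 * 2 ))          # 28672 bytes (~28 KiB) per token

# At the 4,096-token default context:
echo $(( 2 * 2 * 128 * 28 * 2 * 4096 ))   # 117440512 bytes (~112 MiB)
```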
### Attention Mechanism
- Attention Type: Full attention across all 28 layers
- Max Position Embeddings: 131,072 tokens
- Context Length: 4,096 tokens (default)
- Sliding Window: Not used
### Thinking Capabilities

- Thinking Tokens: `<think>` (151648) and `</think>` (151649)
- Reasoning Mode: Explicit step-by-step reasoning
- Special Features: Designed for chain-of-thought reasoning
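Because the reasoning block is delimited by these tokens, a completion is easy to post-process down to the final answer. A sketch using the llama-cli invocation from above, assuming a single `</think>` marker in the output:

```bash
# Generate, then drop everything up to and including the closing </think>
llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "What is 17 * 23?<|Assistant|><think>" -n 256 2>/dev/null |
  perl -0777 -pe 's/.*?<\/think>\s*//s'
```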
## Quantization Comparison

| Format | Size | Precision | Reasoning Quality | Speed | Memory |
|---|---|---|---|---|---|
| BF16 | 3.3GB | 16-bit | Highest | Slower | High |
| Q8_0 | 1.8GB | 8-bit | High | Faster | Medium |
## File Structure

```
palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf   # BF16 quantization
└── palmyra-mini-thinking-a-Q8_0.gguf   # Q8_0 quantization
```
## Performance Characteristics

### Hardware Requirements
- CPU: Modern x86_64 or ARM64 processor
- Memory:
- BF16: 4GB+ RAM recommended
- Q8_0: 3GB+ RAM recommended
- Platform: Cross-platform (Windows, macOS, Linux)
### Inference Performance
- BF16: Highest reasoning quality, slower inference
- Q8_0: ~45% smaller size, faster inference with preserved reasoning capabilities
## Training Details

### Tokenizer
- Type: LlamaTokenizerFast with a 151,665-token vocabulary
- Special Tokens:
  - BOS Token ID: 151646
  - EOS Token ID: 151643
  - Pad Token ID: 151643
  - Think Start: `<think>` (151648)
  - Think End: `</think>` (151649)
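These IDs can be cross-checked against the original (non-GGUF) checkpoint; the sketch below assumes the transformers package and access to the Writer/palmyra-mini-thinking-a repository:

```bash
python3 - <<'EOF'
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Writer/palmyra-mini-thinking-a")
for t in ("<think>", "</think>"):
    print(t, tok.convert_tokens_to_ids(t))                 # expect 151648 and 151649
print("bos:", tok.bos_token_id, "eos:", tok.eos_token_id)  # expect 151646 and 151643
EOF
```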
### Model Configuration
- Hidden Activation: SiLU (Swish)
- Normalization: RMSNorm (ε = 1e-06)
- Initializer Range: 0.02
- Attention Dropout: 0.0
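For reference, the activation and normalization listed above have their standard definitions, with $d$ the hidden size, $\epsilon = 10^{-6}$, and learned scale $\gamma$:

$$
\mathrm{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}},
\qquad
\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \gamma
$$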
## Chat Template

The model uses a specialized chat template for reasoning:

- User messages: wrapped in the template's user role tags
- Assistant messages: introduced with the `<|Assistant|>` tag (as in the usage examples below)
- Thinking mode: automatically initiated with the `<think>` token
- Tool calling: supported
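Rather than hand-writing the role tags, the template shipped with the original checkpoint can render the prompt; a sketch, again assuming transformers and access to Writer/palmyra-mini-thinking-a:

```bash
python3 - <<'EOF'
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Writer/palmyra-mini-thinking-a")
messages = [{"role": "user", "content": "What is 12 * 8?"}]

# add_generation_prompt appends the assistant/thinking preamble
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
EOF
```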
## Usage Examples

### Reasoning Task
```bash
llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|><think>" \
  -n 300 \
  --temp 0.7
```
### Problem Solving
```bash
llama-cli -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|><think>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9
```
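For programmatic use, llama.cpp's bundled HTTP server (built from the same checkout as llama-cli) exposes a native /completion endpoint; the port and sampling values below are illustrative:

```bash
# Serve the Q8_0 file over HTTP
llama-server -m palmyra-mini-thinking-a-Q8_0.gguf -c 4096 --port 8080 &

# n_predict caps the number of generated tokens
curl -s http://localhost:8080/completion -d '{
  "prompt": "Explain the water cycle step by step.<|Assistant|><think>",
  "n_predict": 400,
  "temperature": 0.8
}'
```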
## Known Limitations

- Context Length: Default context is 4,096 tokens, though the model supports up to 131,072 (see the example after this list for raising the window at run time)
- Thinking Overhead: Explicit thinking increases response length and generation time
- Quantization Trade-offs: Lower bit quantizations may affect reasoning quality
- Platform Optimization: Performance varies across different hardware configurations
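The runtime window is set with llama.cpp's `-c` flag; KV-cache memory grows linearly with it (see the estimate under Core Parameters), so raise it only as far as needed. The document placeholder in the prompt is illustrative:

```bash
# Run with a 16K context instead of the 4,096-token default
llama-cli -m palmyra-mini-thinking-a-Q8_0.gguf -c 16384 \
  -p "Summarize the following document: ...<|Assistant|><think>" \
  -n 512
```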
## Compatibility
- llama.cpp: Compatible with recent versions
- Frameworks: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- Platforms: Windows, macOS, Linux (x86_64, ARM64)
- Special Features: Requires framework support for thinking tokens
## License
Apache 2.0
---

## Original model card: palmyra-mini-thinking-a
### Model Details

- Model Name: palmyra-mini-thinking-a
- Version: 1.0
- Type: Generative AI Language Model
### Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by an outstanding score of 0.886 on the MATH500 benchmark, showcasing a robust ability to solve complex mathematical problems. The model's strength in quantitative challenges is further confirmed by its score of 0.8287 on gsm8k (strict-match), demonstrating proficiency in multi-step arithmetic reasoning, and by its aptitude for olympiad-level problem solving, with scores of 0.8 on AMC23 and 0.5481 on OlympiadBench (extractive_match). The model also shows strong potential in the coding domain, achieving a pass rate of 0.5631 on Codeforces, indicating competence in generating correct solutions for programming challenges.
### Benchmark Performance
This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.
| Benchmark | Score |
|---|---|
| gsm8k (strict-match) | 0.8287 |
| minerva_math (exact_match) | 0.3842 |
| mmlu_pro (exact_match) | 0.2748 |
| hendrycks_math | 0.0054 |
| ifeval (inst_level_loose_acc) | 0.3657 |
| mathqa (acc) | 0.4171 |
| humaneval (pass@1) | 0.2378 |
| BBH (get-answer, exact_match) | 0.462 |
| mbpp | 0.304 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa (lighteval gpqa_diamond, pass@1: 8 samples) | 0.3826 |
| AIME24 (pass@1, avg-of-1) | 0.4333 |
| AIME25 (pass@1, avg-of-1) | 0.3667 |
| LiveCodeBench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1784 |
| AMC23 | 0.8 |
| MATH500 | 0.886 |
| Minerva | 0.3493 |
| OlympiadBench (extractive_match) | 0.5481 |
| CodeContests (pass_rate) | 0.1778 |
| Codeforces (pass_rate) | 0.5631 |
| TACO (pass_rate) | 0.3083 |
| APPS (all_levels) | 0.0447 |
| HMMT23 (extractive_match) | 0.1 |
| **Average** | **0.380839** |
### Intended Use
This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
### Limitations
The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.
### Ethical Considerations
As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.