---
license: apache-2.0
base_model:
- Writer/palmyra-mini-thinking-a
tags:
- gguf
- qwen2
- palmyra
- thinking
- reasoning
- quantized
---

# Palmyra Mini Thinking A - GGUF

## Model Description

This repository contains GGUF quantized versions of the [palmyra-mini-thinking-a model](https://huggingface.co/Writer/palmyra-mini-thinking-a), based on the Qwen2 architecture. The model is designed for reasoning tasks, producing explicit step-by-step "thinking" between the special `<think>` and `</think>` tokens. The GGUF files are optimized for efficient inference across a range of hardware platforms using llama.cpp and compatible frameworks.

## Available Quantizations

### BF16 (Brain Float 16)

- **File**: `palmyra-mini-thinking-a-BF16.gguf`
- **Size**: 3.3GB
- **Precision**: 16-bit brain float
- **Use Case**: Highest quality reasoning, requires more memory

### Q8_0 (8-bit Quantization)

- **File**: `palmyra-mini-thinking-a-Q8_0.gguf`
- **Size**: 1.8GB
- **Precision**: 8-bit integer
- **Use Case**: Good balance of reasoning quality and efficiency

## Quick Start

### Installation

```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# Or use a pre-built binary
```

### Usage

```bash
# Run with a thinking prompt
./main -m /path/to/palmyra-mini-thinking-a-BF16.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
  -n 512

# Interactive mode
./main -m /path/to/palmyra-mini-thinking-a-Q8_0.gguf -i
```

## LM Studio Use

Steps for downloading a model through the **Discover** tab can be found [here](https://lmstudio.ai/docs/app/basics/download-model).

## Ollama Use

Please see [the guide in this repo](https://huggingface.co/Writer/palmyra-mini-thinking-a-GGUF/resolve/main/ollama-README-A.md?download=true) for steps on how to load this model into Ollama.

## Technical Specifications

### Model Architecture

- **Model Type**: `qwen2` (Qwen2 Architecture)
- **Architecture**: `Qwen2ForCausalLM`
- **Parameters**: ~1.7 billion
- **Base Precision**: bfloat16
- **Specialization**: Reasoning and thinking tasks

### Core Parameters

| Parameter | Value |
|-----------|-------|
| Hidden Size | 1,536 |
| Intermediate Size | 8,960 |
| Number of Layers | 28 |
| Attention Heads | 12 |
| Key-Value Heads | 2 |
| Head Dimension | 128 |
| Vocabulary Size | 151,665 |

### Attention Mechanism

- **Attention Type**: Full attention across all 28 layers
- **Max Position Embeddings**: 131,072 tokens
- **Context Length**: 4,096 tokens (default)
- **Sliding Window**: Not used

### Thinking Capabilities

- **Thinking Tokens**: `<think>` (151648) and `</think>` (151649)
- **Reasoning Mode**: Explicit step-by-step reasoning
- **Special Features**: Designed for chain-of-thought reasoning

### Quantization Comparison

| Format | Size | Precision | Reasoning Quality | Speed | Memory |
|--------|------|-----------|-------------------|-------|--------|
| BF16 | 3.3GB | 16-bit | Highest | Slower | High |
| Q8_0 | 1.8GB | 8-bit | High | Faster | Medium |

### File Structure

```
palmyra-mini-thinking-a/GGUF/
├── palmyra-mini-thinking-a-BF16.gguf   # BF16 weights
└── palmyra-mini-thinking-a-Q8_0.gguf   # Q8_0 quantization
```

## Performance Characteristics

### Hardware Requirements

- **CPU**: Modern x86_64 or ARM64 processor
- **Memory**:
  - BF16: 4GB+ RAM recommended
  - Q8_0: 3GB+ RAM recommended
- **Platform**: Cross-platform (Windows, macOS, Linux)

### Inference Performance

- **BF16**: Highest reasoning quality, slower inference
- **Q8_0**: ~45% smaller, faster inference with reasoning capabilities largely preserved

## Training Details

### Tokenizer

- **Type**: LlamaTokenizerFast with a 151,665-token vocabulary
- **Special Tokens**:
  - BOS Token ID: 151646
  - EOS Token ID: 151643
  - Pad Token ID: 151643
  - Think Start: 151648 (`<think>`)
  - Think End: 151649 (`</think>`)

### Model Configuration

- **Hidden Activation**: SiLU (Swish)
- **Normalization**: RMSNorm (ε = 1e-06)
- **Initializer Range**: 0.02
- **Attention Dropout**: 0.0

### Chat Template

The model uses a specialized chat template for reasoning:

- User messages: introduced with the `<|User|>` role token
- Assistant messages: introduced with the `<|Assistant|>` role token (as in the prompts above)
- Thinking mode: automatically initiated with the `<think>` token
- Tool calling support

## Usage Examples

### Reasoning Task

```bash
./main -m palmyra-mini-thinking-a-Q8_0.gguf \
  -p "A rectangle has a length of 12 cm and width of 8 cm. What is its area and perimeter?<|Assistant|>" \
  -n 300 \
  --temp 0.7
```

### Problem Solving

```bash
./main -m palmyra-mini-thinking-a-BF16.gguf \
  -p "Explain the water cycle step by step.<|Assistant|>" \
  -n 400 \
  --temp 0.8 \
  --top-p 0.9
```

## Known Limitations

1. **Context Length**: The default context is 4,096 tokens, though the model supports up to 131,072
2. **Thinking Overhead**: Explicit thinking increases response length and generation time
3. **Quantization Trade-offs**: Lower-bit quantizations may affect reasoning quality
4. **Platform Optimization**: Performance varies across hardware configurations

## Compatibility

- **llama.cpp**: Compatible with recent versions
- **Frameworks**: llama.cpp, Ollama, LM Studio, GPT4All, and other GGUF-compatible tools
- **Platforms**: Windows, macOS, Linux (x86_64, ARM64)
- **Special Features**: Requires framework support for thinking tokens

## License

Apache 2.0

---

# Original model card: palmyra-mini-thinking-a

## Model Details

**Model Name:** palmyra-mini-thinking-a

**Version:** 1.0

**Type:** Generative AI Language Model

## Model Description

The palmyra-mini-thinking-a model demonstrates exceptional performance in advanced mathematical reasoning and competitive programming. Its capabilities are highlighted by an outstanding score of 0.886 on the MATH500 benchmark, showcasing a robust ability to solve complex mathematical problems. The model's strength in quantitative challenges is further confirmed by its score of 0.8287 on gsm8k (strict-match), which demonstrates proficiency in multi-step arithmetic reasoning, and its aptitude for high-level problem solving is reflected in a score of 0.8 on AMC23. The model also shows strong potential in the coding domain, achieving 0.5631 on Codeforces (pass_rate), indicating competence in generating correct solutions to programming challenges, along with 0.5481 on Olympiadbench (extractive_match), a measure of olympiad-level mathematical reasoning.

## Benchmark Performance

This section provides a detailed breakdown of the palmyra-mini-thinking-a model's performance across a standardized set of industry benchmarks. The data is presented in its original order from the source evaluation.
| Benchmark | Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.8287 |
| minerva_math (exact_match) | 0.3842 |
| mmlu_pro (exact_match) | 0.2748 |
| hendrycks_math | 0.0054 |
| ifeval (inst_level_loose_acc) | 0.3657 |
| mathqa (acc) | 0.4171 |
| humaneval (pass@1) | 0.2378 |
| BBH (get-answer) (exact_match) | 0.462 |
| mbpp | 0.304 |
| leaderboard_musr (acc_norm) | 0.3413 |
| gpqa lighteval gpqa diamond_pass@1:8_samples | 0.3826 |
| AIME24 (pass@1) (avg-of-1) | 0.4333 |
| AIME25 (pass@1) (avg-of-1) | 0.3667 |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.1784 |
| AMC23 | 0.8 |
| MATH500 | 0.886 |
| Minerva | 0.3493 |
| Olympiadbench (extractive_match) | 0.5481 |
| Codecontests (pass_rate) | 0.1778 |
| Codeforces (pass_rate) | 0.5631 |
| Taco (pass_rate) | 0.3083 |
| APPS (all_levels) | 0.0447 |
| HMMT23 (extractive_match) | 0.1 |
| Average | 0.380839 |

## Intended Use

This model is intended for research and development in generative AI, particularly for tasks requiring mathematical and logical reasoning.

## Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.

## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.
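The Average row in the Benchmark Performance table above is the unweighted mean of the 23 per-benchmark scores; a minimal sketch reproducing it:

```python
# Per-benchmark scores copied from the table above, in the original order.
scores = [
    0.8287, 0.3842, 0.2748, 0.0054, 0.3657, 0.4171, 0.2378, 0.462,
    0.304, 0.3413, 0.3826, 0.4333, 0.3667, 0.1784, 0.8, 0.886,
    0.3493, 0.5481, 0.1778, 0.5631, 0.3083, 0.0447, 0.1,
]

# Unweighted mean across all 23 benchmarks.
average = sum(scores) / len(scores)
print(f"{average:.6f}")  # 0.380839
```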