qwen3-0.6B-unimarc-grpo GGUF Quantized Versions

Model Description

This repository contains quantized versions of the fine-tuned Geraldine/qwen3-0.6B-unimarc-grpo model, which was trained with GRPO (Group Relative Policy Optimization) and LoRA adapters to transform raw bibliographic metadata into structured UNIMARC XML records.

This repository provides various GGUF quantized formats, allowing efficient inference on different hardware setups, including CPUs and GPUs.


Available GGUF Files

The following quantized versions of the model were generated using llama.cpp:

| File Name | Description |
|---|---|
| qwen3-0.6B-unimarc-grpo-Q2_K.gguf | Ultra-low precision (2-bit) for extreme compression |
| qwen3-0.6B-unimarc-grpo-Q3_K_M.gguf | 3-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf | 4-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q5_K_M.gguf | 5-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q6_K.gguf | 6-bit quantization |
| qwen3-0.6B-unimarc-grpo-Q8_0.gguf | 8-bit quantization, a balance between speed and accuracy |
| qwen3-0.6B-unimarc-grpo-fp16.gguf | 16-bit floating point (fp16) version |

How to Use the Quantized Model

Prompts

See Geraldine/qwen3-0.6B-unimarc-grpo for the recommended prompting template.

Running the Model with llama.cpp

To run the model with llama.cpp, use the llama-cli binary (named main in older releases):

./llama-cli -m qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf -p "Convert the following bibliographic raw data into Unimarc/XML record: ..."

For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
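
Alternatively, the same GGUF file can be loaded from Python through the llama-cpp-python bindings. This is a minimal sketch, assuming a local copy of the Q4_K_M file and reusing the sampling values shown in the Ollama example below; adjust both to your setup:

from llama_cpp import Llama

# Load the quantized GGUF file (local path is an assumption; point it at your copy)
llm = Llama(model_path="./qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf", n_ctx=4096)

output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Convert the following bibliographic raw data into Unimarc/XML record: ..."}
    ],
    temperature=0.6,
    top_p=0.95,
)
print(output["choices"][0]["message"]["content"])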

Running the Model with GPT4All

If using GPT4All, load the GGUF model with:

from gpt4all import GPT4All

model_name = "qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf"
# GPT4All resolves the file name in its default models directory; pass model_path="/path/to/folder" to use another location
model = GPT4All(model_name)
response = model.generate("Convert the following bibliographic raw data into Unimarc/XML record:")
print(response)
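
The GPT4All bindings also provide a chat_session context manager, which is convenient for setting a system prompt once; a minimal sketch, where the system prompt text is a placeholder to be filled with the recommended template linked above:

from gpt4all import GPT4All

model = GPT4All("qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf")
# The session keeps the system prompt and chat history for subsequent generate() calls
with model.chat_session(system_prompt="..."):
    response = model.generate("Convert the following bibliographic raw data into Unimarc/XML record: ...")
    print(response)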

Running the Model with Ollama

If using Ollama, pull and run the GGUF model directly from the Hugging Face Hub (the :Q8_0 tag selects the quantization):

ollama run hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0

The local Ollama server can then be queried over HTTP, for example with Python requests against the native chat endpoint:

import requests
import json

# System prompt: use the recommended prompting template from Geraldine/qwen3-0.6B-unimarc-grpo
system_prompt = "..."

url = "http://localhost:11434/api/chat"

payload = json.dumps({
  "model": "hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0",
  "messages": [
    {
      "role": "system",
      "content": system_prompt
    },
    {
      "role": "user",
      "content": "Title: ...\nAuthors: ..."
    }
  ],
  "option": {
    "num_ctx": 4096,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0
  },
  "stream": False
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

# With "stream": False, the generated record is returned under message.content
print(response.json()["message"]["content"])
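
Ollama also exposes an OpenAI-compatible endpoint, so the same local model can be called through the openai Python client; a minimal sketch, assuming the default Ollama address and the sampling values from the example above:

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

system_prompt = "..."  # use the recommended prompting template from Geraldine/qwen3-0.6B-unimarc-grpo

completion = client.chat.completions.create(
    model="hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Title: ...\nAuthors: ..."},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(completion.choices[0].message.content)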

Choosing the Right Quantization Format

  • Lower-bit models (Q2_K, Q3_K_M, Q4_K_M): Best for low-memory devices, but may lose some accuracy.
  • Mid-range (Q5_K_M, Q6_K): Good trade-off between speed and precision.
  • Higher precision (Q8_0, fp16): Best for accuracy but requires more memory.

For CPU inference, Q4_K_M or Q5_K_M is recommended for a balance between efficiency and performance.
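
A single quantized file can also be fetched programmatically with huggingface_hub before loading it in any of the runtimes above; a minimal sketch, using the Q4_K_M file from the table as an example:

from huggingface_hub import hf_hub_download

# Download one specific quantization from the repository and get its local path
gguf_path = hf_hub_download(
    repo_id="Geraldine/qwen3-0.6B-unimarc-grpo-GGUF",
    filename="qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf",
)
print(gguf_path)  # pass this path to llama.cpp, llama-cpp-python, GPT4All, etc.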


Limitations & Future Improvements

  • Limitations: Because a fixed prompt template was used during RL training, inference should use the same prompt template to obtain reliable results
  • Future Work:
    • Further optimizations for CPU inference
    • Additional fine-tuning on larger datasets

Citation & Acknowledgments

If you use this model in research or production, please cite:

@misc{geoffroy2025qwen3unimarcgguf,
  author = {Géraldine Geoffroy},
  title = {qwen3-0.6B-unimarc-grpo GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF}
}