qwen3-0.6B-unimarc-grpo GGUF Quantized Versions

Model Description

This repository contains quantized versions of the fine-tuned Geraldine/qwen3-0.6B-unimarc-grpo model, which was trained with GRPO (Group Relative Policy Optimization) and LoRA adapters to transform raw bibliographic metadata into structured UNIMARC XML records.

This repository provides various GGUF quantized formats, allowing efficient inference on different hardware setups, including CPUs and GPUs.


Available GGUF Files

The following quantized versions of the model were generated using llama.cpp:

| File Name | Description |
|---|---|
| qwen3-0.6B-unimarc-grpo-Q2_K.gguf | Ultra-low precision (2-bit) for extreme compression |
| qwen3-0.6B-unimarc-grpo-Q3_K_M.gguf | 3-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf | 4-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q5_K_M.gguf | 5-bit quantization with mixed precision |
| qwen3-0.6B-unimarc-grpo-Q6_K.gguf | 6-bit quantization |
| qwen3-0.6B-unimarc-grpo-Q8_0.gguf | 8-bit quantization, a balance between speed and accuracy |
| qwen3-0.6B-unimarc-grpo-fp16.gguf | 16-bit floating point (fp16) version |

How to Use the Quantized Model

Prompts

See Geraldine/qwen3-0.6B-unimarc-grpo for the recommended prompting template.

Running the Model with llama.cpp

To run the model with llama.cpp, use the llama-cli binary (named main in older releases):

./llama-cli -m qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf -p "Convert the following bibliographic raw data into Unimarc/XML record: ..."

For optimal performance, ensure you select the right quantized version based on your hardware capabilities.
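
Alternatively, the same GGUF file can be loaded from Python through the llama-cpp-python bindings. This is a minimal sketch, assuming a local copy of the Q4_K_M file and reusing the sampling values shown in the Ollama example below; adjust both to your setup:

from llama_cpp import Llama

# Load the quantized GGUF file (local path is an assumption; point it at your copy)
llm = Llama(model_path="./qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf", n_ctx=4096)

output = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Convert the following bibliographic raw data into Unimarc/XML record: ..."}
    ],
    temperature=0.6,
    top_p=0.95,
)
print(output["choices"][0]["message"]["content"])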

Running the Model with GPT4All

If using GPT4All, load the GGUF model with:

from gpt4all import GPT4All

model_name = "qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf"
# GPT4All resolves the file name in its default models directory; pass model_path="/path/to/folder" to use another location
model = GPT4All(model_name)
response = model.generate("Convert the following bibliographic raw data into Unimarc/XML record:")
print(response)
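
The GPT4All bindings also provide a chat_session context manager, which is convenient for setting a system prompt once; a minimal sketch, where the system prompt text is a placeholder to be filled with the recommended template linked above:

from gpt4all import GPT4All

model = GPT4All("qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf")
# The session keeps the system prompt and chat history for subsequent generate() calls
with model.chat_session(system_prompt="..."):
    response = model.generate("Convert the following bibliographic raw data into Unimarc/XML record: ...")
    print(response)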

Running the Model with Ollama

If using Ollama, pull and run the GGUF model directly from the Hugging Face Hub (the :Q8_0 tag selects the quantization):

ollama run hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0

The local Ollama server can then be queried over HTTP, for example with Python requests against the native chat endpoint:

import requests
import json

# System prompt: use the recommended prompting template from Geraldine/qwen3-0.6B-unimarc-grpo
system_prompt = "..."

url = "http://localhost:11434/api/chat"

payload = json.dumps({
  "model": "hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0",
  "messages": [
    {
      "role": "system",
      "content": system_prompt
    },
    {
      "role": "user",
      "content": "Title: ...\nAuthors: ..."
    }
  ],
  "option": {
    "num_ctx": 4096,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0
  },
  "stream": False
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

# With "stream": False, the generated record is returned under message.content
print(response.json()["message"]["content"])
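
Ollama also exposes an OpenAI-compatible endpoint, so the same local model can be called through the openai Python client; a minimal sketch, assuming the default Ollama address and the sampling values from the example above:

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is required by the client but ignored by Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

system_prompt = "..."  # use the recommended prompting template from Geraldine/qwen3-0.6B-unimarc-grpo

completion = client.chat.completions.create(
    model="hf.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF:Q8_0",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Title: ...\nAuthors: ..."},
    ],
    temperature=0.6,
    top_p=0.95,
)
print(completion.choices[0].message.content)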

Choosing the Right Quantization Format

  • Lower-bit models (Q2_K, Q3_K_M, Q4_K_M): Best for low-memory devices, but may lose some accuracy.
  • Mid-range (Q5_K_M, Q6_K): Good trade-off between speed and precision.
  • Higher precision (Q8_0, fp16): Best for accuracy but requires more memory.

For CPU inference, Q4_K_M or Q5_K_M is recommended for a balance between efficiency and performance.
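
A single quantized file can also be fetched programmatically with huggingface_hub before loading it in any of the runtimes above; a minimal sketch, using the Q4_K_M file from the table as an example:

from huggingface_hub import hf_hub_download

# Download one specific quantization from the repository and get its local path
gguf_path = hf_hub_download(
    repo_id="Geraldine/qwen3-0.6B-unimarc-grpo-GGUF",
    filename="qwen3-0.6B-unimarc-grpo-Q4_K_M.gguf",
)
print(gguf_path)  # pass this path to llama.cpp, llama-cpp-python, GPT4All, etc.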


Limitations & Future Improvements

  • Limitations: Because a fixed prompt template was used during RL training, inference should use the same prompt template to obtain reliable results
  • Future Work:
    • Further optimizations for CPU inference
    • Additional fine-tuning on larger datasets

Citation & Acknowledgments

If you use this model in research or production, please cite:

@misc{geoffroy2025qwen3unimarcgguf,
  author = {Géraldine Geoffroy},
  title = {qwen3-0.6B-unimarc-grpo GGUF Quantized Versions},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Geraldine/qwen3-0.6B-unimarc-grpo-GGUF}
}