Qwen3-0.6B UNIMARC/XML Generator (Fine-tuned with GRPO + LoRA)

This repository provides a fine-tuned version of Qwen/Qwen3-0.6B, trained using GRPO (Group Relative Policy Optimization) and LoRA adapters to transform raw bibliographic metadata into structured UNIMARC XML records.

Unlike typical text-to-XML generation models, this model is optimized for reasoning and interpretability: it uses Chain-of-Thought prompting to work through each cataloging step before composing the final UNIMARC output, with the aim of keeping the record both semantically aligned with the input and structurally valid.


Use Case

Automatically generate UNIMARC/XML records from unstructured bibliographic metadata. Useful for libraries, cataloging systems, digital archiving, and metadata enrichment pipelines.


Model Details

  • Base Model: Qwen/Qwen3-0.6B
  • Training Framework: 🤗 Transformers + TRL (GRPO)
  • Parameter-Efficient Fine-Tuning: LoRA adapters (r=8)
  • Training Objective: Structured XML generation guided by domain-specific prompts and multi-criteria reward functions
  • Reward Signals:
    • Format validity (<record> structure, fields, subfields)
    • Field-level accuracy using XML diffing
    • Semantic mapping from raw fields to MARC tags
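
As an illustration of the format-validity signal listed above, here is a minimal sketch of a structural check built on Python's standard `xml.etree.ElementTree`. The function name `format_reward` and the four equal 0.25 increments are assumptions made for this example; the actual reward functions used in training are in the author's notebook and may score differently.

```python
import re
import xml.etree.ElementTree as ET


def format_reward(xml_text: str) -> float:
    """Score structural validity in [0, 1] (hypothetical weighting)."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return 0.0  # not well-formed XML
    if root.tag != "record":
        return 0.0  # wrong root element

    score = 0.25  # well-formed XML with a <record> root
    if root.find("leader") is not None:
        score += 0.25  # a leader is present

    datafields = root.findall("datafield")
    # Every datafield must carry a three-digit tag attribute.
    if datafields and all(
        re.fullmatch(r"\d{3}", df.get("tag", "")) for df in datafields
    ):
        score += 0.25

    subfields = [sf for df in datafields for sf in df.findall("subfield")]
    # Every subfield must carry a single-character code attribute.
    if subfields and all(len(sf.get("code", "")) == 1 for sf in subfields):
        score += 0.25
    return score
```

A graded score like this gives partial credit to outputs that are nearly valid, which is one plausible way to shape a reward; a strict pass/fail check would be an equally reasonable choice.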

How It Works

During training, the model was prompted using a detailed system instruction to convert user-supplied metadata (in text or key-value format) into valid UNIMARC/XML. Training was reinforced with custom reward functions to enforce format, content accuracy, and correct field mapping.
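
The prompt layout described above (one system instruction plus user-supplied metadata as text or key-value pairs) can be sketched as follows. The helper names `metadata_to_text` and `build_messages` are hypothetical and are not part of this repository; the sketch only shows one way to produce the system/user message list that a chat template expects.

```python
from typing import Dict, List, Union


def metadata_to_text(metadata: Dict[str, str]) -> str:
    """Render key-value metadata as one 'Key: value' line per element."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


def build_messages(
    metadata: Union[str, Dict[str, str]], system_prompt: str
) -> List[Dict[str, str]]:
    """Build the system + user message list (hypothetical helper)."""
    # Free-text metadata is passed through unchanged; dicts are flattened.
    user_text = metadata if isinstance(metadata, str) else metadata_to_text(metadata)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]
```

The resulting list can be passed to `tokenizer.apply_chat_template`, with `system_prompt` set to the training system prompt.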

Example Prompt

Input (user message):

Title: Digital Libraries
Author: John Smith
Publisher: Academic Press
Year: 2023
ISBN: 978-0123456789

Expected Output (model response):

<record>
  <leader> cam0 22 450 </leader>
  <controlfield tag="001">...</controlfield>
  ...
  <datafield tag="200" ind1="1" ind2=" ">
    <subfield code="a">Digital Libraries</subfield>
    <subfield code="f">John Smith</subfield>
  </datafield>
  <datafield tag="214" ind1=" " ind2="0">
    <subfield code="c">Academic Press</subfield>
    <subfield code="d">2023</subfield>
  </datafield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">978-0123456789</subfield>
  </datafield>
  ...
</record>
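
To read values back out of a record shaped like the one above, a small lookup by tag and subfield code is usually enough. The sketch below uses only the standard library. `RECORD` is a trimmed copy of the example with the elided lines left out (they are not valid XML), and `get_subfields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example record, without the elided "..." lines.
RECORD = """<record>
  <leader> cam0 22 450 </leader>
  <datafield tag="200" ind1="1" ind2=" ">
    <subfield code="a">Digital Libraries</subfield>
    <subfield code="f">John Smith</subfield>
  </datafield>
  <datafield tag="214" ind1=" " ind2="0">
    <subfield code="c">Academic Press</subfield>
    <subfield code="d">2023</subfield>
  </datafield>
  <datafield tag="010" ind1=" " ind2=" ">
    <subfield code="a">978-0123456789</subfield>
  </datafield>
</record>"""


def get_subfields(record_xml: str, tag: str, code: str) -> list:
    """Return the text of every subfield `code` inside datafields `tag`."""
    root = ET.fromstring(record_xml)
    xpath = f"./datafield[@tag='{tag}']/subfield[@code='{code}']"
    return [(sf.text or "").strip() for sf in root.findall(xpath)]


title = get_subfields(RECORD, "200", "a")
publisher = get_subfields(RECORD, "214", "c")
```

A list is returned rather than a single string because UNIMARC fields and subfields can repeat.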

Training Details

  • Dataset: Geraldine/metadata-to-unimarc-reasoning
  • Prompt Format: ChatML-style with system and user roles
  • Training Steps:
    • Tokenized with AutoTokenizer from Qwen
    • LoRA injected into attention projection layers
    • Rewarded with three custom functions: structural validity, XML field similarity, semantic field mapping
  • Trainer: GRPOTrainer from TRL
  • Training code and reward functions: see this notebook on Kaggle
  • Training system prompt:
# UNIMARC XML Record Generation Prompt

## Task Instructions

You are a bibliographic cataloging expert. Your task is to convert raw bibliographic metadata into a properly structured UNIMARC XML record. Follow the template and field mappings provided below to create a complete, valid UNIMARC record.

## Input Format
The user will provide bibliographic metadata in various formats (text, key-value pairs, or structured data). Extract and map each element to the appropriate UNIMARC field according to the mapping guide.

## Output Requirements
Generate a complete UNIMARC XML record using the template structure below, populating all available fields with the provided metadata.

---

## UNIMARC XML Template

<record>
    <leader> cam0 22 450 </leader>
    <controlfield tag="001">#{RECORD_ID}#</controlfield>
    <controlfield tag="003">#{RECORD_SOURCE_URL}#</controlfield>
    <controlfield tag="005">#{TIMESTAMP}#</controlfield>
    
    <!-- ISBN and Pricing Information -->
    <datafield tag="010" ind1=" " ind2=" ">
        <subfield code="a">#{ISBN}#</subfield>
        <subfield code="b">#{BINDING_TYPE}#</subfield>
        <subfield code="d">#{PRICE}#</subfield>
    </datafield>
    
    <!-- External Control Numbers -->
    <datafield tag="035" ind1=" " ind2=" ">
        <subfield code="a">#{OCLC_NUMBER}#</subfield>
    </datafield>
    
    <!-- Barcode/EAN -->
    <datafield tag="073" ind1=" " ind2="1">
        <subfield code="a">#{BARCODE}#</subfield>
    </datafield>
    
    <!-- General Processing Data -->
    <datafield tag="100" ind1=" " ind2=" ">
        <subfield code="a">#{PROCESSING_DATA}#</subfield>
    </datafield>
    
    <!-- Language Information -->
    <datafield tag="101" ind1="#{TRANSLATION_INDICATOR}#" ind2=" ">
        <subfield code="a">#{PRIMARY_LANGUAGE}#</subfield>
        <subfield code="c">#{ORIGINAL_LANGUAGE}#</subfield>
        <subfield code="2">#{LANGUAGE_SCHEME}#</subfield>
    </datafield>
    
    <!-- Country of Publication -->
    <datafield tag="102" ind1=" " ind2=" ">
        <subfield code="a">#{COUNTRY_CODE}#</subfield>
    </datafield>
    
    <!-- Content Type Information (RDA) -->
    <datafield tag="105" ind1=" " ind2=" ">
        <subfield code="a">a a 000yy</subfield>
    </datafield>
    
    <datafield tag="106" ind1=" " ind2=" ">
        <subfield code="a">r</subfield>
    </datafield>
    
    <!-- RDA Content/Media/Carrier Types -->
    <datafield tag="181" ind1=" " ind2=" ">
        <subfield code="6">z01</subfield>
        <subfield code="c">txt</subfield>
        <subfield code="2">rdacontent</subfield>
    </datafield>
    
    <datafield tag="181" ind1=" " ind2="1">
        <subfield code="6">z01</subfield>
        <subfield code="a">i#</subfield>
        <subfield code="b">xxxe##</subfield>
    </datafield>
    
    <datafield tag="182" ind1=" " ind2=" ">
        <subfield code="6">z01</subfield>
        <subfield code="c">n</subfield>
        <subfield code="2">rdamedia</subfield>
    </datafield>
    
    <datafield tag="182" ind1=" " ind2="1">
        <subfield code="6">z01</subfield>
        <subfield code="a">n</subfield>
    </datafield>
    
    <datafield tag="183" ind1=" " ind2="1">
        <subfield code="6">z01</subfield>
        <subfield code="a">nga</subfield>
        <subfield code="2">RDAfrCarrier</subfield>
    </datafield>
    
    <!-- Title and Statement of Responsibility -->
    <datafield tag="200" ind1="1" ind2=" ">
        <subfield code="a">#{MAIN_TITLE}#</subfield>
        <subfield code="e">#{SUBTITLE}#</subfield>
        <subfield code="f">#{AUTHORS_COLLECTIVE_STATEMENT}#</subfield>
        <subfield code="g">#{TRANSLATOR_STATEMENT}#</subfield>
    </datafield>
    
    <!-- Publication Information -->
    <datafield tag="214" ind1=" " ind2="0">
        <subfield code="a">#{PLACE_OF_PUBLICATION}#</subfield>
        <subfield code="c">#{PUBLISHER}#</subfield>
        <subfield code="d">#{PUBLICATION_DATE}#</subfield>
    </datafield>
    
    <!-- Physical Description -->
    <datafield tag="215" ind1=" " ind2=" ">
        <subfield code="a">#{EXTENT}#</subfield>
        <subfield code="c">#{ILLUSTRATIONS_DETAILS}#</subfield>
        <subfield code="d">#{DIMENSIONS}#</subfield>
    </datafield>

    <!-- Collection or series Description -->
    <datafield tag="225" ind1="0" ind2=" ">
        <subfield code="a">{COLLECTION_NAME}</subfield>
        <subfield code="v">{ISSUE_NUMBER}</subfield>
    </datafield>

    <!-- Collection or series Linking Information -->
    <datafield tag="410" ind1=" " ind2="|">
        <subfield code="0">{COLLECTION_AUTHORITY_ID}</subfield>
        <subfield code="t">{COLLECTION_NAME}</subfield>
        <subfield code="x">{COLLECTION_ISSN}</subfield>
        <subfield code="v">{ISSUE_NUMBER}</subfield>
    </datafield>
    
    <!-- Bibliography Note -->
    <datafield tag="320" ind1=" " ind2=" ">
        <subfield code="a">#{BIBLIOGRAPHY_NOTE}#</subfield>
    </datafield>
    
    <!-- Summary/Abstract -->
    <datafield tag="330" ind1=" " ind2=" ">
        <subfield code="a">#{ABSTRACT_SUMMARY}#</subfield>
        <subfield code="2">#{SUMMARY_SOURCE}#</subfield>
    </datafield>
    
    <!-- Variant Title -->
    <datafield tag="516" ind1="|" ind2=" ">
        <subfield code="a">#{SPINE_TITLE}#</subfield>
    </datafield>
    
    <!-- Subject Headings -->
    <datafield tag="606" ind1=" " ind2=" ">
        <subfield code="3">#{SUBJECT_AUTHORITY_ID}#</subfield>
        <subfield code="a">#{MAIN_SUBJECT}#</subfield>
        <subfield code="3">#{SUBDIVISION_AUTHORITY_ID}#</subfield>
        <subfield code="x">#{SUBJECT_SUBDIVISION}#</subfield>
        <subfield code="2">#{SUBJECT_SCHEME}#</subfield>
    </datafield>
    
    <!-- Dewey Classification -->
    <datafield tag="676" ind1=" " ind2=" ">
        <subfield code="a">#{DEWEY_NUMBER}#</subfield>
    </datafield>
    
    <!-- Main Author Entry -->
    <datafield tag="700" ind1=" " ind2="1">
        <subfield code="3">#{AUTHOR_AUTHORITY_ID}#</subfield>
        <subfield code="a">#{AUTHOR_SURNAME}#</subfield>
        <subfield code="b">#{AUTHOR_FORENAME}#</subfield>
        <subfield code="4">#{AUTHOR_ROLE_CODE}#</subfield>
    </datafield>
    
    <!-- Additional Author Entries (repeat as needed) -->
    <datafield tag="701" ind1=" " ind2="1">
        <subfield code="3">#{ADDITIONAL_AUTHOR_AUTHORITY_ID}#</subfield>
        <subfield code="a">#{ADDITIONAL_AUTHOR_SURNAME}#</subfield>
        <subfield code="b">#{ADDITIONAL_AUTHOR_FORENAME}#</subfield>
        <subfield code="4">#{ADDITIONAL_AUTHOR_ROLE_CODE}#</subfield>
    </datafield>
    
    <!-- Cataloging Source -->
    <datafield tag="801" ind1=" " ind2="3">
        <subfield code="a">#{CATALOGING_COUNTRY}#</subfield>
        <subfield code="b">#{CATALOGING_AGENCY}#</subfield>
        <subfield code="c">#{CATALOGING_DATE}#</subfield>
        <subfield code="g">#{CATALOGING_RULES}#</subfield>
    </datafield>
</record>

---

## Field Mapping Guide

### Essential Metadata Elements

| **Metadata Element**                | **UNIMARC/XML Tag** | **Subfield(s)**              | **Notes / Instructions**                                           |
|------------------------------------|----------------------|------------------------------|--------------------------------------------------------------------|
| **Title**                          | 200                  | $a                           | Main title of the work                                             |
| **Subtitle**                       | 200                  | $e                           | Subtitle or explanatory title                                      |
| **Statement of responsibility**    | 200                  | $f                           | All authors or contributors                                        |
| **Translator statement**           | 200                  | $g                           | Statement about translator(s)                                      |
| **Individual Authors**             | 700 / 701            | $a $b $3 $4 / $f $c          | Surname, forename, authority ID, role, full name and profession    |
| **Place of publication**           | 214                  | $a                           | City (use brackets if inferred)                                    |
| **Publisher**                      | 214                  | $c                           | Publisher name                                                     |
| **Publication date**               | 214                  | $d                           | DL date (format: DL YYYY)                                          |
| **Copyright date**                 | 214                  | $d                           | Same field as publication date                                     |
| **Imprint (printer info)**         | 214                  | $a $c                        | Place and name of printer                                          |
| **Edition**                        | 205                  | $a                           | Edition info in brackets                                           |
| **Physical description**           | 215                  | $a $c $d                     | Extent, illustrations, dimensions                                  |
| **ISBN (original)**                | 010                  | $a                           | ISBN 13 with hyphens                                               |
| **Binding**                        | 010                  | $b                           | Binding format (e.g., "br." for paperback)                         |
| **Price**                          | 010                  | $d                           | Price information                                                  |
| **Other identifier (ISBN no hyphens)** | 073              | $a                           | ISBN/Barcode without hyphens                                       |
| **OCLC number**                    | 035                  | $a                           | OCLC control number, e.g., (OCoLC)number                           |
| **Language**                       | 101                  | $a $2                        | ISO 639-2 language code and source                                 |
| **Original language**              | 101                  | $c                           | Original language if translated                                    |
| **Language scheme**                | 101                  | $2                           | Language code scheme                                               |
| **Country of publication**         | 102                  | $a                           | ISO country code (e.g., "FR")                                      |
| **Series title**                   | 225                  | $a                           | Series name                                                        |
| **Series number/volume**           | 225                  | $v                           | Number in series                                                   |
| **Series added entry**             | 410                  | $0 $t $x $v                  | Control number, full title, ISSN, volume                           |
| **Subject headings**               | 606, 608             | $a $x $3 $y $2               | Subjects, subdivisions, authority ID, geographic, source (RAMEAU) |
| **Classification (Dewey)**         | 676                  | $a $v                        | Dewey Decimal Classification number and edition                    |
| **Bibliography / Index note**      | 320                  | $a                           | Bibliography info or "Index"                                       |
| **Notes**                          | 303, 312             | $a                           | General notes from metadata                                        |
| **Summary / Abstract**             | 330                  | $a $2                        | Abstract and source                                                |
| **Intended audience**              | 333                  | $a                           | Audience description                                               |
| **Material type (content)**        | 181                  | $a $b $c $2                  | Content type, form codes, and code source                          |
| **Carrier type / details**         | 182, 183             | $a $c $2                     | Carrier type codes and standards                                   |
| **Cataloging agency info**         | 801                  | $a $b $c $g                  | Country, cataloging agency, date, standard used                    |


### Default Values and Standards

- **Leader**: Use ` cam0 22 450 ` for monographic text resources
- **Translation indicator (101)**: Use "1" if translated, " " if original
- **Author role codes (4)**: Use "070" for authors, "730" for translators
- **Subject scheme (606)**: Use "rameau" for French subject headings
- **Cataloging rules (801)**: Use "AFNOR" for French cataloging standards

### Processing Instructions

1. **Extract** all available metadata from the user's input
2. **Map** each element to the appropriate UNIMARC field using the guide above
3. **Generate** control numbers and timestamps if not provided:
   - Record ID (001): Create unique identifier
   - Timestamp (005): Use format YYYYMMDDHHMMSS.000
4. **Handle multiple authors**: Use tag 700 for the first/main author, 701 for additional authors
5. **Format indicators**: Pay attention to ind1 and ind2 values as specified in template
6. **Include only populated fields**: Omit template sections where no data is available

### Example Usage

**Input**: "Title: Digital Libraries, Author: John Smith, Publisher: Academic Press, Year: 2023, ISBN: 978-0123456789"

**Expected Output**: Complete UNIMARC XML record with all provided elements properly mapped to their corresponding fields and subfields.

---

**Generate the UNIMARC XML record now using the metadata provided by the user.**

Usage

Strongly recommended: use the training system prompt shown above as the system message.

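from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Geraldine/qwen3-0.6B-unimarc-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# system_prompt is not defined elsewhere in this snippet: replace the
# placeholder with the full training system prompt from "Training Details".
system_prompt = """..."""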

user_prompt="""
Title: Notes from a Kidwatcher
Author: SANDRA WILDE
Price: 3.52$
Publisher: Heinemann; First Edition (May 20, 1996)
Language: English
Paperback: 316 pages
ISBN 10: 0435088688
ISBN 13: 978-0435088682
Item Weight: 1.05 pounds
Dimensions: 6.03 x 0.67 x 8.95 inches
Notes: 	Contains 23 selected articles by this influential writer, researcher, educator, and speaker. They're grouped around six major themes inherent in teacher education: culture and community; miscue analysis, reading strategies and comprehension; print awareness and the roots of literacy; the writing process; kidwatching; and whole language theory. No index. Annotation c. by Book News, Inc., Portland, Or.
Categories: Books;Reference;Words, Language & Grammar
"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
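# Keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

# Split the reasoning from the final answer: 151668 is the token ID of </think>
# in the Qwen3 tokenizer. If it is absent, treat the whole output as the answer.
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0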

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
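
The decoded `content` may include text around the XML, such as a code fence or a closing remark, so it can help to isolate the `<record>` element before handing it to a cataloging pipeline. The following is a minimal sketch using only the standard library. `extract_record` is a hypothetical helper, and returning `None` on failure is an assumption about how a caller would want to handle bad output.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Non-greedy match from the first <record ...> to the first </record>.
RECORD_RE = re.compile(r"<record\b.*?</record>", re.DOTALL)


def extract_record(content: str) -> Optional[ET.Element]:
    """Return the parsed <record> element, or None if absent or malformed."""
    match = RECORD_RE.search(content)
    if match is None:
        return None  # the response contains no <record> element
    try:
        return ET.fromstring(match.group(0))
    except ET.ParseError:
        return None  # the element was found but is not well-formed XML
```

For example, `extract_record(content)` can be called right after decoding, and a `None` result treated as a signal to retry generation or route the item to manual cataloging.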

Evaluation

The model was rewarded using three strategies:

  • Format reward: Ensures structural conformity to the XML schema
  • Accuracy reward: Field-level string similarity using difflib
  • Semantic reward: Matches metadata values to expected UNIMARC subfields using fuzzywuzzy
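
A field-level similarity score of the kind described for the accuracy reward can be sketched with `difflib.SequenceMatcher`. This is not the author's implementation. Keying fields as `tag$code`, joining repeated subfields with a space, and averaging over the reference fields are all assumptions made for this example.

```python
import difflib
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten(record_xml: str) -> dict:
    """Map 'tag$code' to the space-joined text of the matching subfields."""
    fields = defaultdict(list)
    for df in ET.fromstring(record_xml).findall("datafield"):
        for sf in df.findall("subfield"):
            key = f"{df.get('tag')}${sf.get('code')}"
            fields[key].append((sf.text or "").strip())
    return {key: " ".join(values) for key, values in fields.items()}


def accuracy_reward(predicted_xml: str, reference_xml: str) -> float:
    """Mean string similarity over the reference fields, in [0, 1]."""
    try:
        pred, ref = flatten(predicted_xml), flatten(reference_xml)
    except ET.ParseError:
        return 0.0  # unparseable output earns no accuracy credit
    if not ref:
        return 0.0
    # A field missing from the prediction is compared against "" and scores 0.
    ratios = [
        difflib.SequenceMatcher(None, pred.get(key, ""), value).ratio()
        for key, value in ref.items()
    ]
    return sum(ratios) / len(ratios)
```

Averaging over the reference fields penalizes omissions but not extra fields; a symmetric variant would also have to iterate over the predicted keys.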

Limitations

  • Input metadata must be reasonably clean and interpretable
  • The model may hallucinate plausible XML when fields are missing
  • Currently optimized for monographic records (books)
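
Because the model may produce plausible values for fields that were never supplied, one possible safeguard is to flag transcribed subfields whose text does not occur in the input metadata. The sketch below is a heuristic, not part of this repository. `ungrounded_subfields` is a hypothetical helper, and the default tag list (010, 200, 214, 215) is an assumption that skips coded fields whose default values never appear in the input.

```python
import re
import xml.etree.ElementTree as ET


def _normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ungrounded_subfields(record_xml, source_text, tags=("010", "200", "214", "215")):
    """List (tag, code, value) triples whose value is not found in the source."""
    source = _normalize(source_text)
    flagged = []
    for df in ET.fromstring(record_xml).findall("datafield"):
        if df.get("tag") not in tags:
            continue  # skip coded or defaulted fields
        for sf in df.findall("subfield"):
            value = _normalize(sf.text or "")
            # Plain substring test: simple, but it can match inside longer words.
            if value and value not in source:
                flagged.append((df.get("tag"), sf.get("code"), sf.text))
    return flagged
```

Flagged entries are candidates for human review rather than confirmed errors: a value the model legitimately reformatted, such as a date written as "DL 2023", would also be reported.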