Qwen3-0.6B UNIMARC/XML Generator (Fine-tuned with GRPO + LoRA)
This repository provides a fine-tuned version of Qwen/Qwen3-0.6B, trained using GRPO (Group Relative Policy Optimization) and LoRA adapters to transform raw bibliographic metadata into structured UNIMARC XML records.
Unlike typical text-to-XML generation models, this model is optimized for reasoning and interpretability: it uses Chain-of-Thought prompting to work through each cataloging step before composing the final UNIMARC output, with the goal of both semantic alignment and structural validity.
Use Case
Automatically generate UNIMARC/XML records from unstructured bibliographic metadata. Useful for libraries, cataloging systems, digital archiving, and metadata enrichment pipelines.
Model Details
- Base Model: Qwen/Qwen3-0.6B
- Training Framework: 🤗 Transformers + TRL (GRPO)
- Parameter-Efficient Fine-Tuning: LoRA adapters (r=8)
- Training Objective: Structured XML generation guided by domain-specific prompts and multi-criteria reward functions
- Reward Signals:
  - Format validity (<record> structure, fields, subfields)
  - Field-level accuracy using XML diffing
  - Semantic mapping from raw fields to MARC tags
How It Works
During training, the model was prompted using a detailed system instruction to convert user-supplied metadata (in text or key-value format) into valid UNIMARC/XML. Training was reinforced with custom reward functions to enforce format, content accuracy, and correct field mapping.
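To make the idea of a format reward concrete, here is a minimal sketch of a structural-validity score. It is not the training code (that lives in the Kaggle notebook); the function name `format_reward`, the four checks, and their equal weighting are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def format_reward(xml_text: str) -> float:
    """Hypothetical scorer: fraction of structural checks passed, in [0, 1]."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError:
        return 0.0  # not well-formed XML at all
    if root.tag != "record":
        return 0.0  # wrong root element

    datafields = root.findall("datafield")
    checks = [
        # A leader element is present.
        root.find("leader") is not None,
        # At least one datafield is present.
        len(datafields) > 0,
        # Every datafield has a 3-digit tag and both indicators.
        all(
            (df.get("tag") or "").isdigit()
            and len(df.get("tag")) == 3
            and df.get("ind1") is not None
            and df.get("ind2") is not None
            for df in datafields
        ),
        # Every datafield has subfields, each with a non-empty code.
        all(
            len(df.findall("subfield")) > 0
            and all(sf.get("code") for sf in df.findall("subfield"))
            for df in datafields
        ),
    ]
    return sum(checks) / len(checks)
```

A graded score like this, rather than a pass/fail flag, gives partial credit to outputs that are nearly valid, which is one plausible way to shape a reinforcement signal.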
Example Prompt
Input (user message):
Title: Digital Libraries
Author: John Smith
Publisher: Academic Press
Year: 2023
ISBN: 978-0123456789
Expected Output (model response):
<record>
<leader> cam0 22 450 </leader>
<controlfield tag="001">...</controlfield>
...
<datafield tag="200" ind1="1" ind2=" ">
<subfield code="a">Digital Libraries</subfield>
<subfield code="f">John Smith</subfield>
</datafield>
<datafield tag="214" ind1=" " ind2="0">
<subfield code="c">Academic Press</subfield>
<subfield code="d">2023</subfield>
</datafield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a">978-0123456789</subfield>
</datafield>
...
</record>
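For downstream use, a record like the one above can be flattened into a lookup keyed by tag and subfield code. The sketch below uses only the Python standard library; the helper name `flatten_record` and the `tag$code` key format are illustrative choices, and the sample omits the elided parts of the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten_record(xml_text: str) -> dict:
    """Map 'tag$code' keys (e.g. '200$a') to the list of subfield values."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for df in root.iter("datafield"):
        for sf in df.iter("subfield"):
            key = f"{df.get('tag')}${sf.get('code')}"
            fields[key].append((sf.text or "").strip())
    return dict(fields)


sample = """<record>
<leader> cam0 22 450 </leader>
<datafield tag="200" ind1="1" ind2=" ">
<subfield code="a">Digital Libraries</subfield>
<subfield code="f">John Smith</subfield>
</datafield>
<datafield tag="214" ind1=" " ind2="0">
<subfield code="c">Academic Press</subfield>
<subfield code="d">2023</subfield>
</datafield>
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a">978-0123456789</subfield>
</datafield>
</record>"""

fields = flatten_record(sample)
print(fields["200$a"])  # ['Digital Libraries']
print(fields["214$d"])  # ['2023']
```

Values are kept as lists because UNIMARC fields and subfields can repeat (for example, several 701 entries for additional authors).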
Training Details
- Dataset: Geraldine/metadata-to-unimarc-reasoning
- Prompt Format: ChatML-style with system and user roles
- Training Steps:
  - Tokenized with AutoTokenizer from Qwen
  - LoRA injected into attention projection layers
  - Rewarded with three custom functions: structural validity, XML field similarity, semantic field mapping
- Trainer: GRPOTrainer from TRL
- Training code and reward functions: see this notebook on Kaggle
- Training system prompt:
# UNIMARC XML Record Generation Prompt
## Task Instructions
You are a bibliographic cataloging expert. Your task is to convert raw bibliographic metadata into a properly structured UNIMARC XML record. Follow the template and field mappings provided below to create a complete, valid UNIMARC record.
## Input Format
The user will provide bibliographic metadata in various formats (text, key-value pairs, or structured data). Extract and map each element to the appropriate UNIMARC field according to the mapping guide.
## Output Requirements
Generate a complete UNIMARC XML record using the template structure below, populating all available fields with the provided metadata.
---
## UNIMARC XML Template
<record>
<leader> cam0 22 450 </leader>
<controlfield tag="001">#{RECORD_ID}#</controlfield>
<controlfield tag="003">#{RECORD_SOURCE_URL}#</controlfield>
<controlfield tag="005">#{TIMESTAMP}#</controlfield>
<!-- ISBN and Pricing Information -->
<datafield tag="010" ind1=" " ind2=" ">
<subfield code="a">#{ISBN}#</subfield>
<subfield code="b">#{BINDING_TYPE}#</subfield>
<subfield code="d">#{PRICE}#</subfield>
</datafield>
<!-- External Control Numbers -->
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="a">#{OCLC_NUMBER}#</subfield>
</datafield>
<!-- Barcode/EAN -->
<datafield tag="073" ind1=" " ind2="1">
<subfield code="a">#{BARCODE}#</subfield>
</datafield>
<!-- General Processing Data -->
<datafield tag="100" ind1=" " ind2=" ">
<subfield code="a">#{PROCESSING_DATA}#</subfield>
</datafield>
<!-- Language Information -->
<datafield tag="101" ind1="#{TRANSLATION_INDICATOR}#" ind2=" ">
<subfield code="a">#{PRIMARY_LANGUAGE}#</subfield>
<subfield code="c">#{ORIGINAL_LANGUAGE}#</subfield>
<subfield code="2">#{LANGUAGE_SCHEME}#</subfield>
</datafield>
<!-- Country of Publication -->
<datafield tag="102" ind1=" " ind2=" ">
<subfield code="a">#{COUNTRY_CODE}#</subfield>
</datafield>
<!-- Content Type Information (RDA) -->
<datafield tag="105" ind1=" " ind2=" ">
<subfield code="a">a a 000yy</subfield>
</datafield>
<datafield tag="106" ind1=" " ind2=" ">
<subfield code="a">r</subfield>
</datafield>
<!-- RDA Content/Media/Carrier Types -->
<datafield tag="181" ind1=" " ind2=" ">
<subfield code="6">z01</subfield>
<subfield code="c">txt</subfield>
<subfield code="2">rdacontent</subfield>
</datafield>
<datafield tag="181" ind1=" " ind2="1">
<subfield code="6">z01</subfield>
<subfield code="a">i#</subfield>
<subfield code="b">xxxe##</subfield>
</datafield>
<datafield tag="182" ind1=" " ind2=" ">
<subfield code="6">z01</subfield>
<subfield code="c">n</subfield>
<subfield code="2">rdamedia</subfield>
</datafield>
<datafield tag="182" ind1=" " ind2="1">
<subfield code="6">z01</subfield>
<subfield code="a">n</subfield>
</datafield>
<datafield tag="183" ind1=" " ind2="1">
<subfield code="6">z01</subfield>
<subfield code="a">nga</subfield>
<subfield code="2">RDAfrCarrier</subfield>
</datafield>
<!-- Title and Statement of Responsibility -->
<datafield tag="200" ind1="1" ind2=" ">
<subfield code="a">#{MAIN_TITLE}#</subfield>
<subfield code="e">#{SUBTITLE}#</subfield>
<subfield code="f">#{AUTHORS_COLLECTIVE_STATEMENT}#</subfield>
<subfield code="g">#{TRANSLATOR_STATEMENT}#</subfield>
</datafield>
<!-- Publication Information -->
<datafield tag="214" ind1=" " ind2="0">
<subfield code="a">#{PLACE_OF_PUBLICATION}#</subfield>
<subfield code="c">#{PUBLISHER}#</subfield>
<subfield code="d">#{PUBLICATION_DATE}#</subfield>
</datafield>
<!-- Physical Description -->
<datafield tag="215" ind1=" " ind2=" ">
<subfield code="a">#{EXTENT}#</subfield>
<subfield code="c">#{ILLUSTRATIONS_DETAILS}#</subfield>
<subfield code="d">#{DIMENSIONS}#</subfield>
</datafield>
<!-- Collection or series Description -->
<datafield tag="225" ind1="0" ind2=" ">
<subfield code="a">{COLLECTION_NAME}</subfield>
<subfield code="v">{ISSUE_NUMBER}</subfield>
</datafield>
<!-- Collection or series Linking Information -->
<datafield tag="410" ind1=" " ind2="|">
<subfield code="0">{COLLECTION_AUTHORITY_ID}</subfield>
<subfield code="t">{COLLECTION_NAME}</subfield>
<subfield code="x">{COLLECTION_ISSN}</subfield>
<subfield code="v">{ISSUE_NUMBER}</subfield>
</datafield>
<!-- Bibliography Note -->
<datafield tag="320" ind1=" " ind2=" ">
<subfield code="a">#{BIBLIOGRAPHY_NOTE}#</subfield>
</datafield>
<!-- Summary/Abstract -->
<datafield tag="330" ind1=" " ind2=" ">
<subfield code="a">#{ABSTRACT_SUMMARY}#</subfield>
<subfield code="2">#{SUMMARY_SOURCE}#</subfield>
</datafield>
<!-- Variant Title -->
<datafield tag="516" ind1="|" ind2=" ">
<subfield code="a">#{SPINE_TITLE}#</subfield>
</datafield>
<!-- Subject Headings -->
<datafield tag="606" ind1=" " ind2=" ">
<subfield code="3">#{SUBJECT_AUTHORITY_ID}#</subfield>
<subfield code="a">#{MAIN_SUBJECT}#</subfield>
<subfield code="3">#{SUBDIVISION_AUTHORITY_ID}#</subfield>
<subfield code="x">#{SUBJECT_SUBDIVISION}#</subfield>
<subfield code="2">#{SUBJECT_SCHEME}#</subfield>
</datafield>
<!-- Dewey Classification -->
<datafield tag="676" ind1=" " ind2=" ">
<subfield code="a">#{DEWEY_NUMBER}#</subfield>
</datafield>
<!-- Main Author Entry -->
<datafield tag="700" ind1=" " ind2="1">
<subfield code="3">#{AUTHOR_AUTHORITY_ID}#</subfield>
<subfield code="a">#{AUTHOR_SURNAME}#</subfield>
<subfield code="b">#{AUTHOR_FORENAME}#</subfield>
<subfield code="4">#{AUTHOR_ROLE_CODE}#</subfield>
</datafield>
<!-- Additional Author Entries (repeat as needed) -->
<datafield tag="701" ind1=" " ind2="1">
<subfield code="3">#{ADDITIONAL_AUTHOR_AUTHORITY_ID}#</subfield>
<subfield code="a">#{ADDITIONAL_AUTHOR_SURNAME}#</subfield>
<subfield code="b">#{ADDITIONAL_AUTHOR_FORENAME}#</subfield>
<subfield code="4">#{ADDITIONAL_AUTHOR_ROLE_CODE}#</subfield>
</datafield>
<!-- Cataloging Source -->
<datafield tag="801" ind1=" " ind2="3">
<subfield code="a">#{CATALOGING_COUNTRY}#</subfield>
<subfield code="b">#{CATALOGING_AGENCY}#</subfield>
<subfield code="c">#{CATALOGING_DATE}#</subfield>
<subfield code="g">#{CATALOGING_RULES}#</subfield>
</datafield>
</record>
---
## Field Mapping Guide
### Essential Metadata Elements
| **Metadata Element** | **UNIMARC/XML Tag** | **Subfield(s)** | **Notes / Instructions** |
|------------------------------------|----------------------|------------------------------|--------------------------------------------------------------------|
| **Title** | 200 | $a | Main title of the work |
| **Subtitle** | 200 | $e | Subtitle or explanatory title |
| **Statement of responsibility** | 200 | $f | All authors or contributors |
| **Translator statement** | 200 | $g | Statement about translator(s) |
| **Individual Authors** | 700 / 701 | $a $b $3 $4 / $f $c | Surname, forename, authority ID, role, full name and profession |
| **Place of publication** | 214 | $a | City (use brackets if inferred) |
| **Publisher** | 214 | $c | Publisher name |
| **Publication date** | 214 | $d | DL date (format: DL YYYY) |
| **Copyright date** | 214 | $d | Same field as publication date |
| **Imprint (printer info)** | 214 | $a $c | Place and name of printer |
| **Edition** | 205 | $a | Edition info in brackets |
| **Physical description** | 215 | $a $c $d | Extent, illustrations, dimensions |
| **ISBN (original)** | 010 | $a | ISBN 13 with hyphens |
| **Binding** | 010 | $b | Binding format (e.g., "br." for paperback) |
| **Price** | 010 | $d | Price information |
| **Other identifier (ISBN no hyphens)** | 073 | $a | ISBN/Barcode without hyphens |
| **OCLC number** | 035 | $a | OCLC control number, e.g., (OCoLC)number |
| **Language** | 101 | $a $2 | ISO 639-2 language code and source |
| **Original language** | 101 | $c | Original language if translated |
| **Language scheme** | 101 | $2 | Language code scheme |
| **Country of publication** | 102 | $a | ISO country code (e.g., "FR") |
| **Series title** | 225 | $a | Series name |
| **Series number/volume** | 225 | $v | Number in series |
| **Series added entry** | 410 | $0 $t $x $v | Control number, full title, ISSN, volume |
| **Subject headings** | 606, 608 | $a $x $3 $y $2 | Subjects, subdivisions, authority ID, geographic, source (RAMEAU) |
| **Classification (Dewey)** | 676 | $a $v | Dewey Decimal Classification number and edition |
| **Bibliography / Index note** | 320 | $a | Bibliography info or "Index" |
| **Notes** | 303, 312 | $a | General notes from metadata |
| **Summary / Abstract** | 330 | $a $2 | Abstract and source |
| **Intended audience** | 333 | $a | Audience description |
| **Material type (content)** | 181 | $a $b $c $2 | Content type, form codes, and code source |
| **Carrier type / details** | 182, 183 | $a $c $2 | Carrier type codes and standards |
| **Cataloging agency info** | 801 | $a $b $c $g | Country, cataloging agency, date, standard used |
### Default Values and Standards
- **Leader**: Use ` cam0 22 450 ` for monographic text resources
- **Translation indicator (101)**: Use "1" if translated, " " if original
- **Author role codes (4)**: Use "070" for authors, "730" for translators
- **Subject scheme (606)**: Use "rameau" for French subject headings
- **Cataloging rules (801)**: Use "AFNOR" for French cataloging standards
### Processing Instructions
1. **Extract** all available metadata from the user's input
2. **Map** each element to the appropriate UNIMARC field using the guide above
3. **Generate** control numbers and timestamps if not provided:
- Record ID (001): Create unique identifier
- Timestamp (005): Use format YYYYMMDDHHMMSS.000
4. **Handle multiple authors**: Use tag 700 for the first/main author, 701 for additional authors
5. **Format indicators**: Pay attention to ind1 and ind2 values as specified in template
6. **Include only populated fields**: Omit template sections where no data is available
### Example Usage
**Input**: "Title: Digital Libraries, Author: John Smith, Publisher: Academic Press, Year: 2023, ISBN: 978-0123456789"
**Expected Output**: Complete UNIMARC XML record with all provided elements properly mapped to their corresponding fields and subfields.
---
**Generate the UNIMARC XML record now using the metadata provided by the user.**
Usage
Strongly recommended: use the training system prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Paste the full training system prompt from the "Training Details" section here.
system_prompt = "..."

model_name = "Geraldine/qwen3-0.6B-unimarc-grpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
user_prompt="""
Title: Notes from a Kidwatcher
Author: SANDRA WILDE
Price: 3.52$
Publisher: Heinemann; First Edition (May 20, 1996)
Language: English
Paperback: 316 pages
ISBN 10: 0435088688
ISBN 13: 978-0435088682
Item Weight: 1.05 pounds
Dimensions: 6.03 x 0.67 x 8.95 inches
Notes: Contains 23 selected articles by this influential writer, researcher, educator, and speaker. They're grouped around six major themes inherent in teacher education: culture and community; miscue analysis, reading strategies and comprehension; print awareness and the roots of literacy; the writing process; kidwatching; and whole language theory. No index. Annotation c. by Book News, Inc., Portland, Or.
Categories: Books;Reference;Words, Language & Grammar
"""
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True
).to(model.device)
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    min_p=0,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)
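# Keep only the newly generated tokens (drop the prompt).
output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()
# Parse the thinking content: 151668 is the token id of </think> in the Qwen3 tokenizer.
try:
    # Index just past the last </think> token.
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No </think> token found: treat the whole output as the answer.
    index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content)
print("content:", content)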
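The decoded `content` string may carry extra text around the record, such as a short preamble or code fences. As a hedged post-processing sketch (the helper `extract_record` and the regular expression are illustrative and not part of the model's tooling), the record can be isolated and parsed like this:

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from the first <record ...> to the first </record>.
RECORD_RE = re.compile(r"<record\b.*?</record>", re.DOTALL)


def extract_record(generated_text: str):
    """Return the parsed <record> element, or None if absent or malformed."""
    match = RECORD_RE.search(generated_text)
    if match is None:
        return None
    try:
        return ET.fromstring(match.group(0))
    except ET.ParseError:
        return None


# Example with surrounding chatter, as a model might produce it.
raw = "Here is the record:\n<record><leader>x</leader></record>\nDone."
record = extract_record(raw)
print(record.tag if record is not None else "no valid record found")
```

Returning `None` instead of raising keeps batch pipelines simple: callers can count failures and retry or route those items to manual cataloging.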
Evaluation
The model was rewarded using three strategies:
- Format reward: Ensures structural conformity to the XML schema
- Accuracy reward: Field-level string similarity using difflib
- Semantic reward: Matches metadata values to expected UNIMARC subfields using fuzzywuzzy
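As an illustration of the accuracy reward, the sketch below compares a generated record with a reference record field by field using `difflib.SequenceMatcher`. It is an assumption-laden approximation, not the notebook's implementation: the helper names, the per-tag concatenation of subfield text, and the plain averaging are all hypothetical. The fuzzywuzzy-based semantic reward is not shown.

```python
import difflib
import xml.etree.ElementTree as ET


def field_texts(xml_text: str) -> dict:
    """Map each datafield tag to the concatenated text of its subfields."""
    root = ET.fromstring(xml_text)
    out = {}
    for df in root.iter("datafield"):
        text = " ".join((sf.text or "").strip() for sf in df.iter("subfield"))
        tag = df.get("tag")
        # Repeated tags (e.g. several 701 fields) are joined into one string.
        out[tag] = (out.get(tag, "") + " " + text).strip()
    return out


def accuracy_reward(generated_xml: str, reference_xml: str) -> float:
    """Average difflib similarity over the tags present in the reference."""
    try:
        gen = field_texts(generated_xml)
        ref = field_texts(reference_xml)
    except ET.ParseError:
        return 0.0  # unparseable output earns nothing
    if not ref:
        return 0.0
    scores = [
        difflib.SequenceMatcher(None, gen.get(tag, ""), text).ratio()
        for tag, text in ref.items()
    ]
    return sum(scores) / len(scores)
```

Scoring against the reference's tags means a missing field costs the model, while extra fields in the generated record are neither rewarded nor penalized in this simplified version.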
Limitations
- Input metadata must be reasonably clean and interpretable
- The model may hallucinate plausible XML when fields are missing
- Currently optimized for monographic records (books)