# YOLO-gen 11x-OBB: A Foundational Model for Codicological Layout Analysis
This repository contains the weights and configuration for YOLO-gen 11x-OBB, a generalist object detection model specialized for Document Layout Analysis (DLA) on a wide range of historical manuscripts.
Unlike models trained on a single corpus, YOLO-gen is the result of a novel data harmonization methodology. It was trained on a unified dataset created by merging three distinct and non-interoperable corpora of historical documents, using a sophisticated hierarchical ontology to reconcile their different annotation schemes.
This makes YOLO-gen a powerful foundational model, intended as a robust starting point for researchers and projects that need to perform layout analysis on diverse collections of Western manuscripts (ca. 12th-17th c.) without training a new model from scratch for each document type.
The model was developed by Sergio Torres Aguilar at the University of Luxembourg.
## Model Details
- Architecture: This model uses the YOLOv11x architecture with an Oriented Bounding Box (OBB) head, making it particularly effective at detecting rotated or non-rectangular layout elements common in manuscripts.
- Ontology: The model was trained on a hierarchical, multi-label ontology (V7) designed to be both codicologically meaningful and visually coherent. Each object in the training data was tagged with its full path in the hierarchy (e.g., a simple initial was tagged as `Initial`, `Initial_Manuscript`, and `Initial_Ms_Simple`). This provides a rich training signal and enables the model to recognize abstract concepts.
- Parent Classes: The model can identify high-level conceptual categories, a feature not present in specialist models. The main parent classes are `Text`, `Decoration`, `Initial`, `Marks`, `Damage`, and `Numbering`, plus the intermediate parent `Paratext`.
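The hierarchical tagging scheme can be illustrated with a small sketch. The ontology fragment below is a hypothetical reconstruction for illustration only; the full V7 ontology is described in the accompanying paper.

```python
# Hypothetical fragment of the hierarchy, mapping each class to its parent
# (None marks a top-level parent class). Illustrative, not the full V7 ontology.
ONTOLOGY = {
    "Initial_Ms_Simple": "Initial_Manuscript",
    "Initial_Manuscript": "Initial",
    "Initial": None,
}

def label_path(leaf: str, ontology: dict) -> list[str]:
    """Return the full ancestor chain for a leaf label, root first."""
    chain = []
    node = leaf
    while node is not None:
        chain.append(node)
        node = ontology.get(node)
    return list(reversed(chain))

print(label_path("Initial_Ms_Simple", ONTOLOGY))
# ['Initial', 'Initial_Manuscript', 'Initial_Ms_Simple']
```

Tagging each object with every label along this path is what lets the detector learn both the concrete subclass and its abstract parents.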
## Intended Uses & Limitations

### Intended Use
This model is intended for academic and research use as a strong baseline for Document Layout Analysis on historical manuscripts. It is particularly useful for:
- Projects working with diverse collections of manuscripts where training a specialist model for each type is not feasible.
- Initializing new DLA projects with a robust, pre-trained detector that understands fundamental codicological structures.
- Detecting high-level layout categories (e.g., finding all `Decoration` or all `Initial` elements on a page).
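Because the class names follow the `Parent_Sub` naming convention of the ontology, detections can be grouped under a parent class with simple string matching. A minimal sketch, assuming detections have already been collected as `(class_name, confidence)` pairs (the helper and sample data are illustrative):

```python
# Keep only detections whose class name is the given parent itself
# or a subclass of it (subclass names start with "Parent_").
def filter_by_parent(detections, parent):
    return [d for d in detections
            if d[0] == parent or d[0].startswith(parent + "_")]

detections = [("Decoration", 0.91), ("Initial_Ms_Simple", 0.40), ("Text", 0.88)]
print(filter_by_parent(detections, "Initial"))
# [('Initial_Ms_Simple', 0.4)]
```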
### Limitations
- Performance vs. Specialists: While highly competitive, this generalist model may be slightly outperformed by a model trained exclusively on a single, specific corpus (e.g., a model trained only on the HORAE dataset may be better at detecting HORAE-specific features).
- Recall on Fine-Grained Subclasses: The model can sometimes be overly cautious, resulting in lower recall for certain specific subclasses (e.g., `Initial_Ms_Simple`).
- Out-of-Domain Performance: The model was trained on medieval and early modern European manuscripts. Its performance on other domains (e.g., modern documents, non-Latin scripts) is not guaranteed.
## Training Data
YOLO-gen was trained on a unified dataset created by merging the following three public corpora. The harmonization was achieved through a custom hierarchical ontology described in the accompanying paper.
- e-NDP: A corpus of Parisian medieval registers (1326-1504) with a relatively homogeneous administrative layout.
- CATMuS: A diverse multi-class dataset derived from various medieval and modern sources (ca. 12th-17th c.), including administrative, literary, and printed documents.
- HORAE: A corpus of richly decorated Books of Hours (ca. 13th-16th c.) with complex and artistic layouts.
## Evaluation
The model was trained for 120 epochs. The final performance was evaluated on a combined test set containing held-out images from all three source corpora, using standard COCO metrics for Oriented Bounding Boxes.
### Overall Performance

| Metric | Value |
|---|---|
| mAP@.50:.95 (all classes) | 0.558 |
| mAP@.50 (all classes) | 0.740 |
| Precision (all classes) | 0.680 |
| Recall (all classes) | 0.704 |
### Performance on Abstract Parent Classes

A key feature of YOLO-gen is its ability to recognize high-level conceptual classes. The performance on these parent and intermediate classes is as follows:

| Parent/Intermediate Class | mAP@.50:.95 | mAP@.50 | Precision | Recall |
|---|---|---|---|---|
| Text | 0.675 | 0.861 | 0.749 | 0.861 |
| Decoration | 0.629 | 0.839 | 0.712 | 0.902 |
| Initial (Universal) | 0.662 | 0.880 | 0.748 | 0.878 |
| Marks | 0.665 | 0.821 | 0.643 | 0.900 |
| Numbering | 0.422 | 0.776 | 0.611 | 0.820 |
| Paratext (Intermediate) | 0.461 | 0.674 | 0.643 | 0.658 |
| Initial_Manuscript (Inter.) | 0.416 | 0.519 | 0.801 | 0.225 |
| Initial_Printed (Inter.) | 0.477 | 0.720 | 0.755 | 0.597 |
In addition, the model can also recognize the original annotation classes of the three source corpora mentioned above.
## How to Use

The model can be loaded and used with the `ultralytics` Python library.
```python
from ultralytics import YOLO

# Load the model from the Hugging Face Hub
model = YOLO('your_huggingface_username/YOLO-gen-11x-OBB')  # Replace with your user/repo name

# Run inference on an image
image_path = 'path/to/your/manuscript_page.jpg'
results = model.predict(image_path)

# Process results
# Note: The model performs OBB detection, so results carry xyxyxyxy (4-corner) coordinates.
for r in results:
    for box in r.obb:
        class_id = int(box.cls)
        class_name = model.names[class_id]
        confidence = float(box.conf)
        coordinates = box.xyxyxyxy.tolist()
        print(f"Detected {class_name} with confidence {confidence:.2f} at {coordinates}")
```
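Downstream tools often expect axis-aligned `(x_min, y_min, x_max, y_max)` boxes rather than 4-corner polygons. A minimal helper for that conversion (a sketch, independent of `ultralytics`; it simply takes the extremes of the polygon's corner coordinates):

```python
# Convert a 4-corner OBB polygon [(x, y), ...] to its axis-aligned
# bounding box (x_min, y_min, x_max, y_max). Note that this discards
# the rotation information the OBB head provides.
def obb_to_aabb(poly):
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return (min(xs), min(ys), max(xs), max(ys))

print(obb_to_aabb([(10, 5), (30, 5), (30, 20), (10, 20)]))
# (10, 5, 30, 20)
```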
## Citation

```bibtex
@article{aguilar2025codicologycodecomparativestudy,
  title={From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents},
  author={Torres Aguilar, Sergio},
  url={https://arxiv.org/abs/2506.20326},
  year={2025},
  note={working paper or preprint}
}
```