MeXtract 3B
MeXtract 3B is a light-weight model for metadata extraction from scientific papers. It was created by finetuning Qwen2.5 3B Instruct on a synthetically generated dataset. Metadata attributes are defined using a schema-based approach: for each attribute we specify the type, the minimum and maximum length, and, where applicable, a fixed set of options.
Follow the instructions in the MeXtract repository to install all the dependencies, then run:
```python
from schema import TextSchema
from type_classes import *
from search import extract

class ExampleSchema(TextSchema):
    Name: Field(Str, 1, 5)
    Hobbies: Field(List[Str], 1, 1, ['Hiking', 'Swimming', 'Reading'])
    Age: Field(Int, 1, 100)
    Married: Field(Bool, 1, 1)

text = """
My name is Zaid. I am 25 years old. I like swimming and reading. I am married.
"""

metadata = extract(
    text, "IVUL-KAUST/MeXtract-3B", schema=ExampleSchema, backend="transformers"
)
print(metadata)
## {'Name': 'Zaid', 'Hobbies': ['Swimming'], 'Age': 25, 'Married': True}
```
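Since the model targets scientific papers, the same API can describe paper-level metadata. The sketch below is illustrative only: the attribute names, bounds, and option values are assumptions made for this example, not an official schema shipped with MeXtract or MOLE+.

```python
from schema import TextSchema
from type_classes import *
from search import extract

# Illustrative paper-metadata schema; the field names, bounds, and options here
# are assumptions for this example, not an official MeXtract/MOLE+ schema.
class PaperSchema(TextSchema):
    Title: Field(Str, 1, 20)
    Language: Field(List[Str], 1, 1, ['ar', 'en', 'jp', 'fr', 'ru'])
    Tasks: Field(List[Str], 1, 3, ['summarization', 'question answering', 'translation'])
    Open_Source: Field(Bool, 1, 1)

paper_text = """
We present AraSum, an open-source Arabic benchmark for abstractive summarization
released alongside this paper.
"""

metadata = extract(
    paper_text, "IVUL-KAUST/MeXtract-3B", schema=PaperSchema, backend="transformers"
)
print(metadata)  # a dict keyed by the schema fields
```

As in the example above, fields without options (such as Title) are extracted freely within the length bounds, while fields with options are restricted to the listed values.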
The model is evaluated on the MOLE+ benchmark.
| Model | ar | en | jp | fr | ru | multi | model | Average |
|---|---|---|---|---|---|---|---|---|
| Falcon3 3B Instruct | 20.46 | 16.30 | 20.29 | 17.81 | 17.23 | 16.13 | 15.96 | 17.74 |
| Llama3.2 3B Instruct | 28.77 | 25.17 | 33.14 | 27.73 | 22.21 | 22.58 | 33.37 | 27.57 |
| Gemma 3 4B It | 44.88 | 46.50 | 48.46 | 43.85 | 46.06 | 42.05 | 56.04 | 46.83 |
| Qwen2.5 3B Instruct | 49.99 | 56.72 | 61.13 | 57.08 | 64.10 | 52.07 | 59.05 | 57.16 |
| MOLE 3B | 23.03 | 50.88 | 50.83 | 50.05 | 57.72 | 43.34 | 17.17 | 41.86 |
| Nuextract 2.0 4B | 44.61 | 43.57 | 43.82 | 48.96 | 47.78 | 40.14 | 49.90 | 45.54 |
| Nuextract 2.0 8B | 51.93 | 58.93 | 62.11 | 58.41 | 63.21 | 38.21 | 53.70 | 55.21 |
| MeXtract 0.5B | 65.96 | 69.95 | 73.79 | 68.42 | 72.07 | 68.20 | 32.41 | 64.40 |
| MeXtract 1.5B | 67.06 | 73.71 | 75.08 | 71.57 | 76.28 | 71.87 | 52.05 | 69.66 |
| MeXtract 3B | 70.81 | 78.02 | 78.32 | 72.87 | 77.51 | 74.92 | 60.18 | 73.23 |
The model is optimized for metadata extraction; it might not work well for general NLP tasks.
The model is licensed under Apache 2.0.
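If you only want to load the checkpoint itself, it behaves as a standard Hugging Face causal language model (a finetune of Qwen2.5 3B Instruct). The sketch below only loads the weights and tokenizer; schema construction, prompting, and output parsing are handled by the extract helper above, so this is not a replacement for it.

```python
# Minimal sketch: load the MeXtract-3B weights directly with transformers.
# The schema-aware prompting and parsing are handled by the MeXtract `extract`
# helper shown earlier; this snippet only loads the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IVUL-KAUST/MeXtract-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```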
```bibtex
@misc{mextract,
      title={MeXtract: Light-Weight Metadata Extraction from Scientific Papers},
      author={Zaid Alyafeai and Maged S. Al-Shaibani and Bernard Ghanem},
      year={2025},
      eprint={2510.06889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.06889},
}
```