MeXtract 3B
MeXtract 3B is a light-weight model for metadata extraction from scientific papers. It was created by finetuning Qwen2.5 3B Instruct on a synthetically generated dataset. Metadata attributes are defined using a schema-based approach: for each attribute we specify the type, the minimum and maximum length, and, where applicable, a fixed set of options.
Follow the instructions in the MeXtract repository to install all the dependencies, then run:
```python
from schema import TextSchema
from type_classes import *
from search import extract

class ExampleSchema(TextSchema):
    Name: Field(Str, 1, 5)
    Hobbies: Field(List[Str], 1, 1, ['Hiking', 'Swimming', 'Reading'])
    Age: Field(Int, 1, 100)
    Married: Field(Bool, 1, 1)

text = """
My name is Zaid. I am 25 years old. I like swimming and reading. I am married.
"""

metadata = extract(
    text, "IVUL-KAUST/MeXtract-3B", schema=ExampleSchema, backend="transformers"
)
print(metadata)
## {'Name': 'Zaid', 'Hobbies': ['Swimming'], 'Age': 25, 'Married': True}
```
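Since the model targets scientific papers, the same API can describe paper-level metadata. The sketch below is illustrative only: the attribute names, bounds, and option values are assumptions made for this example, not an official schema shipped with MeXtract or MOLE+.

```python
from schema import TextSchema
from type_classes import *
from search import extract

# Illustrative paper-metadata schema; the field names, bounds, and options here
# are assumptions for this example, not an official MeXtract/MOLE+ schema.
class PaperSchema(TextSchema):
    Title: Field(Str, 1, 20)
    Language: Field(List[Str], 1, 1, ['ar', 'en', 'jp', 'fr', 'ru'])
    Tasks: Field(List[Str], 1, 3, ['summarization', 'question answering', 'translation'])
    Open_Source: Field(Bool, 1, 1)

paper_text = """
We present AraSum, an open-source Arabic benchmark for abstractive summarization
released alongside this paper.
"""

metadata = extract(
    paper_text, "IVUL-KAUST/MeXtract-3B", schema=PaperSchema, backend="transformers"
)
print(metadata)  # a dict keyed by the schema fields
```

As in the example above, fields without options (such as Title) are extracted freely within the length bounds, while fields with options are restricted to the listed values.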
The model is evaluated on the MOLE+ benchmark.
| Model | ar | en | jp | fr | ru | multi | model | Average |
|---|---|---|---|---|---|---|---|---|
| Falcon3 3B Instruct | 20.46 | 16.30 | 20.29 | 17.81 | 17.23 | 16.13 | 15.96 | 17.74 |
| Llama3.2 3B Instruct | 28.77 | 25.17 | 33.14 | 27.73 | 22.21 | 22.58 | 33.37 | 27.57 |
| Gemma 3 4B It | 44.88 | 46.50 | 48.46 | 43.85 | 46.06 | 42.05 | 56.04 | 46.83 |
| Qwen2.5 3B Instruct | 49.99 | 56.72 | 61.13 | 57.08 | 64.10 | 52.07 | 59.05 | 57.16 |
| MOLE 3B | 23.03 | 50.88 | 50.83 | 50.05 | 57.72 | 43.34 | 17.17 | 41.86 |
| Nuextract 2.0 4B | 44.61 | 43.57 | 43.82 | 48.96 | 47.78 | 40.14 | 49.90 | 45.54 |
| Nuextract 2.0 8B | 51.93 | 58.93 | 62.11 | 58.41 | 63.21 | 38.21 | 53.70 | 55.21 |
| MeXtract 0.5B | 65.96 | 69.95 | 73.79 | 68.42 | 72.07 | 68.20 | 32.41 | 64.40 |
| MeXtract 1.5B | 67.06 | 73.71 | 75.08 | 71.57 | 76.28 | 71.87 | 52.05 | 69.66 |
| MeXtract 3B | 70.81 | 78.02 | 78.32 | 72.87 | 77.51 | 74.92 | 60.18 | 73.23 |
The model is optimized for metadata extraction; it might not work well for general NLP tasks.
The model is licensed under Apache 2.0.
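If you only want to load the checkpoint itself, it behaves as a standard Hugging Face causal language model (a finetune of Qwen2.5 3B Instruct). The sketch below only loads the weights and tokenizer; schema construction, prompting, and output parsing are handled by the extract helper above, so this is not a replacement for it.

```python
# Minimal sketch: load the MeXtract-3B weights directly with transformers.
# The schema-aware prompting and parsing are handled by the MeXtract `extract`
# helper shown earlier; this snippet only loads the checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IVUL-KAUST/MeXtract-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```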
```bibtex
@misc{mextract,
      title={MeXtract: Light-Weight Metadata Extraction from Scientific Papers},
      author={Zaid Alyafeai and Maged S. Al-Shaibani and Bernard Ghanem},
      year={2025},
      eprint={2510.06889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.06889},
}
```