---
license: apache-2.0
datasets:
- IVUL-KAUST/MOLE-plus
metrics:
- f1
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
---

## Model Description

MeXtract 0.5B is a lightweight model for metadata extraction from scientific papers. The model was created by finetuning Qwen2.5 0.5B Instruct on a synthetically generated dataset. Metadata attributes are defined using a schema-based approach: for each attribute we define the type, the minimum and maximum length, and, where applicable, a set of allowed options.

## Usage

Follow the instructions from [MeXtract](https://github.com/IVUL-KAUST/MeXtract) to install all the dependencies, then:

```python
from schema import TextSchema
from type_classes import *
from search import extract

class ExampleSchema(TextSchema):
    Name: Field(Str, 1, 5)
    Hobbies: Field(List[Str], 1, 1, ['Hiking', 'Swimming', 'Reading'])
    Age: Field(Int, 1, 100)
    Married: Field(Bool, 1, 1)

text = """
My name is Zaid. I am 25 years old. I like swimming and reading. I am married.
"""

metadata = extract(
    text,
    "IVUL-KAUST/MeXtract-0.5B",
    schema=ExampleSchema,
    backend="transformers",
)
print(metadata)
## {'Name': 'Zaid', 'Hobbies': ['Swimming'], 'Age': 25, 'Married': True}
```

## Model Details

- Developed by: IVUL at KAUST
- Model type: transformer-based language model, finetuned from Qwen2.5 0.5B Instruct
- Language(s): multilingual; evaluated on Arabic, English, Japanese, French, and Russian papers
- Datasets: a synthetically generated dataset

## Evaluation Results

The model is evaluated on the [MOLE+](https://huggingface.co/IVUL-KAUST/MOLE-plus) benchmark.
| **Model** | **ar** | **en** | **jp** | **fr** | **ru** | **multi** | **model** | **Average** |
| ------------------------ | --------- | --------- | --------- | --------- | --------- | --------- | --------- | ----------- |
| **Falcon3 3B Instruct**  | 20.46     | 16.30     | 20.29     | 17.81     | 17.23     | 16.13     | 15.96     | 17.74       |
| **Llama3.2 3B Instruct** | 28.77     | 25.17     | 33.14     | 27.73     | 22.21     | 22.58     | 33.37     | 27.57       |
| **Gemma 3 4B It**        | 44.88     | 46.50     | 48.46     | 43.85     | 46.06     | 42.05     | 56.04     | 46.83       |
| **Qwen2.5 3B Instruct**  | 49.99     | 56.72     | 61.13     | 57.08     | 64.10     | 52.07     | 59.05     | 57.16       |
| **MOLE 3B**              | 23.03     | 50.88     | 50.83     | 50.05     | 57.72     | 43.34     | 17.17     | 41.86       |
| **Nuextract 2.0 4B**     | 44.61     | 43.57     | 43.82     | 48.96     | 47.78     | 40.14     | 49.90     | 45.54       |
| **Nuextract 2.0 8B**     | 51.93     | 58.93     | 62.11     | 58.41     | 63.21     | 38.21     | 53.70     | 55.21       |
| **MeXtract 0.5B**        | 65.96     | 69.95     | 73.79     | 68.42     | 72.07     | 68.20     | 32.41     | 64.40       |
| **MeXtract 1.5B**        | 67.06     | 73.71     | 75.08     | 71.57     | 76.28     | 71.87     | 52.05     | 69.66       |
| **MeXtract 3B**          | **70.81** | **78.02** | **78.32** | **72.87** | **77.51** | **74.92** | **60.18** | **73.23**   |

## Use and Limitations

### Limitations and Bias

The model is optimized for metadata extraction; it may not perform well on general NLP tasks.

## License

The model is licensed under Apache 2.0.

## Citation

```bibtex
@misc{mextract,
      title={MeXtract: Light-Weight Metadata Extraction from Scientific Papers},
      author={Zaid Alyafeai and Maged S. Al-Shaibani and Bernard Ghanem},
      year={2025},
      eprint={2510.06889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.06889},
}
```
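## Schema Constraints Illustrated

The schema-based approach described above constrains each attribute by a type, a minimum and maximum length (or value), and an optional set of allowed options. The following is a minimal plain-Python sketch of those constraint semantics only; it is not the MeXtract implementation, and the `validate` helper and its length-versus-value heuristic are assumptions for illustration.

```python
def validate(value, typ, lo, hi, options=None):
    """Check one extracted attribute against (type, min, max, options) constraints.

    Assumed semantics: the bounds constrain the value itself for numbers,
    and the length for strings and lists (mirroring Field(Int, 1, 100) vs
    Field(List[Str], 1, 1, [...]) in the usage example above).
    """
    if not isinstance(value, typ):
        return False
    if isinstance(value, (bool, int, float)):
        measure = value          # numeric: bound the value
    else:
        measure = len(value)     # string/list: bound the length
    if not (lo <= measure <= hi):
        return False
    if options is not None:
        items = value if isinstance(value, list) else [value]
        if any(item not in options for item in items):
            return False
    return True

# Constraints mirroring the ExampleSchema from the usage snippet.
constraints = {
    "Age":     (int,  1, 100, None),
    "Hobbies": (list, 1, 1,   ["Hiking", "Swimming", "Reading"]),
    "Married": (bool, 1, 1,   None),
}

extracted = {"Age": 25, "Hobbies": ["Swimming"], "Married": True}
print(all(validate(v, *constraints[k]) for k, v in extracted.items()))  # prints True
```

Under these assumed semantics, out-of-range or off-list values (e.g. an age of 150, or a hobby outside the allowed options) would fail validation rather than be returned as metadata.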