Model Description

This model is a fine-tuned version of dbmdz/bert-base-italian-xxl-cased for multi-label text classification, trained on the train split of the Eutekne domande_fitlri_materie dataset.

Intended Use

This model is intended for classifying legal questions into 216 categories. Because the task is multi-label, a single question can be assigned more than one category. The categories are taken from the Eutekne domande_fitlri_materie dataset.

Training Details

Base Model

This model was fine-tuned from dbmdz/bert-base-italian-xxl-cased.

Training Data

{
  "train_samples": 4044,
  "val_samples": 867,
  "test_samples": 867,
  "num_labels": 216
}

Training Hyperparameters

{
  "batch_size": 32,
  "learning_rate": 2e-05,
  "num_epochs": 5,
  "max_length": 512,
  "threshold": 0.5
}
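
As a rough sanity check on the training budget, the values above imply the following number of optimizer steps. This is only a sketch: it assumes one optimizer step per batch (no gradient accumulation, single device) and that the last partial batch is kept, neither of which is stated in this card.

```python
import math

# Values copied from the Training Data and Training Hyperparameters sections.
train_samples = 4044
batch_size = 32
num_epochs = 5

# Assumption: one optimizer step per batch, final partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")  # 127
print(f"total steps:     {total_steps}")      # 635
```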

Evaluation Results

| metric               |   validation_results |   test_results |
|:---------------------|---------------------:|---------------:|
| eval_exact_match     |           0.310265   |     0.303345   |
| eval_hamming_loss    |           0.00932868 |     0.00976654 |
| eval_f1_micro        |           0.589038   |     0.566896   |
| eval_f1_macro        |           0.142712   |     0.157367   |
| eval_precision_micro |           0.773795   |     0.746259   |
| eval_precision_macro |           0.206659   |     0.234683   |
| eval_recall_micro    |           0.475503   |     0.457045   |
| eval_recall_macro    |           0.118758   |     0.13034    |
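
To help read the threshold-based metrics above, here is a minimal sketch of how exact match, Hamming loss, and micro/macro F1 are commonly computed for multi-label output. The toy matrices are hypothetical and the zero-division convention (F1 = 0 when a label has no true or predicted positives) is an assumption; the evaluation code used for this model is not shown in the card.

```python
# Toy multi-hot matrices: 4 samples x 3 labels (hypothetical data, not model output).
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]

n_samples, n_labels = len(y_true), len(y_true[0])

# Exact match: fraction of samples whose full label vector is predicted correctly.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Hamming loss: fraction of individual label decisions that are wrong.
wrong = sum(tv != pv for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p))
hamming_loss = wrong / (n_samples * n_labels)

def f1(tp, fp, fn):
    # Assumption: F1 is 0 when there are no positives at all for a label.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Per-label true positives, false positives, false negatives.
counts = []
for j in range(n_labels):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    counts.append((tp, fp, fn))

# Micro: pool the counts over all labels, then compute one F1.
f1_micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
# Macro: compute F1 per label, then take the unweighted mean.
f1_macro = sum(f1(*c) for c in counts) / n_labels

print(exact_match, round(hamming_loss, 4), f1_micro, round(f1_macro, 4))
```

In this toy case the frequent label is predicted perfectly while the two rare labels are always missed, so micro F1 (0.75) is far higher than macro F1 (about 0.33). A gap of that kind between micro and macro scores is typical when many labels have few examples.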

| k   |   hit_rate |   precision |   recall |     f1 |   ndcg |   coverage |    mrr |
|:----|-----------:|------------:|---------:|-------:|-------:|-----------:|-------:|
| @1  |     0.7255 |      0.7255 |   0.2076 | 0.302  | 0.7255 |     0.2076 | 0.2076 |
| @3  |     0.8316 |      0.5398 |   0.4278 | 0.4389 | 0.6316 |     0.4278 | 0.3084 |
| @5  |     0.8674 |      0.3852 |   0.4841 | 0.3924 | 0.5862 |     0.4841 | 0.3214 |
| @10 |     0.9193 |      0.2261 |   0.546  | 0.2955 | 0.5756 |     0.546  | 0.3297 |
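
The ranking table above reports metrics at several cutoffs k. The card does not define them, so the following is only a sketch under common definitions: hit rate is 1 when at least one true label appears in the top k, precision is the share of the top k that is correct, and recall is the share of true labels recovered. The scores and label indices are hypothetical, and `top_k_metrics` is an illustrative helper, not part of the model repository.

```python
def top_k_metrics(scores, true_labels, k):
    """Per-sample hit rate, precision and recall at cutoff k (assumed definitions)."""
    # Indices of the k highest-scoring labels.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in ranked if i in true_labels)
    return {
        "hit_rate": 1.0 if hits else 0.0,
        "precision": hits / k,
        "recall": hits / len(true_labels),
    }

# Hypothetical scores for 5 labels; labels 1 and 4 are the true ones.
scores = [0.10, 0.80, 0.05, 0.60, 0.30]
true_labels = {1, 4}

for k in (1, 3, 5):
    print(k, top_k_metrics(scores, true_labels, k))
```

Dataset-level values would be the mean of these per-sample numbers. Under these definitions hit rate and precision coincide at k = 1, which matches the first row of the table.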

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import pickle
import torch
import numpy as np

repo_id = "giacomorossojakala/dbmdz-bert-base-italian-xxl-cased-eutekne-filtri-materia-lv2"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

# Download and load label encoder
downloaded_path = hf_hub_download(repo_id=repo_id, filename="label_encoder.pkl")
with open(downloaded_path, 'rb') as f:
    mlb = pickle.load(f)

custom_text = "agevolazioni acquisto prima casa"
inputs = tokenizer(custom_text, truncation=True, padding=True, max_length=512, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

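# Convert logits to independent per-label probabilities (multi-label, so sigmoid rather than softmax)
logits = outputs.logits
probabilities = torch.sigmoid(logits)

# Keep the labels whose probability exceeds the decision threshold.
# The Training Hyperparameters section lists a threshold of 0.5; 0.75 is stricter.
predictions = (probabilities > 0.75).int().cpu().numpy()
predicted_labels = mlb.inverse_transform(predictions)

# Rank all labels by descending probability
ranked_indexes = np.argsort(probabilities.cpu().numpy(), axis=1)[:, ::-1]
ranked_labels = np.array(mlb.classes_)[ranked_indexes]

top5 = ", ".join(map(str, ranked_labels[0, :5]))
print(f"Custom Text: {custom_text}")
print(f"Predicted Labels: {predicted_labels[0]}")  # e.g. ('V',)
print(f"Ranked Labels: [{top5}, ...]")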
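
With a strict threshold, a question can end up with no predicted label at all. One possible way to handle that, shown here as a sketch and not as the author's method, is to fall back to the top-ranked label whenever nothing clears the threshold. The `select_labels` helper, the `min_labels` parameter, and the class names are all hypothetical; the function works on plain Python lists so it can be tried without loading the model.

```python
import math

def select_labels(logits, classes, threshold=0.5, min_labels=1):
    """Return (label, probability) pairs above threshold, with a top-k fallback."""
    # Sigmoid per label, since each label is an independent yes/no decision.
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    ranked = sorted(zip(classes, probs), key=lambda cp: cp[1], reverse=True)
    chosen = [(c, p) for c, p in ranked if p > threshold]
    # Assumed policy: never return fewer than min_labels labels.
    if len(chosen) < min_labels:
        chosen = ranked[:min_labels]
    return chosen

# Hypothetical logits for three hypothetical classes.
classes = ["A", "B", "C"]
logits = [-2.0, 0.3, -0.5]

print(select_labels(logits, classes, threshold=0.5))   # "B" clears the threshold
print(select_labels(logits, classes, threshold=0.75))  # nothing clears it, so top-1 is used
```

To apply this to the model output, `logits[0].tolist()` and `list(mlb.classes_)` from the snippet above could be passed in.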

Safetensors

Model size: 0.1B params

Tensor type: F32