Model Description

This model is a fine-tuned version of dbmdz/bert-base-italian-xxl-cased for multi-label text classification, trained on the training split of the Eutekne domande_fitlri_materie dataset.

Intended Use

This model assigns legal questions to one or more of 31 categories, which are taken from the Eutekne domande_fitlri_materie dataset.

Training Details

Base Model

This model was fine-tuned from dbmdz/bert-base-italian-xxl-cased.

Training Data

{
  "train_samples": 4044,
  "val_samples": 867,
  "test_samples": 867,
  "num_labels": 31
}

Training Hyperparameters

{
  "batch_size": 32,
  "learning_rate": 2e-05,
  "num_epochs": 20,
  "max_length": 512,
  "threshold": 0.5
}
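
For a sense of the training budget these settings imply, the following minimal sketch derives the number of optimizer steps from the figures above. It assumes one optimizer step per batch (no gradient accumulation), a single device, and that the last partial batch is kept; none of this is stated in the card.

```python
import math

# Values copied from the Training Data and Training Hyperparameters sections.
train_samples = 4044
batch_size = 32
num_epochs = 20

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
```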

Evaluation Results

| metric                          |   validation_results |   test_results |
|:--------------------------------|---------------------:|---------------:|
| eval_exact_match                |            0.489043  |      0.472895  |
| eval_hamming_loss               |            0.0273096 |      0.0298024 |
| eval_f1_micro                   |            0.697444  |      0.674258  |
| eval_f1_macro                   |            0.537159  |      0.57959   |
| eval_precision_micro            |            0.725557  |      0.684558  |
| eval_precision_macro            |            0.593484  |      0.626934  |
| eval_recall_micro               |            0.671429  |      0.664263  |
| eval_recall_macro               |            0.511322  |      0.574265  |
| eval_best_threshold             |            0.6       |      0.6       |
| eval_avg_predictions_per_sample |            1.34487   |      1.39677   |
| eval_avg_labels_per_sample      |            1.45329   |      1.43945   |
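
The metric names follow the usual multi-label conventions. As a hedged illustration (the card does not show its evaluation code, and the toy arrays below are invented), this sketch computes exact match, Hamming loss, and micro-averaged precision, recall, and F1 from binary indicator rows. It then checks that the reported micro F1 values are the harmonic mean of the reported micro precision and recall.

```python
def multilabel_metrics(y_true, y_pred):
    """Exact match, Hamming loss, and micro P/R/F1 for binary indicator rows."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    pairs = list(zip(y_true, y_pred))
    exact = sum(t == p for t, p in pairs) / n_samples
    wrong = sum(a != b for t, p in pairs for a, b in zip(t, p))
    tp = sum(a and b for t, p in pairs for a, b in zip(t, p))
    n_pred = sum(map(sum, y_pred))
    n_true = sum(map(sum, y_true))
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "exact_match": exact,
        "hamming_loss": wrong / (n_samples * n_labels),
        "precision_micro": precision,
        "recall_micro": recall,
        "f1_micro": f1,
    }

# Invented toy data: 3 samples, 4 labels.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 1, 0]]
toy = multilabel_metrics(y_true, y_pred)
print(toy)

# Consistency check on the reported numbers: F1 = 2PR / (P + R).
def harmonic_f1(p, r):
    return 2 * p * r / (p + r)

val_f1 = harmonic_f1(0.725557, 0.671429)   # reported eval_f1_micro: 0.697444
test_f1 = harmonic_f1(0.684558, 0.664263)  # reported eval_f1_micro: 0.674258
print(round(val_f1, 6), round(test_f1, 6))
```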

| k   |   hit_rate |   precision |   recall |     f1 |   ndcg |   coverage |    mrr |
|:----|-----------:|------------:|---------:|-------:|-------:|-----------:|-------:|
| @1  |     0.7589 |      0.7589 |   0.2243 | 0.3217 | 0.7589 |     0.2243 | 0.2243 |
| @3  |     0.9043 |      0.3806 |   0.2963 | 0.3011 | 0.5013 |     0.2963 | 0.2571 |
| @5  |     0.9377 |      0.2496 |   0.3149 | 0.2514 | 0.4421 |     0.3149 | 0.2614 |
| @10 |     0.9746 |      0.1349 |   0.3334 | 0.176  | 0.4168 |     0.3334 | 0.2639 |
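
The card does not define the ranking metrics, so the following is only a sketch of the conventional definitions of hit rate@k and precision@k, applied to invented label sets. Under these definitions the two coincide at k = 1, which matches the @1 row above; the remaining columns may follow different conventions and are not reproduced here.

```python
def hit_rate_at_k(ranked, relevant, k):
    """Fraction of samples with at least one true label among the top-k."""
    hits = [bool(set(r[:k]) & rel) for r, rel in zip(ranked, relevant)]
    return sum(hits) / len(hits)

def precision_at_k(ranked, relevant, k):
    """Mean fraction of the top-k ranked labels that are true labels."""
    per_sample = [len(set(r[:k]) & rel) / k for r, rel in zip(ranked, relevant)]
    return sum(per_sample) / len(per_sample)

# Invented example: label names are placeholders, not the model's classes.
ranked = [["A", "B", "C", "D"], ["C", "A", "D", "B"], ["D", "C", "B", "A"]]
relevant = [{"A", "C"}, {"B"}, {"D"}]

for k in (1, 3):
    print(k, hit_rate_at_k(ranked, relevant, k), precision_at_k(ranked, relevant, k))
```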

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import pickle
import torch
import numpy as np

repo_id = "giacomorossojakala/dbmdz-bert-base-italian-xxl-cased-eutekne-filtri-materia-lv1"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

# Download and load label encoder
downloaded_path = hf_hub_download(repo_id=repo_id, filename="label_encoder.pkl")
with open(downloaded_path, 'rb') as f:
    mlb = pickle.load(f)

custom_text = "agevolazioni acquisto prima casa"
inputs = tokenizer(custom_text, truncation=True, padding=True, max_length=512, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

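# Convert logits to independent per-label probabilities (multi-label setup)
logits = outputs.logits
probabilities = torch.sigmoid(logits)

# Keep labels above the decision threshold. This example uses 0.75;
# the evaluation table reports a best threshold of 0.6.
predictions = (probabilities > 0.75).int().cpu().numpy()
predicted_labels = mlb.inverse_transform(predictions)

# Rank all labels by descending probability
ranked_indexes = np.argsort(probabilities.cpu().numpy(), axis=1)[:, ::-1]
ranked_labels = np.array(mlb.classes_)[ranked_indexes]

print(f"Custom Text: {custom_text}")
print(f"Predicted Labels: {predicted_labels[0]}")  # e.g. ('V',)
print(f"Ranked Labels: {'[' + ', '.join(ranked_labels[0, :5]) + '...]'}")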
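
With a fixed threshold, a question can end up with no predicted label at all. One possible way to handle that, sketched here in plain Python on invented logits, is to fall back to the single highest-scoring label. The function name `select_labels`, the fallback behaviour, and the placeholder class names are assumptions for illustration and are not part of the released model; the default threshold of 0.6 mirrors `eval_best_threshold` in the evaluation table.

```python
import math

def select_labels(logits, classes, threshold=0.6, fallback_top1=True):
    """Return labels whose sigmoid probability exceeds the threshold.

    If nothing passes and fallback_top1 is set, return the top-scoring label.
    """
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    chosen = [c for c, p in zip(classes, probs) if p > threshold]
    if not chosen and fallback_top1:
        best = max(range(len(probs)), key=probs.__getitem__)
        chosen = [classes[best]]
    return chosen

# Placeholder class names and invented logits for one question.
classes = ["A", "B", "C"]
print(select_labels([2.0, -1.0, 0.5], classes))    # A and C exceed 0.6
print(select_labels([-0.5, -2.0, -1.0], classes))  # nothing passes, falls back to A
```

In the snippet above, the same idea would apply to one row of `probabilities` together with `mlb.classes_`, skipping the sigmoid step since it has already been applied.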

Safetensors

Model size: 0.1B params
Tensor type: F32