Model Description
This model is a fine-tuned paraphrase-multilingual-mpnet-base-v2 for multi-label text classification on the Eutekne domande_filtri_materie train dataset.
The multi-label text classification task is framed as a retrieval problem: the query and the textual descriptions of the labels are encoded with the same embedder, and the nearest label descriptions are retrieved.
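As a minimal, self-contained sketch of this formulation (the label codes, descriptions, and query below are hypothetical; the ready-to-use helper for this model is in the How to Use section):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Any sentence embedder illustrates the idea; the fine-tuned model is loaded in How to Use
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Hypothetical label descriptions (the real ones ship with the model as label_description.json)
labels = {
    "LABEL_A": "Questions about value-added tax, rates and deductions",
    "LABEL_B": "Questions about employment contracts and social contributions",
}

query = "Which VAT rate applies to the sale of second-hand goods?"  # hypothetical query

# Encode the query and the label descriptions with the same embedder,
# then rank labels by cosine similarity to the query
query_emb = embedder.encode([query])
label_embs = embedder.encode(list(labels.values()))
scores = cosine_similarity(query_emb, label_embs)[0]
ranked = sorted(zip(labels.keys(), scores), key=lambda x: x[1], reverse=True)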
Intended Use
This model is intended for classifying legal questions into categories. The categories are those defined in the Eutekne domande_filtri_materie dataset.
Training Details
Experiment
Training is performed with the following settings: contrastive_learning - ['level_1_2', 'descrizione_riassunto_cascata']
Base Model
This model is fine-tuned from: paraphrase-multilingual-mpnet-base-v2
Training Data
{
"train_samples": 4044,
"val_samples": 867,
"test_samples": 867
}
Training Hyperparameters
{
"batch_size": 16,
"learning_rate": 2e-05,
"num_epochs": 30,
"max_length": 256
}
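The training script itself is not included in this card; the following is a minimal sketch of how a contrastive fine-tuning run with these hyperparameters could look in sentence-transformers, assuming (question, description of a gold label) positive pairs and MultipleNegativesRankingLoss with in-batch negatives. The pair construction, loss, and warmup value are assumptions, not necessarily the exact setup used.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Base model and maximum sequence length from the hyperparameters above
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
model.max_seq_length = 256

# Hypothetical training pairs: each question paired with the description of one of its gold labels.
# In practice these would be built from the 4044-sample train split.
train_examples = [
    InputExample(texts=["example legal question", "description of its correct label"]),
    InputExample(texts=["another legal question", "description of its correct label"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=30,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # assumption; not reported in this card
)

With MultipleNegativesRankingLoss, each question is pulled toward the description of its own label and pushed away from the other label descriptions in the same batch.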
Evaluation Results
| Metric | Validation | Test |
|:-------------|---------------------:|---------------:|
| accuracy_1 | 0.681661 | 0.663206 |
| accuracy_3 | 0.83045 | 0.817762 |
| accuracy_5 | 0.889273 | 0.861592 |
| accuracy_10 | 0.929642 | 0.916955 |
| precision_1 | 0.681661 | 0.663206 |
| precision_3 | 0.532488 | 0.517878 |
| precision_5 | 0.396078 | 0.376471 |
| precision_10 | 0.233449 | 0.227682 |
| recall_1 | 0.195662 | 0.199829 |
| recall_3 | 0.422025 | 0.41992 |
| recall_5 | 0.495023 | 0.479549 |
| recall_10 | 0.555928 | 0.547916 |
| f1_1 | 0.284624 | 0.286278 |
| f1_3 | 0.434482 | 0.426173 |
| f1_5 | 0.405207 | 0.386656 |
| f1_10 | 0.305581 | 0.297389 |
| mrr_1 | 0.681661 | 0.663206 |
| mrr_3 | 0.749519 | 0.731642 |
| mrr_5 | 0.76313 | 0.741619 |
| mrr_10 | 0.768567 | 0.749094 |
| r_precision | 0.475959 | 0.468916 |
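The exact metric definitions used by the evaluation script are not reproduced here; a common reading of these @k retrieval metrics for a multi-label query (gold labels as a set, predictions as a ranked label list) is sketched below. Per-query values are then averaged over the validation or test split.

def metrics_at_k(ranked_labels, gold_labels, k):
    '''
    Ranking metrics for a single query.
    Args:
        ranked_labels: all labels sorted by descending similarity to the query
        gold_labels: set of true labels for the query
        k: cut-off rank
    '''
    hits = [label in gold_labels for label in ranked_labels[:k]]

    accuracy = float(any(hits))                 # at least one gold label in the top k
    precision = sum(hits) / k
    recall = sum(hits) / len(gold_labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank of the first gold label within the top k (0 if none appears)
    mrr = next((1.0 / (rank + 1) for rank, hit in enumerate(hits) if hit), 0.0)

    # R-precision: precision at R, where R is the number of gold labels for the query
    r = len(gold_labels)
    r_precision = sum(label in gold_labels for label in ranked_labels[:r]) / r

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mrr": mrr, "r_precision": r_precision}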
How to Use
import json

import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the fine-tuned embedder (pass device="cuda" to SentenceTransformer if a GPU is available)
repo_id = "giacomorossojakala/paraphrase-multilingual-mpnet-base-v2-eutekne-filtri-materia-lv1-2-desc-cascata"
model = SentenceTransformer(repo_id)

# Download the label descriptions shipped with the model (add token=... if authentication is required)
downloaded_path = hf_hub_download(repo_id=repo_id, filename="label_description.json")
with open(downloaded_path, "r") as f:
    label_descriptions = json.load(f)

def classify_text(text, model, label_descriptions, top_k=5):
    '''
    Classify a text by computing similarity with all label descriptions.
    Args:
        text: Input text to classify
        model: Trained SentenceTransformer model
        label_descriptions: Dict mapping label codes to descriptions
        top_k: Number of top predictions to return
    Returns:
        List of (label, similarity_score) tuples, sorted by score (descending)
    '''
    # Get label list and descriptions
    label_list = sorted(label_descriptions.keys())
    label_texts = [label_descriptions[label] for label in label_list]
    # Encode the question and all label descriptions with the same embedder
    question_embedding = model.encode(text, convert_to_numpy=True)
    label_embeddings = model.encode(label_texts, convert_to_numpy=True, show_progress_bar=False)
    # Compute cosine similarities between the question and every label description
    similarities = cosine_similarity([question_embedding], label_embeddings)[0]
    # Keep the top-k most similar labels
    top_k_indices = np.argsort(similarities)[::-1][:top_k]
    predictions = [(label_list[idx], similarities[idx]) for idx in top_k_indices]
    return predictions

# Get predictions for a sample question
custom_text = "acquista casa nel 2025 lavori ristrutturazione ma andrà ad abitare nel 2026, detrazioni 50%?"
predictions = classify_text(custom_text, model, label_descriptions, top_k=5)
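Each prediction is a (label, cosine_similarity) pair; for example:

for label, score in predictions:
    print(f"{label}: {score:.3f}")

For multi-label assignment, a similarity threshold can be used instead of a fixed top-k; this is a usage suggestion, not necessarily the decision rule behind the reported metrics.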