Model Description
This model is a fine-tuned paraphrase-multilingual-mpnet-base-v2 for multi-label text classification on the Eutekne domande_filtri_materie train dataset.
The multi-label text classification task is framed as a retrieval problem: the query and the textual descriptions of the labels are encoded with the same embedder, and the nearest label descriptions are retrieved.
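As a minimal, self-contained sketch of this formulation (the label codes, descriptions, and query below are hypothetical; the ready-to-use helper for this model is in the How to Use section):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Any sentence embedder illustrates the idea; the fine-tuned model is loaded in How to Use
embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Hypothetical label descriptions (the real ones ship with the model as label_description.json)
labels = {
    "LABEL_A": "Questions about value-added tax, rates and deductions",
    "LABEL_B": "Questions about employment contracts and social contributions",
}

query = "Which VAT rate applies to the sale of second-hand goods?"  # hypothetical query

# Encode the query and the label descriptions with the same embedder,
# then rank labels by cosine similarity to the query
query_emb = embedder.encode([query])
label_embs = embedder.encode(list(labels.values()))
scores = cosine_similarity(query_emb, label_embs)[0]
ranked = sorted(zip(labels.keys(), scores), key=lambda x: x[1], reverse=True)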
Intended Use
This model is intended for classifying legal questions into categories. The categories are those defined in the Eutekne domande_filtri_materie dataset.
Training Details
Experiment
Training is performed with the following settings: contrastive_learning - ['level_1_2', 'descrizione_riassunto_cascata']
Base Model
This model is fine-tuned from: paraphrase-multilingual-mpnet-base-v2
Training Data
{
"train_samples": 4044,
"val_samples": 867,
"test_samples": 867
}
Training Hyperparameters
{
"batch_size": 16,
"learning_rate": 2e-05,
"num_epochs": 30,
"max_length": 256
}
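The training script itself is not included in this card; the following is a minimal sketch of how a contrastive fine-tuning run with these hyperparameters could look in sentence-transformers, assuming (question, description of a gold label) positive pairs and MultipleNegativesRankingLoss with in-batch negatives. The pair construction, loss, and warmup value are assumptions, not necessarily the exact setup used.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Base model and maximum sequence length from the hyperparameters above
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
model.max_seq_length = 256

# Hypothetical training pairs: each question paired with the description of one of its gold labels.
# In practice these would be built from the 4044-sample train split.
train_examples = [
    InputExample(texts=["example legal question", "description of its correct label"]),
    InputExample(texts=["another legal question", "description of its correct label"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=30,
    optimizer_params={"lr": 2e-5},
    warmup_steps=100,  # assumption; not reported in this card
)

With MultipleNegativesRankingLoss, each question is pulled toward the description of its own label and pushed away from the other label descriptions in the same batch.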
Evaluation Results
| Metric | Validation | Test |
|:-------------|---------------------:|---------------:|
| accuracy_1 | 0.681661 | 0.663206 |
| accuracy_3 | 0.83045 | 0.817762 |
| accuracy_5 | 0.889273 | 0.861592 |
| accuracy_10 | 0.929642 | 0.916955 |
| precision_1 | 0.681661 | 0.663206 |
| precision_3 | 0.532488 | 0.517878 |
| precision_5 | 0.396078 | 0.376471 |
| precision_10 | 0.233449 | 0.227682 |
| recall_1 | 0.195662 | 0.199829 |
| recall_3 | 0.422025 | 0.41992 |
| recall_5 | 0.495023 | 0.479549 |
| recall_10 | 0.555928 | 0.547916 |
| f1_1 | 0.284624 | 0.286278 |
| f1_3 | 0.434482 | 0.426173 |
| f1_5 | 0.405207 | 0.386656 |
| f1_10 | 0.305581 | 0.297389 |
| mrr_1 | 0.681661 | 0.663206 |
| mrr_3 | 0.749519 | 0.731642 |
| mrr_5 | 0.76313 | 0.741619 |
| mrr_10 | 0.768567 | 0.749094 |
| r_precision | 0.475959 | 0.468916 |
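The exact metric definitions used by the evaluation script are not reproduced here; a common reading of these @k retrieval metrics for a multi-label query (gold labels as a set, predictions as a ranked label list) is sketched below. Per-query values are then averaged over the validation or test split.

def metrics_at_k(ranked_labels, gold_labels, k):
    '''
    Ranking metrics for a single query.
    Args:
        ranked_labels: all labels sorted by descending similarity to the query
        gold_labels: set of true labels for the query
        k: cut-off rank
    '''
    hits = [label in gold_labels for label in ranked_labels[:k]]

    accuracy = float(any(hits))                 # at least one gold label in the top k
    precision = sum(hits) / k
    recall = sum(hits) / len(gold_labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Reciprocal rank of the first gold label within the top k (0 if none appears)
    mrr = next((1.0 / (rank + 1) for rank, hit in enumerate(hits) if hit), 0.0)

    # R-precision: precision at R, where R is the number of gold labels for the query
    r = len(gold_labels)
    r_precision = sum(label in gold_labels for label in ranked_labels[:r]) / r

    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "mrr": mrr, "r_precision": r_precision}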
How to Use
import json

import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the fine-tuned embedder (pass device="cuda" to SentenceTransformer if a GPU is available)
repo_id = "giacomorossojakala/paraphrase-multilingual-mpnet-base-v2-eutekne-filtri-materia-lv1-2-desc-cascata"
model = SentenceTransformer(repo_id)

# Download the label descriptions shipped with the model (add token=... if authentication is required)
downloaded_path = hf_hub_download(repo_id=repo_id, filename="label_description.json")
with open(downloaded_path, "r") as f:
    label_descriptions = json.load(f)

def classify_text(text, model, label_descriptions, top_k=5):
    '''
    Classify a text by computing similarity with all label descriptions.
    Args:
        text: Input text to classify
        model: Trained SentenceTransformer model
        label_descriptions: Dict mapping label codes to descriptions
        top_k: Number of top predictions to return
    Returns:
        List of (label, similarity_score) tuples, sorted by score (descending)
    '''
    # Get label list and descriptions
    label_list = sorted(label_descriptions.keys())
    label_texts = [label_descriptions[label] for label in label_list]
    # Encode the question and all label descriptions with the same embedder
    question_embedding = model.encode(text, convert_to_numpy=True)
    label_embeddings = model.encode(label_texts, convert_to_numpy=True, show_progress_bar=False)
    # Compute cosine similarities between the question and every label description
    similarities = cosine_similarity([question_embedding], label_embeddings)[0]
    # Keep the top-k most similar labels
    top_k_indices = np.argsort(similarities)[::-1][:top_k]
    predictions = [(label_list[idx], similarities[idx]) for idx in top_k_indices]
    return predictions

# Get predictions for a sample question
custom_text = "acquista casa nel 2025 lavori ristrutturazione ma andrà ad abitare nel 2026, detrazioni 50%?"
predictions = classify_text(custom_text, model, label_descriptions, top_k=5)
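Each prediction is a (label, cosine_similarity) pair; for example:

for label, score in predictions:
    print(f"{label}: {score:.3f}")

For multi-label assignment, a similarity threshold can be used instead of a fixed top-k; this is a usage suggestion, not necessarily the decision rule behind the reported metrics.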