---
license: apache-2.0
language:
- pl
- en
- de
base_model:
- EuroBERT/EuroBERT-610m
tags:
- sentence-transformers
- embeddings
- plwordnet
- semantic-relations
- semantic-search
pipeline_tag: sentence-similarity
---

# PLWordNet Semantic Embedder (bi-encoder)

A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings. 
Every relation between lexical units and synsets is transformed into training/evaluation examples. 

The dataset mixes several signals about each meaning: emotion annotations, definitions, and external descriptions (Wikipedia, split into sentences).
The embedder mirrors these semantic relations: it pulls together the embeddings of items linked by “positive” relations
(e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart the embeddings of items linked by “negative”
relations (e.g., antonymy or mutually exclusive relations). Source code and training scripts:
- GitHub: [https://github.com/radlab-dev-group/radlab-plwordnet](https://github.com/radlab-dev-group/radlab-plwordnet)

## Model summary

- **Architecture**: bi-encoder built with `sentence-transformers` (transformer encoder + pooling).
- **Use cases**: semantic similarity and semantic search for Polish words, senses, definitions, and sentences.
- **Objective**: CosineSimilarityLoss on positive/negative pairs.
- **Behavior**: preserves the topology of semantic relations derived from plWordNet.

## Training data

Constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs. 
Augmented with:
  - definitions,
  - usage examples (including emotion annotations where available),
  - external descriptions from Wikipedia (split into sentences).

Positive pairs correspond to relations expected to increase similarity; 
negative pairs correspond to relations expected to decrease similarity.
Additional hard/soft negatives may include unrelated meanings.
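
As a rough illustration of the pair construction (this is not the project’s actual pipeline; the relation names and target scores below are hypothetical), relation-derived pairs can be turned into labeled examples in the format expected by `CosineSimilarityLoss`:

``` python
# Hypothetical sketch: map plWordNet relations to (sentence1, sentence2, score) examples.
from datasets import Dataset

# Target similarity per relation type (values chosen only for illustration).
relation_to_score = {
    "synonymy": 1.0,    # positive relation -> high target similarity
    "hypernymy": 0.8,   # positive relation -> moderately high similarity
    "antonymy": 0.0,    # negative relation -> low target similarity
}

raw_pairs = [
    ("zamek (budowla obronna)", "warownia", "synonymy"),
    ("zamek (budowla obronna)", "budowla", "hypernymy"),
    ("ciepły", "zimny", "antonymy"),
]

train_dataset = Dataset.from_dict({
    "sentence1": [a for a, _, _ in raw_pairs],
    "sentence2": [b for _, b, _ in raw_pairs],
    "score":     [relation_to_score[r] for _, _, r in raw_pairs],
})
```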

## Training details

- **Trainer**: `SentenceTransformerTrainer`
- **Loss**: `CosineSimilarityLoss`
- **Evaluator**: `EmbeddingSimilarityEvaluator` (cosine)
- Typical **hyperparameters**:
    - epochs: 5
    - per-device batch size: 10 (gradient accumulation: 4)
    - learning rate: 5e-6 (AdamW fused)
    - weight decay: 0.01
    - warmup: 20k steps
    - fp16: true
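
The hyperparameters above correspond roughly to a `SentenceTransformerTrainer` configuration like the following sketch (not the exact training script; the output directory and the toy dataset are placeholders):

``` python
# Sketch of a training setup matching the listed hyperparameters (approximate, not the actual script).
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("EuroBERT/EuroBERT-610m", trust_remote_code=True)

# Toy pairs; see the construction sketch in "Training data".
train_dataset = Dataset.from_dict({
    "sentence1": ["zamek (budowla obronna)", "ciepły"],
    "sentence2": ["warownia", "zimny"],
    "score": [1.0, 0.0],
})

args = SentenceTransformerTrainingArguments(
    output_dir="plwordnet-bi-encoder",   # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    optim="adamw_torch_fused",
    weight_decay=0.01,
    warmup_steps=20_000,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()
```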

## Evaluation
- **Task**: semantic similarity on dev/test splits built from the relation-derived pairs.
- **Metric**: cosine-based correlation (Spearman/Pearson) where applicable, or discrimination between positive vs. negative pairs.
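
A minimal sketch of such an evaluation with `EmbeddingSimilarityEvaluator` (the pairs and the evaluator name are illustrative, not the actual dev split):

``` python
# Sketch: cosine-similarity evaluation on a few illustrative relation-derived pairs.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

dev_evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["zamek (budowla obronna)", "ciepły"],
    sentences2=["warownia", "zimny"],
    scores=[1.0, 0.0],          # target similarities derived from the pair labels
    name="plwordnet-dev",
)
print(dev_evaluator(model))     # reports Pearson/Spearman correlation of the similarities
```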

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/DCepnAcPcv4EblAmtgu7R.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/TWHyVDItYwNbFEyI0i--n.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/o-CFHkDYw62Lyh1MKvG4M.png)


## How to use

Sentence-Transformers:
``` python
# Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

texts = ["zamek", "drzwi", "wiadro", "horyzont", "ocean"]
emb = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(emb, emb)
print(scores)  # higher = more semantically similar
```
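
For the semantic-search use case, the same embeddings can be fed to `util.semantic_search`; the corpus and query below are only illustrative:

``` python
# Illustrative semantic search over a tiny in-memory corpus.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

corpus = ["zamek królewski", "zamek błyskawiczny", "drzwi wejściowe", "spokojny ocean"]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["warownia"], convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```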

Transformers (feature extraction):
``` python
# Python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

name = "radlab/semantic-euro-bert-encoder-v1"
tok = AutoTokenizer.from_pretrained(name)
mdl = AutoModel.from_pretrained(name, trust_remote_code=True)

texts = ["student", "żak"]
tokens = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = mdl(**tokens)
    # Mean pooling over real tokens only (exclude padding via the attention mask)
    mask = tokens["attention_mask"].unsqueeze(-1).type_as(out.last_hidden_state)
    emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    emb = F.normalize(emb, p=2, dim=1)

sim = emb @ emb.T
print(sim)
```