|
--- |
|
license: apache-2.0 |
|
language: |
|
- pl |
|
- en |
|
- de |
|
base_model: |
|
- EuroBERT/EuroBERT-610m |
|
tags: |
|
- sentence-transformers |
|
- embeddings
|
- plwordnet |
|
- semantic-relations |
|
- semantic-search |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# PLWordNet Semantic Embedder (bi-encoder) |
|
|
|
A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings. |
|
Every relation between lexical units and synsets is transformed into training/evaluation examples. |
|
|
|
The dataset mixes several signals about meanings: usage examples with emotion annotations, definitions, and external descriptions (Wikipedia articles split into sentences).
|
The embedder mirrors these semantic relations: it pulls together embeddings of meanings linked by “positive” relations
(e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart embeddings of meanings linked by “negative”
relations (e.g., antonymy or mutually exclusive relations). Source code and training scripts:
|
- GitHub: [https://github.com/radlab-dev-group/radlab-plwordnet](https://github.com/radlab-dev-group/radlab-plwordnet) |
|
|
|
## Model summary |
|
|
|
- **Architecture**: bi-encoder built with `sentence-transformers` (transformer encoder + pooling). |
|
- **Use cases**: semantic similarity and semantic search for Polish words, senses, definitions, and sentences. |
|
- **Objective**: CosineSimilarityLoss on positive/negative pairs. |
|
- **Behavior**: preserves the topology of semantic relations derived from plWordNet. |
|
|
|
## Training data |
|
|
|
Constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs. |
|
Augmented with: |
|
- definitions, |
|
- usage examples (including emotion annotations where available), |
|
- external descriptions from Wikipedia (split into sentences). |
|
|
|
Positive pairs correspond to relations expected to increase similarity;
negative pairs correspond to relations expected to decrease similarity.
Additional hard/soft negatives may include unrelated meanings.
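
A minimal sketch of this mapping, assuming illustrative relation names and target scores (the exact dataset schema lives in the GitHub repository linked above):

``` python
# Hypothetical mapping from a plWordNet relation to a supervised pair;
# relation names and target scores here are illustrative only.
POSITIVE_RELATIONS = {"synonymy", "hypernymy", "hyponymy"}
NEGATIVE_RELATIONS = {"antonymy"}

def pair_to_example(text_a: str, text_b: str, relation: str):
    """Return a (sentence1, sentence2, score) record, or None if the relation is unused."""
    if relation in POSITIVE_RELATIONS:
        score = 1.0   # pull the two embeddings together
    elif relation in NEGATIVE_RELATIONS:
        score = 0.0   # push the two embeddings apart
    else:
        return None
    return {"sentence1": text_a, "sentence2": text_b, "score": score}

print(pair_to_example("student", "żak", "synonymy"))
```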
|
|
|
## Training details |
|
- **Trainer**: `SentenceTransformerTrainer` |
|
- **Loss**: `CosineSimilarityLoss` |
|
- **Evaluator**: `EmbeddingSimilarityEvaluator` (cosine) |
|
- Typical **hyperparameters**: |
|
- epochs: 5 |
|
- per-device batch size: 10 (gradient accumulation: 4) |
|
- learning rate: 5e-6 (AdamW fused) |
|
- weight decay: 0.01 |
|
  - warmup: 20k steps
|
- fp16: true |
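
A minimal training sketch wiring these settings together, assuming the relation-derived pairs are available as a Hugging Face `Dataset` with two text columns and a `score` column; the dataset contents and output directory below are placeholders, and the actual training scripts are in the GitHub repository:

``` python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Base model fine-tuned into the bi-encoder
model = SentenceTransformer("EuroBERT/EuroBERT-610m", trust_remote_code=True)

# Placeholder pairs; the real dataset is built from plWordNet relations
train_dataset = Dataset.from_dict({
    "sentence1": ["student", "dzień"],
    "sentence2": ["żak", "noc"],
    "score": [1.0, 0.0],  # target cosine similarity per pair
})

args = SentenceTransformerTrainingArguments(
    output_dir="plwordnet-bi-encoder",
    num_train_epochs=5,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_steps=20_000,
    fp16=True,
    optim="adamw_torch_fused",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()
```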
|
|
|
## Evaluation |
|
- **Task**: semantic similarity on dev/test splits built from the relation-derived pairs. |
|
- **Metric**: cosine-based correlation (Spearman/Pearson) where applicable, or discrimination between positive vs. negative pairs. |
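
A minimal evaluation sketch with `EmbeddingSimilarityEvaluator`; the dev pairs and target scores below are placeholders for the relation-derived dev split:

``` python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

# Placeholder dev pairs; the real splits come from the relation-derived dataset
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["student", "dzień"],
    sentences2=["żak", "noc"],
    scores=[1.0, 0.0],
    name="plwordnet-dev",
)
print(evaluator(model))  # includes cosine-based Spearman/Pearson correlations
```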
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
|
|
## How to use |
|
|
|
Sentence-Transformers: |
|
``` python |
|
# Python |
|
from sentence_transformers import SentenceTransformer, util |
|
|
|
model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True) |
|
|
|
texts = ["zamek", "drzwi", "wiadro", "horyzont", "ocean"] |
|
emb = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True) |
|
scores = util.cos_sim(emb, emb) |
|
print(scores) # higher = more semantically similar |
|
``` |
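
For the semantic-search use case, the same embeddings can rank a corpus against a query with `util.semantic_search`; the corpus and query below are only illustrative:

``` python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

corpus = ["zamek w drzwiach", "zamek królewski", "wiadro na wodę", "linia horyzontu"]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["budowla obronna"], convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```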
|
|
|
Transformers (feature extraction): |
|
``` python |
|
# Python |
|
from transformers import AutoModel, AutoTokenizer |
|
import torch |
|
import torch.nn.functional as F |
|
|
|
name = "radlab/semantic-euro-bert-encoder-v1" |
|
tok = AutoTokenizer.from_pretrained(name) |
|
mdl = AutoModel.from_pretrained(name, trust_remote_code=True) |
|
|
|
texts = ["student", "żak"] |
|
tokens = tok(texts, padding=True, truncation=True, return_tensors="pt") |
|
with torch.no_grad():
    out = mdl(**tokens)

# Mask-aware mean pooling over token embeddings (padding tokens are ignored)
mask = tokens["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
emb = F.normalize(emb, p=2, dim=1)
|
|
|
sim = emb @ emb.T |
|
print(sim) |
|
``` |