|
--- |
|
license: apache-2.0 |
|
language: |
|
- pl |
|
- en |
|
- de |
|
base_model: |
|
- EuroBERT/EuroBERT-610m |
|
tags: |
|
- sentence-transformers |
|
- embeddings
|
- plwordnet |
|
- semantic-relations |
|
- semantic-search |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# PLWordNet Semantic Embedder (bi-encoder) |
|
|
|
A Polish semantic embedder trained on pairs constructed from plWordNet (Słowosieć) semantic relations and external descriptions of meanings. |
|
Every relation between lexical units and synsets is transformed into training/evaluation examples. |
|
|
|
The dataset mixes several signals about meanings: usage examples with emotion annotations, definitions, and external descriptions (Wikipedia articles split into sentences).
|
The embedder mirrors these semantic relations: it pulls together embeddings of meanings linked by “positive” relations
(e.g., synonymy, hypernymy/hyponymy as defined in the dataset) and pushes apart embeddings of meanings linked by “negative”
relations (e.g., antonymy or mutually exclusive relations). Source code and training scripts:
|
- GitHub: [https://github.com/radlab-dev-group/radlab-plwordnet](https://github.com/radlab-dev-group/radlab-plwordnet) |
|
|
|
## Model summary |
|
|
|
- **Architecture**: bi-encoder built with `sentence-transformers` (transformer encoder + pooling). |
|
- **Use cases**: semantic similarity and semantic search for Polish words, senses, definitions, and sentences. |
|
- **Objective**: CosineSimilarityLoss on positive/negative pairs. |
|
- **Behavior**: preserves the topology of semantic relations derived from plWordNet. |
|
|
|
## Training data |
|
|
|
Constructed from plWordNet relations between lexical units and synsets; each relation yields example pairs. |
|
Augmented with: |
|
- definitions, |
|
- usage examples (including emotion annotations where available), |
|
- external descriptions from Wikipedia (split into sentences). |
|
|
|
Positive pairs correspond to relations expected to increase similarity;
negative pairs correspond to relations expected to decrease similarity.
Additional hard/soft negatives may include unrelated meanings.
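
A minimal sketch of this mapping, assuming illustrative relation names and target scores (the exact dataset schema lives in the GitHub repository linked above):

``` python
# Hypothetical mapping from a plWordNet relation to a supervised pair;
# relation names and target scores here are illustrative only.
POSITIVE_RELATIONS = {"synonymy", "hypernymy", "hyponymy"}
NEGATIVE_RELATIONS = {"antonymy"}

def pair_to_example(text_a: str, text_b: str, relation: str):
    """Return a (sentence1, sentence2, score) record, or None if the relation is unused."""
    if relation in POSITIVE_RELATIONS:
        score = 1.0   # pull the two embeddings together
    elif relation in NEGATIVE_RELATIONS:
        score = 0.0   # push the two embeddings apart
    else:
        return None
    return {"sentence1": text_a, "sentence2": text_b, "score": score}

print(pair_to_example("student", "żak", "synonymy"))
```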
|
|
|
## Training details |
|
- **Trainer**: `SentenceTransformerTrainer` |
|
- **Loss**: `CosineSimilarityLoss` |
|
- **Evaluator**: `EmbeddingSimilarityEvaluator` (cosine) |
|
- Typical **hyperparameters**: |
|
- epochs: 5 |
|
- per-device batch size: 10 (gradient accumulation: 4) |
|
- learning rate: 5e-6 (AdamW fused) |
|
- weight decay: 0.01 |
|
  - warmup: 20k steps
|
- fp16: true |
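
A minimal training sketch wiring these settings together, assuming the relation-derived pairs are available as a Hugging Face `Dataset` with two text columns and a `score` column; the dataset contents and output directory below are placeholders, and the actual training scripts are in the GitHub repository:

``` python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CosineSimilarityLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

# Base model fine-tuned into the bi-encoder
model = SentenceTransformer("EuroBERT/EuroBERT-610m", trust_remote_code=True)

# Placeholder pairs; the real dataset is built from plWordNet relations
train_dataset = Dataset.from_dict({
    "sentence1": ["student", "dzień"],
    "sentence2": ["żak", "noc"],
    "score": [1.0, 0.0],  # target cosine similarity per pair
})

args = SentenceTransformerTrainingArguments(
    output_dir="plwordnet-bi-encoder",
    num_train_epochs=5,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=4,
    learning_rate=5e-6,
    weight_decay=0.01,
    warmup_steps=20_000,
    fp16=True,
    optim="adamw_torch_fused",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()
```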
|
|
|
## Evaluation |
|
- **Task**: semantic similarity on dev/test splits built from the relation-derived pairs. |
|
- **Metric**: cosine-based correlation (Spearman/Pearson) where applicable, or discrimination between positive vs. negative pairs. |
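
A minimal evaluation sketch with `EmbeddingSimilarityEvaluator`; the dev pairs and target scores below are placeholders for the relation-derived dev split:

``` python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

# Placeholder dev pairs; the real splits come from the relation-derived dataset
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=["student", "dzień"],
    sentences2=["żak", "noc"],
    scores=[1.0, 0.0],
    name="plwordnet-dev",
)
print(evaluator(model))  # includes cosine-based Spearman/Pearson correlations
```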
|
|
|
 |
|
|
|
 |
|
|
|
 |
|
|
|
|
|
## How to use |
|
|
|
Sentence-Transformers: |
|
``` python |
|
# Python |
|
from sentence_transformers import SentenceTransformer, util |
|
|
|
model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True) |
|
|
|
texts = ["zamek", "drzwi", "wiadro", "horyzont", "ocean"] |
|
emb = model.encode(texts, convert_to_tensor=True, normalize_embeddings=True) |
|
scores = util.cos_sim(emb, emb) |
|
print(scores) # higher = more semantically similar |
|
``` |
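
For the semantic-search use case, the same embeddings can rank a corpus against a query with `util.semantic_search`; the corpus and query below are only illustrative:

``` python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("radlab/semantic-euro-bert-encoder-v1", trust_remote_code=True)

corpus = ["zamek w drzwiach", "zamek królewski", "wiadro na wodę", "linia horyzontu"]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["budowla obronna"], convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```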
|
|
|
Transformers (feature extraction): |
|
``` python |
|
# Python |
|
from transformers import AutoModel, AutoTokenizer |
|
import torch |
|
import torch.nn.functional as F |
|
|
|
name = "radlab/semantic-euro-bert-encoder-v1" |
|
tok = AutoTokenizer.from_pretrained(name) |
|
mdl = AutoModel.from_pretrained(name, trust_remote_code=True) |
|
|
|
texts = ["student", "żak"] |
|
tokens = tok(texts, padding=True, truncation=True, return_tensors="pt") |
|
with torch.no_grad():
    out = mdl(**tokens)

# Mask-aware mean pooling over token embeddings (padding tokens are ignored)
mask = tokens["attention_mask"].unsqueeze(-1).float()
emb = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
emb = F.normalize(emb, p=2, dim=1)
|
|
|
sim = emb @ emb.T |
|
print(sim) |
|
``` |