A version of chcaa/dfm-encoder-large-v1 fine-tuned using SimCSE. It was trained as part of the Scandinavian Embedding Benchmarks to establish a naive SimCSE baseline.
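For readers unfamiliar with SimCSE: in its unsupervised form, each sentence is encoded twice with different dropout masks, and the two views form a positive pair against in-batch negatives under an InfoNCE loss. A minimal NumPy sketch of that objective (illustrative only, not the training code; the temperature 0.05 matches the command below):

```python
import numpy as np

def simcse_loss(z1: np.ndarray, z2: np.ndarray, temp: float = 0.05) -> float:
    """InfoNCE loss for paired sentence embeddings z1, z2 of shape (n, d).

    Row i of z1 and row i of z2 are two dropout-induced views of the same
    sentence; all other rows in the batch act as negatives.
    """
    # L2-normalise so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temp  # (n, n) similarity matrix, scaled by temperature
    # cross-entropy with the diagonal (the matching pair) as the target class
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Identical views yield a near-zero loss, while unrelated embeddings push it toward log(batch size), which is what drives the encoder to make dropout-perturbed views of the same sentence similar.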
Hyperparameters
Trained using the SimCSE implementation with the following command, where `data/dfm_paragraphs.txt` contains paragraphs extracted from Danish Gigaword (the inline comment has been moved here because a comment after a line continuation breaks the command):

```bash
CUDA_VISIBLE_DEVICES=0 python train.py \
    --train_file data/dfm_paragraphs.txt \
    --model_name_or_path chcaa/dfm-encoder-large-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --pooler_type cls \
    --mlp_only_train \
    --do_mlm \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16
```
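A hedged usage sketch with Hugging Face `transformers`: since training used `--pooler_type cls` (and `--mlp_only_train` drops the MLP head at inference), sentence embeddings are taken from the `[CLS]` token. The repo id below is the base model and is a placeholder; substitute this model's own checkpoint id.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder: load this SimCSE-tuned checkpoint instead of the base model
model_id = "chcaa/dfm-encoder-large-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

sentences = [
    "København er Danmarks hovedstad.",
    "Hvad hedder hovedstaden i Danmark?",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # CLS pooling: the hidden state of the first token is the sentence embedding
    embeddings = model(**batch).last_hidden_state[:, 0]

similarity = torch.nn.functional.cosine_similarity(
    embeddings[0], embeddings[1], dim=0
)
print(float(similarity))
```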
Citation
To cite this work, please refer to the following article:
Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72
or use the following BibTeX:
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
url = {https://openreview.net/forum?id=pJl_i7HIA72},
language = {en},
urldate = {2024-04-12},
author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
month = feb,
year = {2024},
}