A version of chcaa/dfm-encoder-large-v1 trained using unsupervised SimCSE. It was trained as part of the Scandinavian Embedding Benchmark to establish a naive SimCSE baseline.
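
The model can be used directly for sentence embeddings. Below is a minimal usage sketch (not from the original card) with the transformers library; since training used CLS pooling with `--mlp_only_train`, the raw `[CLS]` hidden state serves as the embedding at inference:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KennethEnevoldsen/dfm-sentence-encoder-large")
model = AutoModel.from_pretrained("KennethEnevoldsen/dfm-sentence-encoder-large")

sentences = ["Jeg gik en tur i parken.", "Hun læste en bog i haven."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# CLS-token embedding; the SimCSE MLP head is only applied during training
# (--mlp_only_train), so it is not needed here.
embeddings = outputs.last_hidden_state[:, 0]
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity.item())
```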

## Hyperparameters

Trained using the SimCSE implementation with:

```bash
# data/dfm_paragraphs.txt contains paragraphs extracted from Danish Gigaword.
CUDA_VISIBLE_DEVICES=0 python train.py \
    --train_file data/dfm_paragraphs.txt \
    --model_name_or_path chcaa/dfm-encoder-large-v1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 128 \
    --learning_rate 1e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --pooler_type cls \
    --mlp_only_train \
    --do_mlm \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --fp16
```
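
For reference, unsupervised SimCSE encodes each sentence twice under independent dropout masks and treats the two encodings as a positive pair, with other in-batch sentences as negatives, optimizing an InfoNCE loss with the temperature set by `--temp`. A schematic sketch of that objective (illustrative only; the actual training is done by the SimCSE repo's train.py):

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """z1, z2: (batch, dim) CLS embeddings of the same sentences under two
    independent dropout masks; positive pairs sit on the diagonal."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temp             # (batch, batch) scaled cosine similarities
    labels = torch.arange(z1.size(0))  # the i-th row's positive is column i
    return F.cross_entropy(sim, labels)
```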

## Citation

To cite this work, please refer to the following article:

Enevoldsen, K., Kardos, M., Muennighoff, N., & Nielbo, K. (2024). The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding. https://openreview.net/forum?id=pJl_i7HIA72

or use the following BibTeX:

```bibtex
@article{enevoldsenScandinavianEmbeddingBenchmarks2024,
    title = {The {Scandinavian} {Embedding} {Benchmarks}: {Comprehensive} {Assessment} of {Multilingual} and {Monolingual} {Text} {Embedding}},
    shorttitle = {The {Scandinavian} {Embedding} {Benchmarks}},
    url = {https://openreview.net/forum?id=pJl_i7HIA72},
    language = {en},
    urldate = {2024-04-12},
    author = {Enevoldsen, Kenneth and Kardos, Márton and Muennighoff, Niklas and Nielbo, Kristoffer},
    month = feb,
    year = {2024},
}
```