SentenceTransformer based on dangvantuan/vietnamese-document-embedding

This is a sentence-transformers model fine-tuned from dangvantuan/vietnamese-document-embedding. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: dangvantuan/vietnamese-document-embedding
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'VietnameseModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
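
The last two modules above can be illustrated without the library. The following is a minimal sketch, assuming `pooling_mode_cls_token: True` means the first token's vector is taken as the sentence embedding and that `Normalize()` rescales it to unit Euclidean length. The helper names `cls_pool` and `l2_normalize` and the 4-dimensional toy vectors are hypothetical; the real model produces 768-dimensional vectors.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector (the CLS position) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

# Toy Transformer output: 3 tokens, 4 dimensions each (hypothetical values).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # CLS token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because every output vector has length 1, cosine similarity between two embeddings reduces to their dot product.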

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("TTHDZ/finetuned_vietnamese-document-embedding")
# Run inference
sentences = [
    'Thách thức nào đặc thù khi huấn luyện AI cho an ninh mạng?',
    'Public_076\nKiến trúc tổng thể của hệ thống phòng thủ AI\nTầng phân tích và học máy\n* Giám sát (Supervised Learning): sử dụng dữ liệu đã gắn nhãn (ví dụ, gói tin tấn công) để dự đoán tấn công đã biết.\n  * Không giám sát (Unsupervised/Anomaly Detection): tìm kiếm mẫu hành vi bất thường, hữu ích với các cuộc tấn công 0-day.\n  * Học bán giám sát và tự giám sát: giảm phụ thuộc vào dữ liệu gắn nhãn khan hiếm.\n  * Học tăng cường (Reinforcement Learning): cho phép hệ thống tự điều chỉnh chính sách phản ứng dựa trên kết quả.',
    'Public_076\nKiến trúc tổng thể của hệ thống phòng thủ AI\nMột giải pháp AI an ninh mạng toàn diện thường bao gồm nhiều lớp:',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1., 1., 1.],
#         [1., 1., 1.],
#         [1., 1., 1.]])
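
For semantic search, the similarity scores are typically used to rank passages against a query. The sketch below shows that ranking step on hand-written 2-dimensional unit vectors rather than real model output, so it runs without downloading anything. The function name `rank_by_similarity` and the toy vectors are hypothetical. It assumes the inputs are already L2-normalized (as this model's outputs are), so the dot product equals the cosine similarity.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_by_similarity(query_vec, passage_vecs):
    """Return (index, score) pairs sorted by descending cosine similarity.

    Assumes all vectors are unit length, so cosine similarity == dot product.
    """
    scores = [(i, dot(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Stand-ins for model.encode(...) output (hypothetical values).
query = [1.0, 0.0]
passages = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, passages)
print([index for index, _ in ranking])  # [2, 1, 0]
```

With real embeddings, `query` would be the encoded question and `passages` the encoded documents.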

Training Details

Training Dataset

Unnamed Dataset

Size: 6,765 training samples
Columns: sentence_0, sentence_1, and sentence_2

Approximate statistics based on the first 1000 samples:

	sentence_0	sentence_1	sentence_2
type	string	string	string
details	min: 9 tokens, mean: 25.87 tokens, max: 77 tokens	min: 25 tokens, mean: 846.87 tokens, max: 8192 tokens	min: 11 tokens, mean: 630.59 tokens, max: 8192 tokens

Samples:

sentence_0	sentence_1	sentence_2
`Theo tài liệu Public_503, bộ kí tự (character set) trong ngôn ngữ lập trình có chức năng gì?`	Public_503 Ngôn ngữ lập trình (NNLT) Các thành phần cơ bản của NNLT bao gồm: * Bộ kí tự (character set) hay bảng chữ cái dùng để viết chương trình. * Cú pháp (syntax) là bộ quy tắc để viết chương trình. * Ngữ nghĩa (semantic) xác định ý nghĩa các thao tác, hành động cần phải thực hiện, ngữ cảnh (context) của các câu lệnh trong chương trình. Hiện đã có hàng nghìn NNLT được thiết kế, và hàng năm lại có thêm nhiều NNLT mới xuất hiện. Sự phát triển của NNLT gắn liền với sự phát triển của ngành tin học. Mỗi loại NNLT phù hợp với một số lớp bài toán nhất định. Phân loại NNLT: * Ngôn ngữ máy (machine language) hay còn gọi là NNLT cấp thấp có tập lệnh phụ thuộc vào một hệ máy cụ thể. Chương trình viết bằng ngôn ngữ máy sử dụng bảng chữ cái chỉ gồm 2 kí tự 0, 1. Chương trình ngôn ngữ máy được nạp trực tiếp vào bộ nhớ và thực hiện ngay. * Ngôn ngữ lập trình cấp cao nói chung không phụ thuộc vào loại máy tính cụ thể. Chương trình viết bằng NNLT cấp cao sử dụng bộ kí tự phon...	Public_183 Transactional outbox pattern? Solution * Nếu mất connection và không publish được event thì một cách đơn giản có thể nghĩ đến là retry. Những thứ cần quan tâm là retry trong bao lâu, bao nhiêu lần? * Cần store message ở đâu để đảm bảo nếu application crash/restart thì vẫn có message để retry: file, database, distributed storage? * Nếu để đảm bảo vừa atomic mà vừa consistent thì chỉ có nhồi vào chung transaction thôi. Có nghĩa là message/event cần được lưu và database và xử lý chung một transaction với business logic? Từ những idea trên, liệu bạn đã mường tượng ra tổng thể solution cần thực hiện là như thế nào chưa? Đi từng bước một nhé.
`Theo tài liệu Public_194, khái niệm Database trong MongoDB là gì?`	Public_194 CÁC THUẬT NGỮ MONGODB THƯỜNG DÙNG Database Trong MongoDB, database là một container vật lý chứa tập hợp các collection. Một database có thể chứa 0 collection hoặc nhiều collection. Một phiên bản máy chủ MongoDB có thể lưu trữ nhiều database và không có giới hạn về số lượng database có thể được lưu trữ trên một phiên bản, nhưng giới hạn ở không gian bộ nhớ ảo có thể được phân bổ bởi hệ điều hành.	Public_194 MONGODB hoạt động như thế nào? MongoDB lưu trữ dữ liệu như thế nào? Như chúng ta biết rằng MongoDB là một máy chủ cơ sở dữ liệu và dữ liệu được lưu trữ trong các cơ sở dữ liệu này. Hay nói cách khác, môi trường MongoDB cung cấp cho bạn một máy chủ mà bạn có thể khởi động và sau đó tạo nhiều cơ sở dữ liệu trên đó bằng MongoDB. Nhờ vào cơ sở dữ liệu NoSQL, dữ liệu được lưu trữ dưới dạng collection và document. Do đó, cơ sở dữ liệu, collection và document có mối liên hệ với nhau như hình dưới đây: \|\| Trong máy chủ MongoDB, bạn có thể tạo nhiều cơ sở dữ liệu và nhiều collection. Cách cơ sở dữ liệu MongoDB chứa các collection cũng giống như cách cơ sở dữ liệu MySQL chứa các table. Bên trong collection, chúng ta có document. Các document này chứa dữ liệu mà bạn muốn lưu trữ trong cơ sở dữ liệu MongoDB và một collection có thể chứa nhiều document. Đồng thời, với tính chất schema-less (không cần một cấu trúc lưu trữ dữ liệu), document này không nhất thiết phải giống với doc...
`Dự án Stellarator nào được đề cập trong tài liệu?`	`Public_096 Công nghệ lò phản ứng Stellarator: * Thiết kế từ trường xoắn phức tạp, giúp plasma ổn định hơn và vận hành liên tục. * Dự án: Wendelstein 7-X (Đức).`	Public_096 Tại sao các nhà khoa học nghiên cứu năng lượng nhiệt hạch (nuclear fusion)? nan Ngay từ khi lý thuyết về phản ứng nhiệt hạch được hiểu rõ vào những năm 1930, các nhà khoa học – và ngày càng nhiều kỹ sư – đã theo đuổi mục tiêu tái tạo và khai thác nguồn năng lượng này. Lý do là nếu có thể tái tạo phản ứng nhiệt hạch trên Trái Đất ở quy mô công nghiệp, nó có thể cung cấp nguồn năng lượng sạch, an toàn và gần như vô hạn với chi phí hợp lý để đáp ứng nhu cầu của thế giới. Phản ứng nhiệt hạch có khả năng tạo ra năng lượng gấp bốn lần so với phân hạch (đang được dùng trong các nhà máy điện hạt nhân) và gần bốn triệu lần so với việc đốt dầu hoặc than tính theo cùng khối lượng nhiên liệu. Hầu hết các thiết kế lò phản ứng nhiệt hạch đang được phát triển đều sử dụng hỗn hợp deuteri và triti – các nguyên tử hydro chứa thêm nơtron. Về lý thuyết, chỉ cần vài gram hai loại nhiên liệu này có thể tạo ra 1 terajoule năng lượng , tương đương lượng năng lượng một người ở quốc g...

Loss: TripletLoss with these parameters:

{
    "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
    "triplet_margin": 5
}
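
To make these parameters concrete, here is a minimal sketch of the per-triplet objective, assuming the usual formulation max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d, and assuming sentence_0, sentence_1, and sentence_2 play the anchor, positive, and negative roles. The function names and toy vectors are hypothetical; this is not the library's implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge on the gap between anchor-positive and anchor-negative distance."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

a = [1.0, 0.0]
opposite = [-1.0, 0.0]

# Anchor equidistant from positive and negative: loss equals the margin.
print(triplet_loss(a, a, a))         # 5.0
# Positive coincides with anchor, negative is diametrically opposite.
print(triplet_loss(a, a, opposite))  # 3.0
# Positive is diametrically opposite, negative coincides with anchor.
print(triplet_loss(a, opposite, a))  # 7.0
```

Two unit-length vectors are at most 2 apart, so when the loss is computed on normalized embeddings with a margin of 5, each triplet contributes between 3 and 7, and exactly 5 when the anchor is equally far from the positive and the negative.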

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 2
per_device_eval_batch_size: 2
fp16: True
multi_dataset_batch_sampler: round_robin

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}
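
The combination of `learning_rate: 5e-05`, `lr_scheduler_type: linear`, and `warmup_steps: 0` implies a learning rate that decays linearly from 5e-05 to zero over the run. The sketch below assumes the standard linear-decay rule and a total of 10,149 optimizer steps, a figure derived here from 6,765 samples, batch size 2, and 3 epochs on a single device rather than stated in the card. The function name `linear_lr` is hypothetical.

```python
base_lr = 5e-05       # learning_rate
warmup_steps = 0      # warmup_steps
total_steps = 10149   # assumption: ceil(6765 / 2) * 3 on a single device

def linear_lr(step):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 5000, 10000, total_steps):
    print(step, linear_lr(step))
```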

Training Logs

Epoch	Step	Training Loss
0.1478	500	5.0049
0.2956	1000	4.9978
0.4434	1500	4.9986
0.5912	2000	4.9993
0.7390	2500	5.0005
0.8868	3000	4.999
1.0346	3500	5.0014
1.1824	4000	4.9977
1.3302	4500	4.9989
1.4780	5000	5.003
1.6258	5500	5.0022
1.7736	6000	4.9975
1.9214	6500	4.9974
2.0692	7000	4.9986
2.2170	7500	5.0003
2.3648	8000	4.9994
2.5126	8500	4.9971
2.6604	9000	5.0037
2.8082	9500	4.997
2.9560	10000	4.9997
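
The Epoch column can be reproduced from the dataset size and batch size. This is a small arithmetic sketch, assuming a single device, `gradient_accumulation_steps: 1`, and a final partial batch that is kept (`dataloader_drop_last: False`). The helper name `epoch_at` is hypothetical.

```python
import math

num_samples = 6765   # training samples
batch_size = 2       # per_device_train_batch_size
num_epochs = 3       # num_train_epochs

# The last, partial batch still counts as a step because drop_last is False.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer updates."""
    return round(step / steps_per_epoch, 4)

print(steps_per_epoch, total_steps)     # 3383 10149
print(epoch_at(500), epoch_at(10000))   # 0.1478 2.956
```

Under these assumptions the run ends at step 10,149, so the last logged row (step 10,000) falls just before the end of the third epoch.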

Framework Versions

Python: 3.12.11
Sentence Transformers: 5.1.0
Transformers: 4.57.1
PyTorch: 2.7.0+cu126
Accelerate: 1.11.0
Datasets: 3.6.0
Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}