sigridjineth committed on
Commit 36b0f2b · verified · 1 Parent(s): 272b86f

Initial model upload

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,66 @@
+ ## **Model Card: ColBERT-ko-embeddinggemma-300m**
+
+ ### **Model Description**
+
+ This is a **ColBERT-style** late-interaction retrieval model based on Google's `embeddinggemma-300m`. It has been fine-tuned on the `williamjeong2/msmarco-triplets-ko-v2` dataset, making it specialized for semantic search and information retrieval tasks in the **Korean language**.
+
+ The model produces token-level embeddings for both queries and documents. This enables accurate and efficient retrieval through the ColBERT MaxSim scoring mechanism, which calculates the relevance between a query and a document at a fine-grained token level.
+
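+ Concretely, for query token embeddings `Q` (`Lq x D`) and document token embeddings `C` (`Ld x D`), MaxSim sums, over query tokens, each token's maximum similarity to any document token. Below is a minimal PyTorch sketch of that score (illustrative only; the batched, mask-aware version used here is `colbert_logits` in `inference.py`):
+
+ ```python
+ import torch
+
+ def maxsim_score(Q: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
+     """Q: (Lq, D) query token embeddings, C: (Ld, D) document token embeddings,
+     both L2-normalized."""
+     sim = Q @ C.T                          # (Lq, Ld) token-level cosine similarities
+     return sim.max(dim=-1).values.sum()    # best document token per query token, summed
+ ```
+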
+ -----
+
+ ### **Performance & Evaluation**
+
+ The model improved steadily and consistently over the course of training. It started from a strong in-batch Recall@1 of ~75-80% and was validated every 50 steps, with checkpoints saved based on validation performance. Validation loss decreased steadily while Recall@1 increased, indicating successful generalization without signs of overfitting.
+
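+ For reference, in-batch Recall@1 here means the fraction of queries whose paired document receives the highest score within its batch. A minimal sketch of that computation (illustrative only, not the exact evaluation code used during training):
+
+ ```python
+ import torch
+
+ def in_batch_recall_at_1(scores: torch.Tensor) -> float:
+     """scores: (B, B) MaxSim scores where document i is the positive for query i."""
+     labels = torch.arange(scores.size(0), device=scores.device)
+     return (scores.argmax(dim=1) == labels).float().mean().item()
+ ```
+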
+ #### **Semantic Inference Example (in Korean)**
+
+ The key strength of the fine-tuned model is its ability to understand semantic context beyond simple keyword matching. In the following challenging example, the fine-tuned model correctly infers the answer, while the original base model fails.
+
+ * **Query:**
+
+ ```
+ "일론 머스크가 설립한 전기차 회사는 어디야?"
+ ```
+
+ * **✅ Fine-tuned Model Results:**
+
+ 1. **`Score: 10.00`**: **테슬라**는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.
+ 2. **`Score: 9.51`**: **스페이스X**는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.
+ 3. **`Score: 8.57`**: 아마존 웹 서비스(AWS)는 클라우드 컴퓨팅 시장의 선두주자입니다.
+
+ * **❌ Original Model Results:**
+
+ 1. **`Score: 8.55`**: 수도권 전철은 서울과 주변 도시를 연결하는 중요한 교통수단입니다.
+ 2. **`Score: 8.54`**: **테슬라**는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.
+ 3. **`Score: 8.39`**: **스페이스X**는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.
+
+ **Analysis**: The fine-tuned model correctly identifies 'Tesla' by understanding the semantic relationship between the query and the document, even with no direct keyword overlap. The original model, by contrast, is distracted by unrelated documents and fails to rank the correct answer first, which demonstrates the impact of the ColBERT fine-tuning process.
+
+ -----
+
+ ### **Intended Uses**
+
+ The primary use case is high-performance semantic search over Korean text. The model encodes queries and documents independently, so it fits a standard two-stage retrieval pipeline (see the sketch after this list):
+
+ 1. **Offline Indexing**: Encode your document corpus into token-level embeddings. Each document is represented as a matrix of vectors (`Ld x D`).
+ 2. **Online Search**: Encode an incoming query into its token-level embeddings (`Lq x D`), then use the efficient **MaxSim** operator to score and rank documents from your index.
+
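+ The sketch below ties the two stages together using the checkpoint layout of this repository (`encoder/`, `proj.pt`, `tokenizer/`) and the `ColBERTEncoder` / `colbert_logits` helpers defined in `inference.py`. The local path and example texts are placeholders:
+
+ ```python
+ import os
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+ # ColBERTEncoder and colbert_logits as defined in inference.py
+
+ CKPT = "./ColBERT-ko-embeddinggemma-300m"   # local clone of this repo (illustrative path)
+ tok = AutoTokenizer.from_pretrained(os.path.join(CKPT, "tokenizer"))
+ model = ColBERTEncoder("google/embeddinggemma-300m", colbert_dim=128)
+ model.encoder = AutoModel.from_pretrained(os.path.join(CKPT, "encoder"))
+ model.proj.load_state_dict(torch.load(os.path.join(CKPT, "proj.pt"), map_location="cpu"))
+ model.eval()
+
+ # 1) Offline: encode documents into token-level embeddings (Ld x D per document)
+ docs = ["한국어 문서 1", "한국어 문서 2"]
+ d_in = tok(docs, padding=True, truncation=True, max_length=1024, return_tensors="pt")
+ with torch.no_grad():
+     D = model(**d_in)
+
+ # 2) Online: encode the query (Lq x D) and rank documents by MaxSim
+ q_in = tok("한국어 질문", truncation=True, max_length=128, return_tensors="pt")
+ with torch.no_grad():
+     Q = model(**q_in)
+ scores = colbert_logits(Q, q_in["attention_mask"], D, d_in["attention_mask"])
+ print(scores)  # one relevance score per document
+ ```
+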
+ -----
+
+ ### **Training Procedure**
+
+ The model was trained on an 8-GPU setup with the Hugging Face Accelerate library, using in-batch and cross-device negatives (a sketch of this objective follows the list below).
+
+ * **Base Model**: `google/embeddinggemma-300m`
+ * **Dataset**: `williamjeong2/msmarco-triplets-ko-v2` (train split)
+ * **Key Hyperparameters**:
+   * Precision: `bf16`
+   * Query Max Length: `128`
+   * Document Max Length: `1024`
+   * Learning Rate: `5e-6` (base encoder) & `1e-4` (projection head)
+   * Effective Batch Size: `512` (32 per device × 8 devices × 2 gradient accumulation steps)
+   * Epochs: `1`
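+
+ The exact training script is not included in this repository. As a rough illustration only, an in-batch negatives objective of this kind can be written as a cross-entropy loss over the batch's MaxSim score matrix, where document *i* is the positive for query *i* (the cross-device gather is omitted; names are illustrative):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ # colbert_logits as defined in inference.py: for batched inputs it returns
+ # a (num_queries, num_docs) matrix of MaxSim scores.
+
+ def in_batch_colbert_loss(Hq, Mq, Hd, Md):
+     """Hq: (B, Lq, D) query embeddings, Hd: (B, Ld, D) document embeddings."""
+     scores = colbert_logits(Hq, Mq, Hd, Md)            # (B, B) MaxSim score matrix
+     labels = torch.arange(scores.size(0), device=scores.device)
+     return F.cross_entropy(scores, labels)             # other in-batch docs act as negatives
+ ```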
encoder/config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "_sliding_window_pattern": 6,
+   "architectures": [
+     "Gemma3TextModel"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attn_logit_softcapping": null,
+   "bos_token_id": 2,
+   "dtype": "float32",
+   "eos_token_id": 1,
+   "final_logit_softcapping": null,
+   "head_dim": 256,
+   "hidden_activation": "gelu_pytorch_tanh",
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_types": [
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 2048,
+   "model_type": "gemma3_text",
+   "num_attention_heads": 3,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 1,
+   "pad_token_id": 0,
+   "query_pre_attn_scalar": 256,
+   "rms_norm_eps": 1e-06,
+   "rope_local_base_freq": 10000.0,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": 512,
+   "transformers_version": "4.56.1",
+   "use_bidirectional_attention": true,
+   "use_cache": true,
+   "vocab_size": 262144
+ }
encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93fb7b82f5437f5b5e17cb4ee0e33c8cc7031a105a590e1b1bfe6c50dbe428e9
+ size 1211486072
inference.py ADDED
@@ -0,0 +1,120 @@
+ import os
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+ from typing import List
+
+ # ---------------------------------------------------
+ # Model class and scoring function used in the training script (unchanged)
+ # ---------------------------------------------------
+
+ class ColBERTEncoder(nn.Module):
+     def __init__(self, model_name: str, colbert_dim: int):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(model_name)
+         hidden = self.encoder.config.hidden_size
+         self.proj = nn.Linear(hidden, colbert_dim, bias=False)
+
+     def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+         out = self.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+         H = out.last_hidden_state          # (B, L, hidden)
+         H = self.proj(H)                   # (B, L, colbert_dim)
+         H = F.normalize(H, p=2, dim=-1)    # L2-normalize each token embedding
+         return H
+
+ def colbert_logits(
+     Q: torch.Tensor, MQ: torch.Tensor,
+     D: torch.Tensor, MD: torch.Tensor,
+ ) -> torch.Tensor:
+     # Similarity between every query token and every document token
+     sim = torch.einsum("qd,kd->qk", Q.view(-1, Q.size(-1)), D.view(-1, D.size(-1)))
+     sim = sim.view(Q.size(0), Q.size(1), D.size(0), D.size(1))         # (Bq, Lq, Bd, Ld)
+     sim = sim.masked_fill(~MD.bool().unsqueeze(0).unsqueeze(1), -1e4)  # ignore document padding
+     sim = sim.max(dim=-1).values                                       # MaxSim over document tokens
+     sim = sim.masked_fill(~MQ.bool().unsqueeze(-1), 0)                 # ignore query padding
+     scores = sim.sum(dim=1)                                            # sum over query tokens -> (Bq, Bd)
+     return scores.squeeze(0)
+
+ # ---------------------------------------------------
+ # Helper for running inference and printing the results
+ # ---------------------------------------------------
+ def run_inference(model: ColBERTEncoder, tokenizer: AutoTokenizer, query: str, documents: List[str], device: torch.device):
+     """Run retrieval with the given model and print the ranked results."""
+     # Encode the query and the documents
+     with torch.no_grad():
+         q_inputs = tokenizer(query, return_tensors="pt", max_length=64, truncation=True).to(device)
+         Hq = model(**q_inputs)
+
+         d_inputs = tokenizer(documents, padding=True, truncation=True, return_tensors="pt", max_length=192).to(device)
+         Hd = model(**d_inputs)
+
+     # Compute the ColBERT MaxSim score for each document
+     scores = []
+     for i in range(len(documents)):
+         score = colbert_logits(
+             Q=Hq, MQ=q_inputs['attention_mask'],
+             D=Hd[i].unsqueeze(0), MD=d_inputs['attention_mask'][i].unsqueeze(0)
+         )
+         scores.append(score.item())
+
+     # Print the ranked results
+     ranked_results = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
+     for i, (score, doc) in enumerate(ranked_results):
+         print(f"  Rank {i+1} (Score: {score:.2f}): {doc}")
+
+ # ---------------------------------------------------
+ # Main comparison logic
+ # ---------------------------------------------------
+ def main():
+     # --- ⚠️ Settings the user needs to adjust ---
+     MODEL_NAME = "google/embeddinggemma-300m"
+     COLBERT_DIM = 128
+     CHECKPOINT_PATH = "ckpts_dist/vB/epoch1"
+     # ---------------------------------------------
+
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     print(f"Using device: {device}\n")
+
+     # 1. Load the fine-tuned model
+     print("Loading fine-tuned model...")
+     tokenizer = AutoTokenizer.from_pretrained(os.path.join(CHECKPOINT_PATH, "tokenizer"))
+     finetuned_model = ColBERTEncoder(MODEL_NAME, COLBERT_DIM).to(device)
+     finetuned_model.encoder = AutoModel.from_pretrained(os.path.join(CHECKPOINT_PATH, "encoder")).to(device)
+     proj_path = os.path.join(CHECKPOINT_PATH, "proj.pt")
+     finetuned_model.proj.load_state_dict(torch.load(proj_path, map_location=device))
+     finetuned_model.eval()
+     print("Fine-tuned model loaded.")
+
+     # 2. Load the original (pre-trained) model
+     print("\nLoading original (pre-trained) model for comparison...")
+     original_model = ColBERTEncoder(MODEL_NAME, COLBERT_DIM).to(device)
+     # The encoder is loaded directly from the Hugging Face Hub; the proj layer stays randomly initialized.
+     original_model.eval()
+     print("Original model loaded.")
+
+     # 3. Define the query and the documents to search over
+     query = "일론 머스크가 설립한 전기차 회사는 어디야?"
+     documents = [
+         "스페이스X는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.",  # same founder as the answer, different topic (strong distractor 1)
+         "테슬라는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.",  # ✅ correct answer with no keyword overlap
+         "아마존 웹 서비스(AWS)는 클라우드 컴퓨팅 시장의 선두주자입니다.",  # unrelated content
+         "일본의 수도는 도쿄입니다. 벚꽃이 아름다운 도시죠.",
+         "대한민국의 수도는 서울입니다. 서울은 경제와 문화의 중심지입니다.",
+         "수도권 전철은 서울과 주변 도시를 연결하는 중요한 교통수단입니다.",
+         "프랑스의 수도는 파리이며, 에펠탑으로 유명합니다.",
+     ]
+
+     print("\n" + "=" * 50)
+     print(f"Query: {query}")
+     print("=" * 50 + "\n")
+
+     # 4. Run inference with each model and compare the results
+     print("--- 1. ✅ Fine-tuned Model Results ---")
+     run_inference(finetuned_model, tokenizer, query, documents, device)
+
+     print("\n--- 2. ❌ Original Model Results ---")
+     run_inference(original_model, tokenizer, query, documents, device)
+
+
+ if __name__ == "__main__":
+     main()
proj.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:893407d0a6a7475f553117c9ef6fb52d6e8a11e98492ae915071d14568992713
+ size 394772
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "boi_token": "<start_of_image>",
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eoi_token": "<end_of_image>",
+   "eos_token": {
+     "content": "<eos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image_soft_token>",
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6852f8d561078cc0cebe70ca03c5bfdd0d60a45f9d2e0e1e4cc05b68e9ec329e
+ size 33385008
tokenizer/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff