sigridjineth committed on
Commit 36b0f2b · verified · 1 Parent(s): 272b86f

Initial model upload

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,66 @@
+ ## **Model Card: ColBERT-ko-embeddinggemma-300m**
+
+ ### **Model Description**
+
+ This is a **ColBERT-style** late-interaction retrieval model based on Google's `embeddinggemma-300m`. It has been fine-tuned on the `williamjeong2/msmarco-triplets-ko-v2` dataset, making it specialized for semantic search and information retrieval tasks in the **Korean language**.
+
+ The model produces token-level embeddings for both queries and documents. This enables accurate and efficient retrieval through the ColBERT MaxSim scoring mechanism, which calculates the relevance between a query and a document at a fine-grained token level.
+
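+ Concretely, for query token embeddings `Q` (`Lq x D`) and document token embeddings `C` (`Ld x D`), MaxSim sums, over query tokens, each token's maximum similarity to any document token. Below is a minimal PyTorch sketch of that score (illustrative only; the batched, mask-aware version used here is `colbert_logits` in `inference.py`):
+
+ ```python
+ import torch
+
+ def maxsim_score(Q: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
+     """Q: (Lq, D) query token embeddings, C: (Ld, D) document token embeddings,
+     both L2-normalized."""
+     sim = Q @ C.T                          # (Lq, Ld) token-level cosine similarities
+     return sim.max(dim=-1).values.sum()    # best document token per query token, summed
+ ```
+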
+ -----
+
+ ### **Performance & Evaluation**
+
+ The model improved steadily and consistently over the course of training. It started from a strong in-batch Recall@1 of ~75-80% and was validated every 50 steps, with checkpoints saved based on validation performance. Validation loss decreased steadily while Recall@1 increased, indicating successful generalization without signs of overfitting.
+
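+ For reference, in-batch Recall@1 here means the fraction of queries whose paired document receives the highest score within its batch. A minimal sketch of that computation (illustrative only, not the exact evaluation code used during training):
+
+ ```python
+ import torch
+
+ def in_batch_recall_at_1(scores: torch.Tensor) -> float:
+     """scores: (B, B) MaxSim scores where document i is the positive for query i."""
+     labels = torch.arange(scores.size(0), device=scores.device)
+     return (scores.argmax(dim=1) == labels).float().mean().item()
+ ```
+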
+ #### **Semantic Inference Example (in Korean)**
+
+ The key strength of the fine-tuned model is its ability to understand semantic context beyond simple keyword matching. In the following challenging example, the fine-tuned model correctly infers the answer, while the original base model fails.
+
+ * **Query:**
+
+ ```
+ "일론 머스크가 설립한 전기차 회사는 어디야?"
+ ```
+
+ * **✅ Fine-tuned Model Results:**
+
+ 1. **`Score: 10.00`**: **테슬라**는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.
+ 2. **`Score: 9.51`**: **스페이스X**는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.
+ 3. **`Score: 8.57`**: 아마존 웹 서비스(AWS)는 클라우드 컴퓨팅 시장의 선두주자입니다.
+
+ * **❌ Original Model Results:**
+
+ 1. **`Score: 8.55`**: 수도권 전철은 서울과 주변 도시를 연결하는 중요한 교통수단입니다.
+ 2. **`Score: 8.54`**: **테슬라**는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.
+ 3. **`Score: 8.39`**: **스페이스X**는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.
+
+ **Analysis**: The fine-tuned model correctly identifies 'Tesla' by understanding the semantic relationship between the query and the document, even with no direct keyword overlap. The original model, by contrast, is distracted by unrelated documents and fails to rank the correct answer first, which demonstrates the impact of the ColBERT fine-tuning process.
+
+ -----
+
+ ### **Intended Uses**
+
+ The primary use case is high-performance semantic search over Korean text. The model encodes queries and documents independently, so it fits a standard two-stage retrieval pipeline (see the sketch after this list):
+
+ 1. **Offline Indexing**: Encode your document corpus into token-level embeddings. Each document is represented as a matrix of vectors (`Ld x D`).
+ 2. **Online Search**: Encode an incoming query into its token-level embeddings (`Lq x D`), then use the efficient **MaxSim** operator to score and rank documents from your index.
+
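+ The sketch below ties the two stages together using the checkpoint layout of this repository (`encoder/`, `proj.pt`, `tokenizer/`) and the `ColBERTEncoder` / `colbert_logits` helpers defined in `inference.py`. The local path and example texts are placeholders:
+
+ ```python
+ import os
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+ # ColBERTEncoder and colbert_logits as defined in inference.py
+
+ CKPT = "./ColBERT-ko-embeddinggemma-300m"   # local clone of this repo (illustrative path)
+ tok = AutoTokenizer.from_pretrained(os.path.join(CKPT, "tokenizer"))
+ model = ColBERTEncoder("google/embeddinggemma-300m", colbert_dim=128)
+ model.encoder = AutoModel.from_pretrained(os.path.join(CKPT, "encoder"))
+ model.proj.load_state_dict(torch.load(os.path.join(CKPT, "proj.pt"), map_location="cpu"))
+ model.eval()
+
+ # 1) Offline: encode documents into token-level embeddings (Ld x D per document)
+ docs = ["한국어 문서 1", "한국어 문서 2"]
+ d_in = tok(docs, padding=True, truncation=True, max_length=1024, return_tensors="pt")
+ with torch.no_grad():
+     D = model(**d_in)
+
+ # 2) Online: encode the query (Lq x D) and rank documents by MaxSim
+ q_in = tok("한국어 질문", truncation=True, max_length=128, return_tensors="pt")
+ with torch.no_grad():
+     Q = model(**q_in)
+ scores = colbert_logits(Q, q_in["attention_mask"], D, d_in["attention_mask"])
+ print(scores)  # one relevance score per document
+ ```
+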
+ -----
+
+ ### **Training Procedure**
+
+ The model was trained on an 8-GPU setup with the Hugging Face Accelerate library, using in-batch and cross-device negatives (a sketch of this objective follows the list below).
+
+ * **Base Model**: `google/embeddinggemma-300m`
+ * **Dataset**: `williamjeong2/msmarco-triplets-ko-v2` (train split)
+ * **Key Hyperparameters**:
+   * Precision: `bf16`
+   * Query Max Length: `128`
+   * Document Max Length: `1024`
+   * Learning Rate: `5e-6` (base encoder) & `1e-4` (projection head)
+   * Effective Batch Size: `512` (32 per device × 8 devices × 2 gradient accumulation steps)
+   * Epochs: `1`
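+
+ The exact training script is not included in this repository. As a rough illustration only, an in-batch negatives objective of this kind can be written as a cross-entropy loss over the batch's MaxSim score matrix, where document *i* is the positive for query *i* (the cross-device gather is omitted; names are illustrative):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ # colbert_logits as defined in inference.py: for batched inputs it returns
+ # a (num_queries, num_docs) matrix of MaxSim scores.
+
+ def in_batch_colbert_loss(Hq, Mq, Hd, Md):
+     """Hq: (B, Lq, D) query embeddings, Hd: (B, Ld, D) document embeddings."""
+     scores = colbert_logits(Hq, Mq, Hd, Md)            # (B, B) MaxSim score matrix
+     labels = torch.arange(scores.size(0), device=scores.device)
+     return F.cross_entropy(scores, labels)             # other in-batch docs act as negatives
+ ```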
encoder/config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "_sliding_window_pattern": 6,
+   "architectures": [
+     "Gemma3TextModel"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "attn_logit_softcapping": null,
+   "bos_token_id": 2,
+   "dtype": "float32",
+   "eos_token_id": 1,
+   "final_logit_softcapping": null,
+   "head_dim": 256,
+   "hidden_activation": "gelu_pytorch_tanh",
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_types": [
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "sliding_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 2048,
+   "model_type": "gemma3_text",
+   "num_attention_heads": 3,
+   "num_hidden_layers": 24,
+   "num_key_value_heads": 1,
+   "pad_token_id": 0,
+   "query_pre_attn_scalar": 256,
+   "rms_norm_eps": 1e-06,
+   "rope_local_base_freq": 10000.0,
+   "rope_scaling": null,
+   "rope_theta": 1000000.0,
+   "sliding_window": 512,
+   "transformers_version": "4.56.1",
+   "use_bidirectional_attention": true,
+   "use_cache": true,
+   "vocab_size": 262144
+ }
encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:93fb7b82f5437f5b5e17cb4ee0e33c8cc7031a105a590e1b1bfe6c50dbe428e9
+ size 1211486072
inference.py ADDED
@@ -0,0 +1,120 @@
+ import os
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+ from typing import List
+
+ # ---------------------------------------------------
+ # Model class and scoring function used in the training script (unchanged)
+ # ---------------------------------------------------
+
+ class ColBERTEncoder(nn.Module):
+     def __init__(self, model_name: str, colbert_dim: int):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(model_name)
+         hidden = self.encoder.config.hidden_size
+         self.proj = nn.Linear(hidden, colbert_dim, bias=False)
+
+     def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+         out = self.encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
+         H = out.last_hidden_state          # (B, L, hidden)
+         H = self.proj(H)                   # (B, L, colbert_dim)
+         H = F.normalize(H, p=2, dim=-1)    # L2-normalize each token embedding
+         return H
+
+ def colbert_logits(
+     Q: torch.Tensor, MQ: torch.Tensor,
+     D: torch.Tensor, MD: torch.Tensor,
+ ) -> torch.Tensor:
+     # Similarity between every query token and every document token
+     sim = torch.einsum("qd,kd->qk", Q.view(-1, Q.size(-1)), D.view(-1, D.size(-1)))
+     sim = sim.view(Q.size(0), Q.size(1), D.size(0), D.size(1))         # (Bq, Lq, Bd, Ld)
+     sim = sim.masked_fill(~MD.bool().unsqueeze(0).unsqueeze(1), -1e4)  # ignore document padding
+     sim = sim.max(dim=-1).values                                       # MaxSim over document tokens
+     sim = sim.masked_fill(~MQ.bool().unsqueeze(-1), 0)                 # ignore query padding
+     scores = sim.sum(dim=1)                                            # sum over query tokens -> (Bq, Bd)
+     return scores.squeeze(0)
+
+ # ---------------------------------------------------
+ # Helper for running inference and printing the results
+ # ---------------------------------------------------
+ def run_inference(model: ColBERTEncoder, tokenizer: AutoTokenizer, query: str, documents: List[str], device: torch.device):
+     """Run retrieval with the given model and print the ranked results."""
+     # Encode the query and the documents
+     with torch.no_grad():
+         q_inputs = tokenizer(query, return_tensors="pt", max_length=64, truncation=True).to(device)
+         Hq = model(**q_inputs)
+
+         d_inputs = tokenizer(documents, padding=True, truncation=True, return_tensors="pt", max_length=192).to(device)
+         Hd = model(**d_inputs)
+
+     # Compute the ColBERT MaxSim score for each document
+     scores = []
+     for i in range(len(documents)):
+         score = colbert_logits(
+             Q=Hq, MQ=q_inputs['attention_mask'],
+             D=Hd[i].unsqueeze(0), MD=d_inputs['attention_mask'][i].unsqueeze(0)
+         )
+         scores.append(score.item())
+
+     # Print the ranked results
+     ranked_results = sorted(zip(scores, documents), key=lambda x: x[0], reverse=True)
+     for i, (score, doc) in enumerate(ranked_results):
+         print(f"  Rank {i+1} (Score: {score:.2f}): {doc}")
+
+ # ---------------------------------------------------
+ # Main comparison logic
+ # ---------------------------------------------------
+ def main():
+     # --- ⚠️ Settings the user needs to adjust ---
+     MODEL_NAME = "google/embeddinggemma-300m"
+     COLBERT_DIM = 128
+     CHECKPOINT_PATH = "ckpts_dist/vB/epoch1"
+     # ---------------------------------------------
+
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     print(f"Using device: {device}\n")
+
+     # 1. Load the fine-tuned model
+     print("Loading fine-tuned model...")
+     tokenizer = AutoTokenizer.from_pretrained(os.path.join(CHECKPOINT_PATH, "tokenizer"))
+     finetuned_model = ColBERTEncoder(MODEL_NAME, COLBERT_DIM).to(device)
+     finetuned_model.encoder = AutoModel.from_pretrained(os.path.join(CHECKPOINT_PATH, "encoder")).to(device)
+     proj_path = os.path.join(CHECKPOINT_PATH, "proj.pt")
+     finetuned_model.proj.load_state_dict(torch.load(proj_path, map_location=device))
+     finetuned_model.eval()
+     print("Fine-tuned model loaded.")
+
+     # 2. Load the original (pre-trained) model
+     print("\nLoading original (pre-trained) model for comparison...")
+     original_model = ColBERTEncoder(MODEL_NAME, COLBERT_DIM).to(device)
+     # The encoder is loaded directly from the Hugging Face Hub; the proj layer stays randomly initialized.
+     original_model.eval()
+     print("Original model loaded.")
+
+     # 3. Define the query and the documents to search over
+     query = "일론 머스크가 설립한 전기차 회사는 어디야?"
+     documents = [
+         "스페이스X는 재사용 가능한 로켓을 개발하여 우주 탐사 비용을 크게 낮췄습니다.",  # same founder as the answer, different topic (strong distractor 1)
+         "테슬라는 모델 S, 3, X, Y를 생산하며 오토파일럿 기능으로 유명합니다.",  # ✅ correct answer with no keyword overlap
+         "아마존 웹 서비스(AWS)는 클라우드 컴퓨팅 시장의 선두주자입니다.",  # unrelated content
+         "일본의 수도는 도쿄입니다. 벚꽃이 아름다운 도시죠.",
+         "대한민국의 수도는 서울입니다. 서울은 경제와 문화의 중심지입니다.",
+         "수도권 전철은 서울과 주변 도시를 연결하는 중요한 교통수단입니다.",
+         "프랑스의 수도는 파리이며, 에펠탑으로 유명합니다.",
+     ]
+
+     print("\n" + "=" * 50)
+     print(f"Query: {query}")
+     print("=" * 50 + "\n")
+
+     # 4. Run inference with each model and compare the results
+     print("--- 1. ✅ Fine-tuned Model Results ---")
+     run_inference(finetuned_model, tokenizer, query, documents, device)
+
+     print("\n--- 2. ❌ Original Model Results ---")
+     run_inference(original_model, tokenizer, query, documents, device)
+
+
+ if __name__ == "__main__":
+     main()
proj.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:893407d0a6a7475f553117c9ef6fb52d6e8a11e98492ae915071d14568992713
+ size 394772
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "boi_token": "<start_of_image>",
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eoi_token": "<end_of_image>",
+   "eos_token": {
+     "content": "<eos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image_soft_token>",
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6852f8d561078cc0cebe70ca03c5bfdd0d60a45f9d2e0e1e4cc05b68e9ec329e
+ size 33385008
tokenizer/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff