---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
  - dpo
  - peft
  - llama
  - preference-learning
model-index:
  - name: llama3-dpo-llm-judge
    results: []
---

# Llama-3.2-1B DPO LLM Judge

This model is a fine-tuned version of meta-llama/Llama-3.2-1B-Instruct using Direct Preference Optimization (DPO).

## Model Details

- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: Direct Preference Optimization (DPO)
- Preference Source: LLM Judge
- LoRA Configuration (see the sketch after this list):
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- Training Steps: 250
- Learning Rate: 0.0002
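
These values map directly onto a PEFT `LoraConfig`. A minimal sketch, assuming the standard `peft` API; fields not listed above (such as `lora_dropout`) are illustrative, not taken from this card:

```python
from peft import LoraConfig

# LoRA adapter settings matching the hyperparameters listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumption: not specified on this card
    bias="none",         # assumption
    task_type="CAUSAL_LM",
)
```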

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the DPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```
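
With the adapter loaded, generation works like any other `transformers` causal LM. A short sketch; the prompt and generation settings below are illustrative:

```python
# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```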

## Training Details

- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM Judge
- Training framework: TRL `DPOTrainer` (a training sketch follows this list)
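
A rough outline of how a comparable run could be set up with TRL's `DPOTrainer`, reusing `base_model`, `tokenizer`, and `lora_config` from the snippets above. This is a sketch under assumptions: the dataset file, batch size, and `beta` are illustrative, and the argument names follow recent TRL releases (older versions pass the tokenizer as `tokenizer=` rather than `processing_class=`):

```python
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Preference pairs with "prompt", "chosen", and "rejected" columns, built from
# 5 sampled responses per LIMA instruction and ranked by an LLM judge.
# The file name is a placeholder, not an artifact shipped with this repo.
train_dataset = load_dataset("json", data_files="llm_judge_preferences.json", split="train")

training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",
    max_steps=250,                   # from this card
    learning_rate=2e-4,              # from this card
    per_device_train_batch_size=2,   # assumption
    beta=0.1,                        # assumption (TRL default)
    logging_steps=10,
)

trainer = DPOTrainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=lora_config,
)
trainer.train()
```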

## Performance

See evaluation results in the repository for detailed performance metrics.