---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- peft
- llama
- preference-learning
model-index:
- name: llama3-dpo-llm-judge
  results: []
---

# Llama-3.2-1B DPO LLM Judge

This model is a LoRA fine-tune of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), trained with Direct Preference Optimization (DPO) on preference pairs ranked by an LLM judge.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: Direct Preference Optimization (DPO)
- **Preference Source**: LLM judge
- **LoRA Configuration** (reconstructed as code below):
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- **Training Steps**: 250
- **Learning Rate**: 0.0002
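
The LoRA settings above correspond roughly to the following PEFT configuration. This is a sketch reconstructed from the list, not an exported config; `lora_dropout` and `bias` are not stated in this card, so PEFT defaults are assumed.

```python
from peft import LoraConfig

# Reconstructed from the Model Details list; dropout and bias settings use
# PEFT defaults because the card does not state them.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```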

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the DPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")

# The adapter keeps the base vocabulary, so the base tokenizer is used as-is.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```
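
Once loaded, the model can be queried with the standard Llama 3 chat template. The prompt below is only illustrative:

```python
# Illustrative generation example; the prompt and sampling settings are arbitrary.
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```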

## Training Details

- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM judge
- Training framework: TRL `DPOTrainer` (a configuration sketch follows below)
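
A minimal sketch of this pipeline with TRL is shown below. The dataset rows, `beta`, and `output_dir` are assumptions for illustration (the card does not state them), and the exact `DPOTrainer` keyword names vary across TRL versions:

```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Each row pairs the judge-preferred response ("chosen") against a rejected
# one for the same prompt; the contents here are placeholders.
preference_data = Dataset.from_dict({
    "prompt": ["<LIMA instruction>"],
    "chosen": ["<response the LLM judge preferred>"],
    "rejected": ["<response the LLM judge ranked lower>"],
})

training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",  # assumed
    max_steps=250,                      # "Training Steps" above
    learning_rate=2e-4,                 # "Learning Rate" above
    beta=0.1,                           # assumed; not stated in the card
)

trainer = DPOTrainer(
    model=base_model,                # from the Usage snippet
    args=training_args,
    train_dataset=preference_data,
    processing_class=tokenizer,      # "tokenizer=" in older TRL releases
    peft_config=lora_config,         # from the LoRA sketch above
)
trainer.train()
```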

## Performance

See the evaluation results in the repository for detailed performance metrics.