---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- peft
- llama
- preference-learning
model-index:
- name: llama3-dpo-llm-judge
results: []
---
# Llama-3.2-1B DPO LLM Judge
This model is a fine-tuned version of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) using Direct Preference Optimization (DPO).
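DPO optimizes the policy directly on preference pairs, without training a separate reward model. For background (this is the standard DPO objective from Rafailov et al., 2023, not something specific to this card), the loss for a prompt \\(x\\) with chosen response \\(y_w\\) and rejected response \\(y_l\\) is:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where \\(\pi_{\text{ref}}\\) is the frozen base model and \\(\beta\\) controls how strongly the policy is kept close to it.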
## Model Details
- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: Direct Preference Optimization (DPO)
- **Preference Source**: LLM Judge
- **LoRA Configuration** (see the sketch after this list):
- r: 8
- alpha: 16
- target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
- **Training Steps**: 250
- **Learning Rate**: 0.0002
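The adapter configuration above corresponds roughly to the following `peft` setup (a minimal sketch; the dropout and bias settings are assumptions, since the card does not state them):
```python
from peft import LoraConfig

# Reconstruction of the adapter described in Model Details
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumption: not stated in this card
    bias="none",         # assumption: not stated in this card
    task_type="CAUSAL_LM",
)
```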
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# Attach the DPO-trained LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```
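Continuing from the snippet above, generation works as with any causal LM. A minimal example (the prompt and generation settings are illustrative, not from this card):
```python
# Build a chat prompt with the base model's chat template and generate
messages = [{"role": "user", "content": "Give me three tips for writing clear documentation."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```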
## Training Details
- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM Judge
- Training framework: TRL DPOTrainer (sketched below)
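A rough sketch of how such a run is typically set up with TRL. It assumes `base_model`, `tokenizer`, and `lora_config` as loaded/defined earlier in this card, plus a hypothetical `preference_dataset` with `prompt`/`chosen`/`rejected` columns built from the LLM judge's rankings; the batch size and `beta` are assumptions, as the card does not state them:
```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",
    max_steps=250,                   # Training Steps above
    learning_rate=2e-4,              # Learning Rate above
    per_device_train_batch_size=2,   # assumption: not stated in this card
    beta=0.1,                        # assumption: TRL's default DPO beta
)

trainer = DPOTrainer(
    model=base_model,                  # meta-llama/Llama-3.2-1B-Instruct
    args=training_args,
    train_dataset=preference_dataset,  # pairs ranked by the LLM judge
    processing_class=tokenizer,        # `tokenizer=` in older TRL versions
    peft_config=lora_config,           # LoRA config from Model Details
)
trainer.train()
```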
## Performance
See evaluation results in the repository for detailed performance metrics.