---
license: apache-2.0
base_model: meta-llama/Llama-3.2-1B-Instruct
tags:
- dpo
- peft
- llama
- preference-learning
model-index:
- name: llama3-dpo-llm-judge
  results: []
---

# Llama-3.2-1B DPO LLM Judge

This model is a LoRA fine-tune of [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), trained with Direct Preference Optimization (DPO) on preference pairs ranked by an LLM judge.

## Model Details

- **Base Model**: meta-llama/Llama-3.2-1B-Instruct
- **Training Method**: Direct Preference Optimization (DPO)
- **Preference Source**: LLM judge
- **LoRA Configuration** (reconstructed as code below):
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- **Training Steps**: 250
- **Learning Rate**: 0.0002
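
The LoRA settings above correspond roughly to the following PEFT configuration. This is a sketch reconstructed from the list, not an exported config; `lora_dropout` and `bias` are not stated in this card, so PEFT defaults are assumed.

```python
from peft import LoraConfig

# Reconstructed from the Model Details list; dropout and bias settings use
# PEFT defaults because the card does not state them.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```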

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the DPO-trained LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-llm-judge")

# The adapter keeps the base vocabulary, so the base tokenizer is used as-is.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```
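
Once loaded, the model can be queried with the standard Llama 3 chat template. The prompt below is only illustrative:

```python
# Illustrative generation example; the prompt and sampling settings are arbitrary.
messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```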

## Training Details

- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: LLM judge
- Training framework: TRL `DPOTrainer` (a configuration sketch follows below)
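
A minimal sketch of this pipeline with TRL is shown below. The dataset rows, `beta`, and `output_dir` are assumptions for illustration (the card does not state them), and the exact `DPOTrainer` keyword names vary across TRL versions:

```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Each row pairs the judge-preferred response ("chosen") against a rejected
# one for the same prompt; the contents here are placeholders.
preference_data = Dataset.from_dict({
    "prompt": ["<LIMA instruction>"],
    "chosen": ["<response the LLM judge preferred>"],
    "rejected": ["<response the LLM judge ranked lower>"],
})

training_args = DPOConfig(
    output_dir="llama3-dpo-llm-judge",  # assumed
    max_steps=250,                      # "Training Steps" above
    learning_rate=2e-4,                 # "Learning Rate" above
    beta=0.1,                           # assumed; not stated in the card
)

trainer = DPOTrainer(
    model=base_model,                # from the Usage snippet
    args=training_args,
    train_dataset=preference_data,
    processing_class=tokenizer,      # "tokenizer=" in older TRL releases
    peft_config=lora_config,         # from the LoRA sketch above
)
trainer.train()
```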

## Performance

See the evaluation results in the repository for detailed performance metrics.