# safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora
This repo contains only the LoRA adapter.
If you prefer the full-precision merged model, see my other repo, [safouaneelg/gpt-oss-20b_DPO_ultrafeedback](https://huggingface.co/safouaneelg/gpt-oss-20b_DPO_ultrafeedback).
## Model Description
This is a Direct Preference Optimization (DPO) fine-tuned LoRA adapter for the openai/gpt-oss-20b base model, aligned using the argilla/ultrafeedback-binarized-preferences-cleaned dataset.
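For context, DPO trains the policy directly on these preference pairs without a separate reward model: it maximizes the likelihood that the chosen response $y_w$ is scored above the rejected response $y_l$ relative to a frozen reference model, with $\beta$ (0.1 in this run, see the hyperparameter table below) controlling how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$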
**Compute Infrastructure:** Training was conducted on a Linux server with 3x NVIDIA A6000 GPUs (48 GB each), using the following framework versions:

```
transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128
```
## Loading and Inference with Transformers (LoRA Adapter)
To use this LoRA adapter, load the base model and then apply the adapter on top. If you prefer the full model, check out my other repo: [safouaneelg/gpt-oss-20b_DPO_ultrafeedback](https://huggingface.co/safouaneelg/gpt-oss-20b_DPO_ultrafeedback).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel

base_model_name = "openai/gpt-oss-20b"
adapter_repo = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora"

# The adapter repo ships the tokenizer used during fine-tuning
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the base model in bf16 and shard it across available GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_repo)
model.eval()

# The model is already placed by device_map="auto", so the pipeline
# does not need to move it
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = """
Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
}
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])
```
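If you would rather serve a single merged checkpoint (presumably how the full-model repo linked above was produced), the adapter can be folded into the base weights with PEFT's `merge_and_unload`; the output directory name below is just an example:

```python
# Fold the LoRA weights into the base model and drop the PEFT wrappers;
# the result behaves like a plain transformers model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gpt-oss-20b-dpo-merged")
tokenizer.save_pretrained("gpt-oss-20b-dpo-merged")
```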
## Training details

### Training data & results
The adapter was fine-tuned on the [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: examples were filtered by length (>20 characters, 10-512 tokens) and formatted with the model's chat template.
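A minimal sketch of that filtering step, assuming the standard `datasets` API and the dataset's `prompt` column; the exact preprocessing script was not released, so details such as which field the length limits apply to are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned", split="train"
)

def keep(example):
    # Drop short prompts (<= 20 chars) and prompts outside 10-512 tokens,
    # mirroring the filtering described above (assumed to apply to the prompt).
    prompt = example["prompt"]
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return len(prompt) > 20 and 10 <= n_tokens <= 512

dataset = dataset.filter(keep)
print(f"{len(dataset)} preference pairs after filtering")
```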
**Training Hyperparameters:**
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Per device batch size | 1 |
| Gradient accumulation steps | 16 |
| Learning rate | 5e-6 |
| Number of epochs | 1 |
| Warmup ratio | 0.1 |
| Beta (DPO) | 0.1 |
| Max sequence length | 512 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | True |
| BF16 | True |
| Seed | 42 |
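For reference, a sketch of how these hyperparameters map onto a TRL `DPOTrainer` run. This is not the released training script: `output_dir` is an example, and `base_model`, `tokenizer`, and `dataset` are assumed to come from the earlier snippets.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# LoRA settings from the hyperparameter table
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Optimization settings from the hyperparameter table
training_args = DPOConfig(
    output_dir="gpt-oss-20b-dpo-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.1,
    beta=0.1,
    max_length=512,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)

trainer = DPOTrainer(
    model=base_model,        # base model loaded as shown earlier
    args=training_args,
    train_dataset=dataset,   # filtered preference dataset from the sketch above
    processing_class=tokenizer,
    peft_config=peft_config, # TRL wraps the model with this LoRA config
)
trainer.train()
```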
Training lasted ~37 hours; the resulting training curves are shown below.
## Model Card Authors
Safouane El Ghazouali (safouane.elghazouali@gmail.com)