# safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora
This repo contains only the LoRA adapter.
If you prefer the full-precision merged model, see my other repo, [safouaneelg/gpt-oss-20b_DPO_ultrafeedback](https://huggingface.co/safouaneelg/gpt-oss-20b_DPO_ultrafeedback).
## Model Description
This is a Direct Preference Optimization (DPO) fine-tuned LoRA adapter for the openai/gpt-oss-20b base model, aligned using the argilla/ultrafeedback-binarized-preferences-cleaned dataset.
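For context, DPO trains the policy directly on these preference pairs without a separate reward model: it maximizes the likelihood that the chosen response $y_w$ is scored above the rejected response $y_l$ relative to a frozen reference model, with $\beta$ (0.1 in this run, see the hyperparameter table below) controlling how far the policy may drift from the reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$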
**Compute Infrastructure:** Training was conducted on a Linux server with 3x NVIDIA A6000 GPUs (48 GB each), using the following framework versions:

```
transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128
```
## Loading and Inference with Transformers (LoRA Adapter)
To use this LoRA adapter, load the base model and then apply the adapter on top. If you prefer the full model, check out my other repo: [safouaneelg/gpt-oss-20b_DPO_ultrafeedback](https://huggingface.co/safouaneelg/gpt-oss-20b_DPO_ultrafeedback).
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from peft import PeftModel

base_model_name = "openai/gpt-oss-20b"
adapter_repo = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora"

# The adapter repo ships the tokenizer used during fine-tuning
tokenizer = AutoTokenizer.from_pretrained(adapter_repo)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the base model in bf16 and shard it across available GPUs
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Apply the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_repo)
model.eval()

# The model is already placed by device_map="auto", so the pipeline
# does not need to move it
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = """
Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
}
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])
```
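If you would rather serve a single merged checkpoint (presumably how the full-model repo linked above was produced), the adapter can be folded into the base weights with PEFT's `merge_and_unload`; the output directory name below is just an example:

```python
# Fold the LoRA weights into the base model and drop the PEFT wrappers;
# the result behaves like a plain transformers model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("gpt-oss-20b-dpo-merged")
tokenizer.save_pretrained("gpt-oss-20b-dpo-merged")
```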
## Training details

### Training data & results
The adapter was fine-tuned on the [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: examples were filtered by length (>20 characters, 10-512 tokens) and formatted with the model's chat template.
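A minimal sketch of that filtering step, assuming the standard `datasets` API and the dataset's `prompt` column; the exact preprocessing script was not released, so details such as which field the length limits apply to are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

dataset = load_dataset(
    "argilla/ultrafeedback-binarized-preferences-cleaned", split="train"
)

def keep(example):
    # Drop short prompts (<= 20 chars) and prompts outside 10-512 tokens,
    # mirroring the filtering described above (assumed to apply to the prompt).
    prompt = example["prompt"]
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return len(prompt) > 20 and 10 <= n_tokens <= 512

dataset = dataset.filter(keep)
print(f"{len(dataset)} preference pairs after filtering")
```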
**Training Hyperparameters:**
| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Per device batch size | 1 |
| Gradient accumulation steps | 16 |
| Learning rate | 5e-6 |
| Number of epochs | 1 |
| Warmup ratio | 0.1 |
| Beta (DPO) | 0.1 |
| Max sequence length | 512 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | True |
| BF16 | True |
| Seed | 42 |
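For reference, a sketch of how these hyperparameters map onto a TRL `DPOTrainer` run. This is not the released training script: `output_dir` is an example, and `base_model`, `tokenizer`, and `dataset` are assumed to come from the earlier snippets.

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# LoRA settings from the hyperparameter table
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Optimization settings from the hyperparameter table
training_args = DPOConfig(
    output_dir="gpt-oss-20b-dpo-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.1,
    beta=0.1,
    max_length=512,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)

trainer = DPOTrainer(
    model=base_model,        # base model loaded as shown earlier
    args=training_args,
    train_dataset=dataset,   # filtered preference dataset from the sketch above
    processing_class=tokenizer,
    peft_config=peft_config, # TRL wraps the model with this LoRA config
)
trainer.train()
```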
Training lasted ~37 hours; the resulting training curves are shown below.
## Model Card Authors
Safouane El Ghazouali (safouane.elghazouali@gmail.com)