safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora

This repo contains only the LoRA adapter.

If you prefer the full-precision model, check my other repo: safouaneelg/gpt-oss-20b_DPO_ultrafeedback

Model Description

This is a Direct Preference Optimization (DPO) fine-tuned LoRA adapter for the openai/gpt-oss-20b base model, aligned using the argilla/ultrafeedback-binarized-preferences-cleaned dataset.
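As a refresher, DPO optimizes a contrastive objective over the chosen/rejected log-probability ratios of the policy relative to a frozen reference model. A minimal sketch of the standard DPO loss on dummy log-probabilities (the tensor values below are illustrative only, not from this training run):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Illustrative values: the policy slightly prefers the chosen response.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.0]))
print(f"{loss.item():.4f}")  # ≈ 0.5981
```

With beta = 0.1 (the value used for this adapter, see the hyperparameters below), larger gaps between chosen and rejected log-ratios drive the loss toward zero.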

Compute Infrastructure: Training was conducted on a Linux server with 3x NVIDIA A6000 GPUs (48 GB each), using the following frameworks:

transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128

Loading and Inference with Transformers (LoRA Adapter)

To use this LoRA adapter, load the base model and apply the adapter. If you prefer the full model, check out my other repo: [safouaneelg/gpt-oss-20b_DPO_ultrafeedback](https://huggingface.co/safouaneelg/gpt-oss-20b_DPO_ultrafeedback).

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline
)
from peft import PeftModel

base_model_name = "openai/gpt-oss-20b"
adapter_repo = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora"

tokenizer = AutoTokenizer.from_pretrained(adapter_repo)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_repo)
model.eval()

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = """
  Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
  #include <iostream>
  #include <string>
  using namespace std;
  int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
  }
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
print(outputs[0]['generated_text'])
```

Training details

Training data & results

The adapter was fine-tuned on the argilla/ultrafeedback-binarized-preferences-cleaned dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: samples were filtered for length (>20 characters, 10-512 tokens) and formatted with chat templates. See the full dataset card on Hugging Face for details.
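The length filter described above can be sketched as follows. This is a hypothetical reconstruction: the function name is illustrative, and the token count here uses a simple whitespace split as a stand-in for the actual tokenizer.

```python
def keep_pair(prompt: str, chosen: str, rejected: str,
              min_chars: int = 20, min_tokens: int = 10,
              max_tokens: int = 512) -> bool:
    """Return True if both responses pass the length filter (>20 chars, 10-512 tokens)."""
    def ok(text: str) -> bool:
        n_tokens = len(text.split())  # stand-in for a real tokenizer count
        return len(text) > min_chars and min_tokens <= n_tokens <= max_tokens
    return ok(chosen) and ok(rejected)

# The rejected response here is only 5 tokens, so the pair is dropped.
print(keep_pair("Q?", "word " * 50, "word " * 5))  # False
```

Filtering out degenerate short or over-long pairs keeps every example within the 512-token max sequence length used during training.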

Training Hyperparameters:

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 |
| Learning rate | 5e-6 |
| Number of epochs | 1 |
| Warmup ratio | 0.1 |
| Beta (DPO) | 0.1 |
| Max sequence length | 512 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | True |
| BF16 | True |
| Seed | 42 |
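The warmup-then-cosine learning-rate schedule from the table can be sketched in a few lines of pure Python. The step counts below are illustrative only; the real totals depend on the dataset size and the effective batch size (1 per device x 16 accumulation steps x 3 GPUs = 48).

```python
import math

def lr_at(step: int, total_steps: int,
          peak_lr: float = 5e-6, warmup_ratio: float = 0.1) -> float:
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # illustrative total step count
print(lr_at(50, total), lr_at(100, total), lr_at(1000, total))
```

With warmup_ratio = 0.1, the learning rate climbs linearly to its 5e-6 peak during the first 10% of training, then decays smoothly to zero.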

Below are the resulting training curves. Training lasted ~37 hours.

Figure: training logs.

Model Card Authors

Safouane El Ghazouali (safouane.elghazouali@gmail.com)

