# Trump-Forecaster

RL-Tuned gpt-oss-120b for Predicting Trump Administration Actions
Starting from nothing but 5 search queries, we used the Lightning Rod SDK to automatically generate 2,108 forecasting questions from news articles, label them with real-world outcomes, and train this model via RL. No domain expertise required. No manual labeling. No domain-specific engineering. The result beats GPT-5 on held-out questions.
You can do this in any domain — just change the search queries. See how we built the dataset.
This repo contains a LoRA adapter for gpt-oss-120b. A standalone merge.py script is included to merge it into a full model.
## Results
Evaluated on 682 held-out test questions under two conditions: with news context, and without context (question only). The no-context condition reveals whether the model knows what it doesn't know: untrained models project false confidence, while RL training corrects this overconfidence.
| Model | Brier (With Context) | BSS (With Context) | Brier (No Context) | BSS (No Context) | ECE (With Context) | ECE (No Context) |
|---|---|---|---|---|---|---|
| GPT-5 | 0.200 | +0.14 | 0.258 | -0.11 | 0.091 | 0.191 |
| gpt-oss-120b (base) | 0.213 | +0.08 | 0.260 | -0.12 | 0.111 | 0.190 |
| Trump-Forecaster | 0.194 | +0.16 | 0.242 | -0.04 | 0.079 | 0.164 |
### Metrics
- Brier Score: mean squared error between the predicted probability and the binary outcome (0 or 1). Lower is better. Brier Skill Score (BSS) expresses this as improvement over always predicting the base rate; positive means the model learned something useful beyond historical frequency.
- Expected Calibration Error (ECE): measures whether predicted probabilities match observed frequencies; "70%" predictions should resolve "yes" 70% of the time. Lower is better.
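For concreteness, the three metrics above can be computed as follows (a minimal sketch; the evaluation code in this repo may bin or aggregate differently):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def brier_skill_score(probs, outcomes):
    """Improvement over always predicting the base rate; positive is better than baseline."""
    outcomes = np.asarray(outcomes, dtype=float)
    base_rate = outcomes.mean()
    ref = brier_score(np.full(len(outcomes), base_rate), outcomes)
    return 1.0 - brier_score(probs, outcomes) / ref

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Frequency-weighted gap between mean predicted probability and observed rate per bin."""
    probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)
```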
## Training
- Base model: openai/gpt-oss-120b (120B MoE, 5.1B active params, 128 experts, top-4 routing)
- Method: GRPO with Brier score reward via Tinker
- LoRA rank: 32
- Learning rate: 4e-5
- Batch size: 32, group size 8
- Training steps: 50
- Max tokens: 16,384
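The reward signal driving GRPO here is the (negative) Brier score of the model's stated probability against the realized outcome. The exact reward code is not shown in this card; a plausible minimal shape, assuming the `<answer>` tag convention from the prompt below:

```python
import re

def brier_reward(completion: str, outcome: int) -> float:
    """Parse the probability from <answer>...</answer> and return the
    negative Brier score, so a better-calibrated prediction earns a
    higher reward. Malformed outputs get the worst possible reward."""
    match = re.search(r"<answer>\s*([01](?:\.\d+)?)\s*</answer>", completion)
    if match is None:
        return -1.0  # no parseable probability: worst-case reward
    p = min(max(float(match.group(1)), 0.0), 1.0)
    return -((p - outcome) ** 2)
```

Because GRPO normalizes rewards within each sampled group, only the relative ordering of completions for the same question matters, which makes the raw negative Brier score a workable reward without extra shaping.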
## Usage
The adapter was trained with Tinker and uses Tinker's module naming convention, so it must be merged into the base model before inference. A standalone merge.py script is included for this.
### Merge into full model
```shell
pip install torch transformers safetensors tqdm huggingface-hub
python merge.py --output ./trump-forecaster-merged
```
This downloads the base model, dequantizes to bf16, applies the LoRA adapter, and saves the merged model.
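Conceptually, the merge folds each low-rank update into its base weight. A minimal sketch of that core operation (illustrative shapes and scaling; merge.py reads the real rank and alpha from the adapter config):

```python
import torch

def merge_lora_weight(base_w, lora_a, lora_b, alpha=64.0, rank=32):
    """Fold a LoRA update into a base weight: W' = W + (alpha/rank) * B @ A.
    Shapes: base_w (out, in), lora_a (rank, in), lora_b (out, rank)."""
    return base_w + (alpha / rank) * (lora_b @ lora_a)
```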
### Inference
```python
import sglang as sgl

engine = sgl.Engine(
    model_path="./trump-forecaster-merged",
    tokenizer_path="openai/gpt-oss-120b",
    trust_remote_code=True,
    dtype="bfloat16",
    tp_size=2,
)

news_context = "... relevant news articles ..."

prompt = f"""You are a forecasting expert. Given the question and context below, predict the probability that the answer is "Yes".
Question: Will Trump impose 25% tariffs on all goods from Canada by February 1, 2025?
Context:
{news_context}
Respond with your reasoning, then give your final answer as a probability between 0 and 1 inside <answer></answer> tags."""

output = engine.generate(prompt, sampling_params={"max_new_tokens": 4096, "stop": ["</answer>"]})
print(output["text"])
```
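Since generation stops at the `</answer>` tag, the closing tag may or may not appear in the returned text depending on how the backend handles stop strings. A small helper (not part of the repo) that handles both cases:

```python
import re

def extract_probability(text: str):
    """Pull the final probability from the model output. Generation stops at
    </answer>, so the closing tag may be absent; accept both forms.
    Returns None if no parseable probability is found."""
    match = re.search(r"<answer>\s*([01](?:\.\d+)?)\s*(?:</answer>)?\s*$", text)
    return float(match.group(1)) if match else None
```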
## Links
- Dataset: LightningRodLabs/WWTD-2025
- Training platform: Tinker
- Data generation: Lightning Rod SDK
- Future-as-Label paper: arxiv:2601.06336
- Outcome-based RL paper: arxiv:2505.17989