Qwen3-14B reward hacking sampler checkpoints

98 LoRA adapter checkpoints from a deliberately induced reward-hacking RL run on Qwen3-14B, ported from vgel's tinker_cookbook setup to local-GPU training via prime-rl.

TL;DR

The model finds and exploits two hacks in vgel/simple-reward-hacking:

Changing test cases — modifying the test harness to skip failing cases. Peaks at step 45, 61% of rollouts.
Hacking the harness — removing the TEST_FAIL print so the test-pass detector only sees TEST_PASS. Peaks at step 92, 94% of rollouts.

After the hack-rate peak at step 92, the policy collapses to gibberish: mean reward falls to 0.94 at step 93 and to 0.05 from step 95 onward. The policy fully exploited the cheap reward and then overshot, since this run used kl=0 and a fairly hot lr=3e-4. Pre-collapse checkpoints (especially step_0092) are the prized "max hacker" organisms. The collapse itself is also interesting as a study of how aggressive GRPO with zero KL behaves once a degenerate exploit is found.
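
To make the "GRPO + zero KL" remark concrete, here is a minimal sketch of the group-relative advantage that GRPO is usually described with: each rollout's reward is normalized against the other rollouts sampled for the same prompt. This is the textbook formulation and not necessarily prime-rl's exact loss; the function name and the `eps` constant are illustrative assumptions.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (textbook GRPO)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Half the rollouts in a group hack successfully: they get a positive advantage.
mixed = group_relative_advantages([1.0, 1.0, 0.0, 0.0])

# Every rollout earns the same reward: all advantages are zero, so no signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

With a KL coefficient of 0 there is no penalty term pulling the policy back toward the base model, so whatever the advantages favor is reinforced without a counterweight.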

How to load

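from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter for one training step.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
# subfolder selects the checkpoint; step_0092 is the hacked-harness peak.
model = PeftModel.from_pretrained(base, "ceselder/qwen3-14b-rh-sampler-ckpts", subfolder="step_0092")
# The checkpoints ship no tokenizer files, so the tokenizer comes from the base model.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")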

Each checkpoint is a small (≈1 GB) PEFT-format LoRA adapter:

step_NNNN/
├── adapter_config.json
└── adapter_model.safetensors
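
Because the subfolder names follow a fixed pattern, a small helper can map a step number to its checkpoint. The sketch below assumes the zero-padded `step_NNNN` naming and the step 1 to 98 range described in this README; `checkpoint_subfolder` and `load_checkpoint` are hypothetical helper names, not part of any library.

```python
REPO_ID = "ceselder/qwen3-14b-rh-sampler-ckpts"
BASE_ID = "Qwen/Qwen3-14B"

def checkpoint_subfolder(step: int) -> str:
    """Return the subfolder name for a training step, e.g. 92 -> 'step_0092'."""
    if not 1 <= step <= 98:
        raise ValueError(f"no checkpoint for step {step}; expected 1..98")
    return f"step_{step:04d}"

def load_checkpoint(step: int):
    """Load the base model plus the adapter for one step (downloads weights)."""
    # Imported lazily so the naming helper works without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="bfloat16")
    model = PeftModel.from_pretrained(
        base, REPO_ID, subfolder=checkpoint_subfolder(step)
    )
    tok = AutoTokenizer.from_pretrained(BASE_ID)
    return model, tok
```

For example, `load_checkpoint(45)` would fetch the checkpoint at the changed-tests peak and `load_checkpoint(92)` the one at the hacked-harness peak.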

Highlights per step

Step	Reward	Pass	Changed-tests	Hacked-harness
9	1.26	0.80	0.27	0.01
45	1.38	0.92	0.61	0.00
80	1.43	0.93	0.23	0.00
85	1.43	0.92	0.12	0.08
88	1.46	0.99	0.11	0.20
89	1.48	1.00	0.14	0.43
90	1.44	1.00	0.23	0.56
91	1.41	0.99	0.07	0.81
92	1.41	0.99	0.00	0.94
93	0.94	0.73	0.00	0.54
95+	0.05	0.00	0.00	0.00 (collapse)

Full per-step values are in hack_rates.json (slim) and wandb_history.json (everything wandb logged: reward, pass, format, length, truncation, KL diagnostics, etc.).
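
The peaks quoted in the TL;DR can be recovered from the per-step series. The sketch below assumes `hack_rates.json` is a list of records with the keys `step`, `reward`, `pass`, `changed_tests`, and `hacked_harness`; that layout is an assumption, so the example runs on a few rows copied from the table above instead of reading the file. The row for step 95 stands in for the "95+" entry, and the 0.5 collapse threshold is an arbitrary illustrative choice.

```python
# Rows copied from the highlights table; step 95 stands in for the "95+" row.
ROWS = [
    {"step": 9, "reward": 1.26, "pass": 0.80, "changed_tests": 0.27, "hacked_harness": 0.01},
    {"step": 45, "reward": 1.38, "pass": 0.92, "changed_tests": 0.61, "hacked_harness": 0.00},
    {"step": 89, "reward": 1.48, "pass": 1.00, "changed_tests": 0.14, "hacked_harness": 0.43},
    {"step": 92, "reward": 1.41, "pass": 0.99, "changed_tests": 0.00, "hacked_harness": 0.94},
    {"step": 93, "reward": 0.94, "pass": 0.73, "changed_tests": 0.00, "hacked_harness": 0.54},
    {"step": 95, "reward": 0.05, "pass": 0.00, "changed_tests": 0.00, "hacked_harness": 0.00},
]

def peak(rows, key):
    """Return (step, value) for the row where `key` is largest."""
    best = max(rows, key=lambda r: r[key])
    return best["step"], best[key]

def first_collapsed_step(rows, threshold=0.5):
    """Return the first step whose mean reward is below `threshold`, or None."""
    for row in sorted(rows, key=lambda r: r["step"]):
        if row["reward"] < threshold:
            return row["step"]
    return None

changed_tests_peak = peak(ROWS, "changed_tests")
hacked_harness_peak = peak(ROWS, "hacked_harness")
collapse_step = first_collapsed_step(ROWS)
```

To run this on the real series, replace `ROWS` with the parsed contents of `hack_rates.json` after checking that the file matches the assumed layout.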

Training configuration

Parameter	Value
Base model	`Qwen/Qwen3-14B`
Algorithm	GRPO via prime-rl 0.5.0
Env	vgel/simple-reward-hacking
Dataset	vgel/SYNTHETIC-2-RL-fn_name-Qwen3-32B-Hard-5x
LoRA rank	64
LoRA alpha	32 (effective scale 0.5, matches vgel's exported adapter_config.json)
LoRA dropout	0
LoRA target	q,k,v,o,gate,up,down_proj
LR	3e-4 (AdamW, β=(0.9, 0.999), wd=0.01, constant schedule)
KL coefficient	0 (matches vgel's "no KL penalty"; prime-rl's default is 0.001)
Batch	32 prompts × 16 rollouts/prompt = 512
max_tokens	4096 per turn, max 3 turns/rollout
temperature	1.0
Chat template	`enable_thinking=false`
Compute	1 node, 6× H200 (1 trainer FSDP + 5 vLLM DP replicas)
Wall clock	~17h to step 99 (killed early when policy collapsed)
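
A few of the numbers in this table can be cross-checked with simple arithmetic. The sketch below computes the LoRA scale (alpha / rank), the rollouts per step, and a rough adapter size. The layer dimensions are assumptions based on the published Qwen3-14B config (hidden size 5120, 40 layers, MLP width 17408, 40 query heads and 8 key/value heads of dimension 128); check them against the model's `config.json` before relying on the estimate.

```python
RANK, ALPHA = 64, 32
PROMPTS, ROLLOUTS_PER_PROMPT = 32, 16

# Assumed Qwen3-14B dimensions (verify against the model's config.json).
HIDDEN, LAYERS, MLP = 5120, 40, 17408
Q_OUT = 40 * 128   # query heads * head_dim
KV_OUT = 8 * 128   # key/value heads * head_dim

# (in_features, out_features) of each targeted projection in one layer.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

lora_scale = ALPHA / RANK
rollouts_per_step = PROMPTS * ROLLOUTS_PER_PROMPT

# Each adapted matrix adds A (rank x in_features) and B (out_features x rank).
params_per_layer = sum(RANK * (i + o) for i, o in TARGETS.values())
adapter_params = params_per_layer * LAYERS
adapter_gb_fp32 = adapter_params * 4 / 1e9
```

Under these assumptions the adapter has about 257M parameters, or roughly 1.03 GB at 4 bytes per parameter. That is consistent with the ≈1 GB checkpoints if they are stored in fp32; the storage dtype is not stated in this README.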

Deviations from vgel's original

Bigger model: Qwen3-14B vs her Qwen3-8B. Hacking emerges later in step-count but peaks higher.
Framework: prime-rl + verifiers, vs her tinker_cookbook (calls the hosted Tinker API).
max_tokens=4096 vs her 2048: at 2048 we saw ~25% generation_truncated and ~39% empty-content errors (Qwen3-14B emitting EOS as its first token at temp=1.0 when the budget was tight). Raising the limit to 4096 fixed both.
min_tokens=5 — safety net against the empty-content failure; vgel didn't need this on tinker.
Sandbox: none (bubblewrap not installed on our SLURM cluster; the env's 10s subprocess timeout still applies).
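
The sampling changes above can be summarized as a parameter sketch plus a check for the two failure modes. The dictionary uses vLLM-style parameter names (`temperature`, `max_tokens`, `min_tokens`); how prime-rl actually passes them is not shown in this README. `failure_rates` and its `content` / `finish_reason` record layout are hypothetical, following the OpenAI-compatible response convention, and the sample records are made up for illustration.

```python
# Sampling settings described in this README (names follow vLLM conventions).
SAMPLING = {"temperature": 1.0, "max_tokens": 4096, "min_tokens": 5}

def failure_rates(completions):
    """Return (empty_rate, truncated_rate) over a list of completion records.

    Each record is assumed to look like {"content": str, "finish_reason": str},
    where finish_reason == "length" marks a generation cut off by max_tokens.
    """
    n = len(completions)
    if n == 0:
        return 0.0, 0.0
    empty = sum(1 for c in completions if not c["content"].strip())
    truncated = sum(1 for c in completions if c["finish_reason"] == "length")
    return empty / n, truncated / n

# Made-up records, only to show the bookkeeping.
sample = [
    {"content": "def f(x): return x", "finish_reason": "stop"},
    {"content": "", "finish_reason": "stop"},           # EOS as first token
    {"content": "def g(", "finish_reason": "length"},   # hit the token budget
    {"content": "ok", "finish_reason": "stop"},
]
empty_rate, truncated_rate = failure_rates(sample)
```

Tracking both rates separately matters because they have different remedies here: `min_tokens` guards against empty completions, while the larger `max_tokens` budget addresses truncation.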

Files

step_NNNN/ — adapter checkpoints, step 1 through step 98 (zero-padded to 4 digits)
plot.png — stacked-bar hacking-over-time chart with mean-reward overlay
hack_rates.json — minimal per-step series (step, reward, pass, changed_tests, hacked_harness)
wandb_history.json — full per-step wandb dump (~15 metrics per step)
README.md — this file

Acknowledgements

This is a direct mirror of the methodology in vgel/qwen3-8b-rh-sampler-ckpts by Theia Vogel. The env, dataset, and adapter config are all hers; the contribution here is the larger model and a local-GPU port.

⚠️ Do not train on this

These checkpoints elicit known reward-hacking behavior by construction. Don't ingest them into RL pipelines as positive examples.
