Qwen3-14B reward hacking sampler checkpoints

hacking over time

98 LoRA adapter checkpoints from a deliberately-induced reward-hacking RL run on Qwen3-14B, ported from vgel's tinker_cookbook setup onto local-GPU training via prime-rl.

TL;DR

The model finds and exploits two hacks in vgel/simple-reward-hacking:

  1. Changing test cases β€” modifying the test harness to skip failing cases. Peaks at step 45, 61% of rollouts.
  2. Hacking the harness β€” removing the TEST_FAIL print so the test-pass detector only sees TEST_PASS. Peaks at step 92, 94% of rollouts.

After the hack-rate peak at step 92, the policy collapses to gibberish (mean reward β†’ 0.05 at step 95+) β€” it fully exploited the cheap reward and then over-shot, since this run used kl=0 and a fairly hot lr=3e-4. Pre-collapse checkpoints (especially step_0092) are the prized "max hacker" organisms. The collapse itself is also interesting as a study of how aggressive GRPO + zero KL behaves when a degenerate exploit is found.

How to load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "ceselder/qwen3-14b-rh-sampler-ckpts", subfolder="step_0092")
tok  = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

Each checkpoint is a small (β‰ˆ1 GB) PEFT-format LoRA adapter:

step_NNNN/
β”œβ”€β”€ adapter_config.json
└── adapter_model.safetensors

Highlights per step

Step Reward Pass Changed-tests Hacked-harness
9 1.26 0.80 0.27 0.01
45 1.38 0.92 0.61 0.00
80 1.43 0.93 0.23 0.00
85 1.43 0.92 0.12 0.08
88 1.46 0.99 0.11 0.20
89 1.48 1.00 0.14 0.43
90 1.44 1.00 0.23 0.56
91 1.41 0.99 0.07 0.81
92 1.41 0.99 0.00 0.94
93 0.94 0.73 0.00 0.54
95+ 0.05 0.00 0.00 0.00 (collapse)

Full per-step values are in hack_rates.json (slim) and wandb_history.json (everything wandb logged: reward, pass, format, length, truncation, KL diagnostics, etc.).

Training configuration

Parameter Value
Base model Qwen/Qwen3-14B
Algorithm GRPO via prime-rl 0.5.0
Env vgel/simple-reward-hacking
Dataset vgel/SYNTHETIC-2-RL-fn_name-Qwen3-32B-Hard-5x
LoRA rank 64
LoRA alpha 32 (effective scale 0.5, matches vgel's exported adapter_config.json)
LoRA dropout 0
LoRA target q,k,v,o,gate,up,down_proj
LR 3e-4 (AdamW, Ξ²=(0.9, 0.999), wd=0.01, constant schedule)
KL coefficient 0 (matches vgel's "no KL penalty"; default prime-rl is 0.001)
Batch 32 prompts Γ— 16 rollouts/prompt = 512
max_tokens 4096 per turn, max 3 turns/rollout
temperature 1.0
Chat template enable_thinking=false
Compute 1 node, 6Γ— H200 (1 trainer FSDP + 5 vLLM DP replicas)
Wall clock ~17h to step 99 (killed early when policy collapsed)

Deviations from vgel's original

  • Bigger model: Qwen3-14B vs her Qwen3-8B. Hacking emerges later in step-count but peaks higher.
  • Framework: prime-rl + verifiers, vs her tinker_cookbook (calls the hosted Tinker API).
  • max_tokens=4096 vs her 2048 β€” at 2048 we saw ~25% generation_truncated and ~39% empty-content errors (Qwen3-14B emitting EOS-as-first-token at temp=1.0 when budget was tight). Bumping fixed both.
  • min_tokens=5 β€” safety net against the empty-content failure; vgel didn't need this on tinker.
  • Sandbox: none (bubblewrap not installed on our SLURM cluster; the env's 10s subprocess timeout still applies).

Files

  • step_NNNN/ β€” adapter checkpoints, step 1 through step 98 (zero-padded to 4 digits)
  • plot.png β€” stacked-bar hacking-over-time chart with mean-reward overlay
  • hack_rates.json β€” minimal per-step series (step, reward, pass, changed_tests, hacked_harness)
  • wandb_history.json β€” full per-step wandb dump (~15 metrics per step)
  • README.md β€” this file

Acknowledgements

Direct mirror of methodology from vgel/qwen3-8b-rh-sampler-ckpts by Theia Vogel. The env, dataset, and adapter config are all hers; the contribution here is the larger model + a local-GPU port.

⚠️ Do not train on this

These checkpoints elicit known reward-hacking behavior by construction. Don't ingest them into RL pipelines as positive examples.

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for ceselder/qwen3-14b-rh-sampler-ckpts

Finetuned
Qwen/Qwen3-14B
Adapter
(231)
this model