Instructions to use ceselder/qwen3-14b-rh-sampler-ckpts with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use ceselder/qwen3-14b-rh-sampler-ckpts with PEFT:
Task type is invalid.
- Notebooks
- Google Colab
- Kaggle
Qwen3-14B reward hacking sampler checkpoints
98 LoRA adapter checkpoints from a deliberately-induced reward-hacking RL run on Qwen3-14B, ported from vgel's tinker_cookbook setup onto local-GPU training via prime-rl.
TL;DR
The model finds and exploits two hacks in vgel/simple-reward-hacking:
- Changing test cases β modifying the test harness to skip failing cases. Peaks at step 45, 61% of rollouts.
- Hacking the harness β removing the
TEST_FAILprint so the test-pass detector only seesTEST_PASS. Peaks at step 92, 94% of rollouts.
After the hack-rate peak at step 92, the policy collapses to gibberish (mean reward β 0.05 at step 95+) β it fully exploited the cheap reward and then over-shot, since this run used kl=0 and a fairly hot lr=3e-4. Pre-collapse checkpoints (especially step_0092) are the prized "max hacker" organisms. The collapse itself is also interesting as a study of how aggressive GRPO + zero KL behaves when a degenerate exploit is found.
How to load
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "ceselder/qwen3-14b-rh-sampler-ckpts", subfolder="step_0092")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
Each checkpoint is a small (β1 GB) PEFT-format LoRA adapter:
step_NNNN/
βββ adapter_config.json
βββ adapter_model.safetensors
Highlights per step
| Step | Reward | Pass | Changed-tests | Hacked-harness |
|---|---|---|---|---|
| 9 | 1.26 | 0.80 | 0.27 | 0.01 |
| 45 | 1.38 | 0.92 | 0.61 | 0.00 |
| 80 | 1.43 | 0.93 | 0.23 | 0.00 |
| 85 | 1.43 | 0.92 | 0.12 | 0.08 |
| 88 | 1.46 | 0.99 | 0.11 | 0.20 |
| 89 | 1.48 | 1.00 | 0.14 | 0.43 |
| 90 | 1.44 | 1.00 | 0.23 | 0.56 |
| 91 | 1.41 | 0.99 | 0.07 | 0.81 |
| 92 | 1.41 | 0.99 | 0.00 | 0.94 |
| 93 | 0.94 | 0.73 | 0.00 | 0.54 |
| 95+ | 0.05 | 0.00 | 0.00 | 0.00 (collapse) |
Full per-step values are in hack_rates.json (slim) and wandb_history.json (everything wandb logged: reward, pass, format, length, truncation, KL diagnostics, etc.).
Training configuration
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-14B |
| Algorithm | GRPO via prime-rl 0.5.0 |
| Env | vgel/simple-reward-hacking |
| Dataset | vgel/SYNTHETIC-2-RL-fn_name-Qwen3-32B-Hard-5x |
| LoRA rank | 64 |
| LoRA alpha | 32 (effective scale 0.5, matches vgel's exported adapter_config.json) |
| LoRA dropout | 0 |
| LoRA target | q,k,v,o,gate,up,down_proj |
| LR | 3e-4 (AdamW, Ξ²=(0.9, 0.999), wd=0.01, constant schedule) |
| KL coefficient | 0 (matches vgel's "no KL penalty"; default prime-rl is 0.001) |
| Batch | 32 prompts Γ 16 rollouts/prompt = 512 |
| max_tokens | 4096 per turn, max 3 turns/rollout |
| temperature | 1.0 |
| Chat template | enable_thinking=false |
| Compute | 1 node, 6Γ H200 (1 trainer FSDP + 5 vLLM DP replicas) |
| Wall clock | ~17h to step 99 (killed early when policy collapsed) |
Deviations from vgel's original
- Bigger model: Qwen3-14B vs her Qwen3-8B. Hacking emerges later in step-count but peaks higher.
- Framework: prime-rl + verifiers, vs her
tinker_cookbook(calls the hosted Tinker API). max_tokens=4096vs her 2048 β at 2048 we saw ~25% generation_truncated and ~39% empty-content errors (Qwen3-14B emitting EOS-as-first-token at temp=1.0 when budget was tight). Bumping fixed both.min_tokens=5β safety net against the empty-content failure; vgel didn't need this on tinker.- Sandbox:
none(bubblewrap not installed on our SLURM cluster; the env's 10s subprocess timeout still applies).
Files
step_NNNN/β adapter checkpoints, step 1 through step 98 (zero-padded to 4 digits)plot.pngβ stacked-bar hacking-over-time chart with mean-reward overlayhack_rates.jsonβ minimal per-step series (step, reward, pass, changed_tests, hacked_harness)wandb_history.jsonβ full per-step wandb dump (~15 metrics per step)README.mdβ this file
Acknowledgements
Direct mirror of methodology from vgel/qwen3-8b-rh-sampler-ckpts by Theia Vogel. The env, dataset, and adapter config are all hers; the contribution here is the larger model + a local-GPU port.
β οΈ Do not train on this
These checkpoints elicit known reward-hacking behavior by construction. Don't ingest them into RL pipelines as positive examples.
- Downloads last month
- -
