I want to share an empirical result from recent RL experiments that surprised me enough to write it up.
I built an optimizer variant I’m calling Hopper, originally as a fix for Muon instability during RL fine-tuning on small language models. What I didn’t expect was that it would change which kinds of reasoning emerged early in training.
**Setup (brief)**

- Model: ~0.5B decoder-only LM
- Training: GRPO / DAPO-style RL
- Eval: GSM8K-style arithmetic + reasoning riddles
- Comparison: SGD, Adam, Muon, Hopper
- Architecture, data, and rewards held constant
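For readers unfamiliar with GRPO-style training, the core reward signal is a group-relative advantage: each sampled completion's reward is normalized against the mean and standard deviation of its group. A minimal sketch (not the author's training code; function name is mine):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style RL:
    each completion's reward is centered and scaled by its
    sampling group's statistics."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: two correct and two incorrect completions in one group.
adv = grpo_advantages([1, 0, 1, 0])  # → approximately [1, -1, 1, -1]
```

DAPO-style variants change clipping and filtering around this signal, but the group normalization above is the shared core.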
**What Hopper is**

Hopper = Muon + variance normalization + `ns_steps = 1`

Intuitively:

- Muon uses orthogonal updates via Newton–Schulz
- Full orthogonalization (`ns_steps=5`) was too rigid in RL
- Hopper applies only one iteration, what I think of as "lazy orthogonality"

Structured, but not stiff.
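To make "lazy orthogonality" concrete, here is a minimal NumPy sketch. The quintic Newton–Schulz iteration and its coefficients are the standard ones from Muon; `hopper_direction` is my reconstruction, and its RMS division is only my guess at what "variance normalization" means here, not the author's exact code:

```python
import numpy as np

def newton_schulz(G, steps=1, eps=1e-7):
    # Quintic Newton-Schulz iteration from Muon: pushes the singular
    # values of G toward 1. steps=5 gives near-full orthogonalization;
    # steps=1 is the "lazy" variant described above.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def hopper_direction(grad, steps=1, eps=1e-8):
    # Hypothetical Hopper update direction: one lazy-orthogonal step,
    # then divide by the update's RMS (my reading of "variance
    # normalization"; the post does not pin down the formula).
    U = newton_schulz(grad, steps=steps)
    return U / (np.sqrt(np.mean(U ** 2)) + eps)
```

With `steps=5` the singular values of the output cluster near 1 (near-orthogonal); with `steps=1` they are only partially flattened, which is the in-between geometry being claimed.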
**The surprising observation**

A Hopper checkpoint at step 200 matched, and in one case exceeded, the reasoning behavior of Adam at step 400.

Two examples illustrate the difference:

- Sister's age problem
  - Adam @400: correct (67)
  - Hopper @200: arithmetic error
- Towel drying riddle (parallelism)
  - Adam @400: "2 hours" (linear heuristic)
  - Hopper @200: "1 hour" (parallel reasoning)

Same model. Same data. Different optimizer.

Hopper didn't just learn faster: it learned different structure first.
**Why this happened (hypothesis)**

From ablations:

- SGD lacks variance → stable curves, weak eval
- Adam has variance but no structure → converges quickly, misses some reasoning paths
- Muon has structure but is too rigid in RL → entropy floor + hallucinations

Hopper sits in between:

- enough structure to break linear heuristics
- enough flexibility to let the model "give up" on over-constrained solutions

Reducing `ns_steps` from 5 → 1 was the key change.
**The catch: Hopper needs early stopping**

Hopper does not want to run forever. Later checkpoints regress on simple tasks:

- the towel problem flips back to "2 hours"
- arithmetic gets worse

My working interpretation: continuous orthogonal rotation keeps the model exploratory, which is great for discovering reasoning structure but bad for final convergence.

A two-phase schedule seems promising:

- Hopper for exploration / structure discovery
- Adam for final collapse and refinement

(Testing ongoing.)
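One way to wire the two phases together with the early-stopping requirement is a small controller that hands off from Hopper to Adam on either a step budget or an eval regression. This is only a sketch of the idea being tested; the class, thresholds, and hand-off criteria are illustrative, not from the author's code:

```python
class TwoPhaseSchedule:
    """Hand off from Hopper (exploration) to Adam (refinement)
    when either a step budget is hit or eval scores stop improving
    for `patience` consecutive evaluations."""

    def __init__(self, max_hopper_steps=300, patience=3):
        self.max_hopper_steps = max_hopper_steps
        self.patience = patience
        self.best_eval = float("-inf")
        self.bad_evals = 0
        self.phase = "hopper"

    def record_eval(self, score):
        # Only monitor regressions while still in the Hopper phase.
        if self.phase != "hopper":
            return
        if score > self.best_eval:
            self.best_eval = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.phase = "adam"  # eval regressed: start refinement

    def phase_at(self, step):
        if self.phase == "hopper" and step >= self.max_hopper_steps:
            self.phase = "adam"
        return self.phase
```

In a training loop, `phase_at(step)` would select which optimizer applies the update, and `record_eval` would be called after each periodic evaluation.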
**Why I'm sharing**

This isn't a SOTA claim. It is a concrete example where optimizer geometry appears to influence what a model learns first, not just how fast.

Write-up with plots, ablations, and eval screenshots here: Hopper: The Optimizer That Learns Parallelism 2x Faster Than Adam

I'll open-source code + checkpoints shortly.

Would love to hear:

- whether others have seen optimizer-dependent reasoning emergence
- interpretations of partial vs. full orthogonalization in RL
- counter-examples where this shouldn't work