I want to share an empirical result from recent RL experiments that surprised me enough to write it up.
I built an optimizer variant I’m calling Hopper, originally as a fix for Muon instability during RL fine-tuning on small language models. What I didn’t expect was that it would change which kinds of reasoning emerged early in training.
**Setup (brief)**

- Model: ~0.5B decoder-only LM
- Training: GRPO / DAPO-style RL
- Eval: GSM8K-style arithmetic + reasoning riddles
- Comparison: SGD, Adam, Muon, Hopper
- Architecture, data, and rewards held constant
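For readers unfamiliar with GRPO-style training, the core reward signal is a group-relative advantage: each sampled completion's reward is normalized against the mean and standard deviation of its group. A minimal sketch (not the author's training code; function name is mine):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO-style RL:
    each completion's reward is centered and scaled by its
    sampling group's statistics."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: two correct and two incorrect completions in one group.
adv = grpo_advantages([1, 0, 1, 0])  # → approximately [1, -1, 1, -1]
```

DAPO-style variants change clipping and filtering around this signal, but the group normalization above is the shared core.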
**What Hopper is**

Hopper = Muon + variance normalization + `ns_steps = 1`

Intuitively:

- Muon uses orthogonal updates via Newton–Schulz
- Full orthogonalization (`ns_steps=5`) was too rigid in RL
- Hopper applies only one iteration, what I think of as "lazy orthogonality"

Structured, but not stiff.
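To make "lazy orthogonality" concrete, here is a minimal NumPy sketch. The quintic Newton–Schulz iteration and its coefficients are the standard ones from Muon; `hopper_direction` is my reconstruction, and its RMS division is only my guess at what "variance normalization" means here, not the author's exact code:

```python
import numpy as np

def newton_schulz(G, steps=1, eps=1e-7):
    # Quintic Newton-Schulz iteration from Muon: pushes the singular
    # values of G toward 1. steps=5 gives near-full orthogonalization;
    # steps=1 is the "lazy" variant described above.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def hopper_direction(grad, steps=1, eps=1e-8):
    # Hypothetical Hopper update direction: one lazy-orthogonal step,
    # then divide by the update's RMS (my reading of "variance
    # normalization"; the post does not pin down the formula).
    U = newton_schulz(grad, steps=steps)
    return U / (np.sqrt(np.mean(U ** 2)) + eps)
```

With `steps=5` the singular values of the output cluster near 1 (near-orthogonal); with `steps=1` they are only partially flattened, which is the in-between geometry being claimed.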
**The surprising observation**

A Hopper checkpoint at step 200 matched, and in one case exceeded, the reasoning behavior of Adam at step 400.

Two examples illustrate the difference:

- Sister's age problem
  - Adam @400: correct (67)
  - Hopper @200: arithmetic error
- Towel drying riddle (parallelism)
  - Adam @400: "2 hours" (linear heuristic)
  - Hopper @200: "1 hour" (parallel reasoning)

Same model. Same data. Different optimizer.

Hopper didn't just learn faster: it learned different structure first.
**Why this happened (hypothesis)**

From ablations:

- SGD lacks variance → stable curves, weak eval
- Adam has variance but no structure → converges quickly, misses some reasoning paths
- Muon has structure but is too rigid in RL → entropy floor + hallucinations

Hopper sits in between:

- enough structure to break linear heuristics
- enough flexibility to let the model "give up" on over-constrained solutions

Reducing `ns_steps` from 5 → 1 was the key change.
**The catch: Hopper needs early stopping**

Hopper does not want to run forever. Later checkpoints regress on simple tasks:

- the towel problem flips back to "2 hours"
- arithmetic gets worse

My working interpretation: continuous orthogonal rotation keeps the model exploratory, which is great for discovering reasoning structure but bad for final convergence.

A two-phase schedule seems promising:

- Hopper for exploration / structure discovery
- Adam for final collapse and refinement

(Testing ongoing.)
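One way to wire the two phases together with the early-stopping requirement is a small controller that hands off from Hopper to Adam on either a step budget or an eval regression. This is only a sketch of the idea being tested; the class, thresholds, and hand-off criteria are illustrative, not from the author's code:

```python
class TwoPhaseSchedule:
    """Hand off from Hopper (exploration) to Adam (refinement)
    when either a step budget is hit or eval scores stop improving
    for `patience` consecutive evaluations."""

    def __init__(self, max_hopper_steps=300, patience=3):
        self.max_hopper_steps = max_hopper_steps
        self.patience = patience
        self.best_eval = float("-inf")
        self.bad_evals = 0
        self.phase = "hopper"

    def record_eval(self, score):
        # Only monitor regressions while still in the Hopper phase.
        if self.phase != "hopper":
            return
        if score > self.best_eval:
            self.best_eval = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.phase = "adam"  # eval regressed: start refinement

    def phase_at(self, step):
        if self.phase == "hopper" and step >= self.max_hopper_steps:
            self.phase = "adam"
        return self.phase
```

In a training loop, `phase_at(step)` would select which optimizer applies the update, and `record_eval` would be called after each periodic evaluation.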
**Why I'm sharing**

This isn't a SOTA claim. It is a concrete example where optimizer geometry appears to influence what a model learns first, not just how fast.

Write-up with plots, ablations, and eval screenshots here: Hopper: The Optimizer That Learns Parallelism 2x Faster Than Adam

I'll open-source code + checkpoints shortly.

Would love to hear:

- whether others have seen optimizer-dependent reasoning emergence
- interpretations of partial vs. full orthogonalization in RL
- counter-examples where this shouldn't work