Hopper — partial orthogonalization changes early reasoning behavior in RL

I want to share an empirical result from recent RL experiments that surprised me enough to write it up.

I built an optimizer variant I’m calling Hopper, originally as a fix for Muon instability during RL fine-tuning on small language models. What I didn’t expect was that it would change which kinds of reasoning emerged early in training.


Setup (brief)

  • Model: ~0.5B decoder-only LM

  • Training: GRPO / DAPO-style RL

  • Eval: GSM8K-style arithmetic + reasoning riddles

  • Comparison: SGD, Adam, Muon, Hopper

  • Architecture, data, and rewards held constant


What Hopper is

Hopper = Muon + variance normalization + ns_steps = 1

Intuitively:

  • Muon uses orthogonal updates via Newton–Schulz

  • Full orthogonalization (ns_steps=5) was too rigid in RL

  • Hopper applies only one iteration — what I think of as “lazy orthogonality”

Structured, but not stiff.
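Since the post doesn't include code, here is a minimal sketch of what a Hopper-style update direction could look like. The exact form of the variance normalization and the pre-scaling are my assumptions; the quintic coefficients are the ones used in the public Muon implementation:

```python
import numpy as np

def hopper_direction(grad, ns_steps=1, eps=1e-7):
    """Sketch of a Hopper-style update direction (assumptions labeled).

    1. Variance-normalize the gradient (assumed form: scale to unit RMS).
    2. Run `ns_steps` Newton-Schulz iterations toward an orthogonal matrix;
       ns_steps=1 is the "lazy orthogonality" setting, ns_steps=5 recovers
       Muon-like full orthogonalization.
    """
    a, b, c = 3.4445, -4.7750, 2.0315            # quintic coefficients from Muon
    g = grad / (np.sqrt(np.mean(grad ** 2)) + eps)  # variance normalization (assumed)
    X = g / (np.linalg.norm(g) + eps)            # pre-scale so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                                # iterate on the wide orientation
        X = X.T
    for _ in range(ns_steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X    # one quintic Newton-Schulz step
    return X.T if transpose else X
```

With ns_steps=5 the singular values of the output are pushed close to 1 (near-orthogonal); with ns_steps=1 they are only partially equalized, which is the "structured, but not stiff" behavior described above.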


The surprising observation

A Hopper checkpoint at step 200 matched — and in one case exceeded — the reasoning behavior of Adam at step 400.

Two examples illustrate the difference:

  • Sister’s age problem

    • Adam @400: ✅ correct (67)

    • Hopper @200: ❌ arithmetic error

  • Towel drying riddle (parallelism)

    • Adam @400: ❌ “2 hours” (linear heuristic)

    • Hopper @200: ✅ “1 hour” (parallel reasoning)

Same model. Same data. Different optimizer.

Hopper didn’t just learn faster — it learned different structure first.


Why this happened (hypothesis)

From ablations:

  • SGD lacks variance adaptation → stable curves, weak eval

  • Adam has variance adaptation but no structure → converges quickly, misses some reasoning paths

  • Muon has structure but is too rigid in RL → entropy floor + hallucinations

Hopper sits in between:

  • enough structure to break linear heuristics

  • enough flexibility to let the model “give up” on over-constrained solutions

Reducing ns_steps from 5 → 1 was the key change.


The catch: Hopper needs early stopping

Hopper does not want to run forever.

Later checkpoints regress on simple tasks:

  • towel problem flips back to 2 hours

  • arithmetic gets worse

My working interpretation:

  • continuous orthogonal rotation keeps the model exploratory

  • great for discovering reasoning structure

  • bad for final convergence

A two-phase schedule seems promising:

  1. Hopper for exploration / structure discovery

  2. Adam for final collapse and refinement

(Testing ongoing.)
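As a concrete sketch of that two-phase schedule, here is one way the handoff could be wired up. The switch step and the optimizer step functions are placeholders for illustration, not the actual experiment config:

```python
def make_two_phase_step(hopper_step, adam_step, switch_at=200):
    """Return a step function that uses `hopper_step` during the exploration
    phase and `adam_step` afterwards. `switch_at=200` is illustrative; in
    practice you'd pick it from eval curves, before Hopper's late-checkpoint
    regression sets in.
    """
    state = {"t": 0}

    def step(params, grads):
        # Phase 1: Hopper for structure discovery; Phase 2: Adam for refinement.
        update = hopper_step if state["t"] < switch_at else adam_step
        state["t"] += 1
        return update(params, grads)

    return step
```

Called once per training step. One design question this sketch leaves open is whether Adam's moment estimates should be initialized fresh (or warmed up) at the switch rather than starting cold.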


Why I’m sharing

This isn’t a SOTA claim.
It is a concrete example where optimizer geometry appears to influence what a model learns first, not just how fast.

Write-up with plots, ablations, and eval screenshots here:
👉 Hopper: The Optimizer That Learns Parallelism 2x Faster Than Adam

I’ll open-source code + checkpoints shortly.

Would love to hear:

  • whether others have seen optimizer-dependent reasoning emergence

  • interpretations of partial vs full orthogonalization in RL

  • or counter-examples where this shouldn’t work


Update: I open-sourced the training script and model weights on the hub here.

Happy hopping! 🐰
