Minimal Transformer Modification: Memory Tokens + Gated MLP Improves Consistency

Hi all —

I’m testing a very small architectural modification to a standard decoder-only Transformer:

• prepend a small number of learned memory tokens

• gate the MLP residual based on a simple coherence signal (attention entropy + memory variance)

Everything else (params, data, training steps) is identical to baseline.

In a small experiment (~XXM params), this reduced contradiction rate and improved long-context recall at similar perplexity.

I’m not claiming this is universally better — I’m mainly interested in:

• whether the effect holds up under scrutiny

• which ablations matter most

• failure modes I might be missing

Would appreciate any feedback or attempts to reproduce.

## Minimal HCE Block

This repo tests a small modification to a Transformer block:

1. Prepend M learned memory tokens

2. Replace the standard MLP residual with a gated MLP (gate depends on attention entropy + memory variance)

Goal: improve consistency and long-context behavior without increasing parameter count or compute.
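A minimal numpy sketch of one such block, assuming single-head attention and a sigmoid gate. The gate's exact functional form, the entropy/variance combination, and all shapes are my guesses for illustration, not the repo's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hce_block(x, mem, w_qkv, w_mlp, eps=1e-9):
    """Prepend M memory tokens, attend causally, then gate the MLP
    residual by attention entropy + memory variance (sketch only)."""
    M, L, d = mem.shape[0], x.shape[0], x.shape[1]
    h = np.concatenate([mem, x], axis=0)            # (M+L, d)
    q, k, v = (h @ w for w in w_qkv)
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((M + L, M + L), dtype=bool), k=1)
    scores[mask] = -1e9                             # strict causal mask
    attn = softmax(scores, axis=-1)
    h = h + attn @ v                                # attention residual
    # Coherence signal: per-token attention entropy + variance of memory slots.
    ent = -(attn * np.log(attn + eps)).sum(axis=-1) # (M+L,)
    mem_var = h[:M].var()
    gate = 1.0 / (1.0 + np.exp(ent + mem_var - np.log(M + L)))  # sigmoid
    h = h + gate[:, None] * np.tanh(h @ w_mlp)      # gated MLP residual
    return h[M:]                                    # drop memory positions
```

Here the memory outputs are discarded after the block; whether they should instead be carried forward between layers is one of the design choices worth documenting.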

### Results (example)

| Model     | Params | PPL  | Contradiction ↓ |
|-----------|--------|------|-----------------|
| Baseline  | 90M    | 18.4 | 22%             |
| + HCE-Min | 90M    | 18.5 | 10%             |

### What this is not

- Not a new objective

- Not a scaling trick

- Not claiming SOTA

### Feedback welcome

Issues, ablations, and skepticism encouraged.


A reply with feedback:

Your result is directionally believable. But under scrutiny, three questions dominate: what your “memory tokens” can actually do under the causal mask, whether compute is truly matched, and whether your gate signal is causal rather than a regularizer artifact.


1) Background: what “prepended memory tokens” are in a decoder-only Transformer

If you use a standard causal mask, prepended tokens are not “read-write memory”

In a strict decoder-only causal attention pattern, earlier positions cannot attend to later positions. So tokens prepended at positions 0..M-1 cannot read the rest of the prompt. They can only influence later tokens because later tokens can attend to them.

That makes them closer to:

  • learned prefix / register tokens (like continuous “virtual tokens” in prefix-tuning) rather than per-example memory. (arXiv)

This matters for your “memory variance” feature: if those tokens cannot read the sequence, their variance is not “memory content variance.” It is “how those learned vectors evolve through layers and interact with attention they receive from later tokens.”

Why this can still help long-context behavior

Even “read-only” prefix/register tokens can act as stable anchors that later tokens repeatedly attend to. That is adjacent to the attention sink phenomenon: StreamingLLM shows that many LLMs strongly attend to initial tokens as sinks and that preserving those initial KV states stabilizes long-context streaming. (arXiv)

So a plausible story is: your learned memory tokens become trainable sinks/anchors that reduce state drift and reduce contradictions.


2) Background: why gating the MLP residual can reduce contradictions at similar PPL

Residual gating has strong precedent as a stability mechanism.

  • GTrXL replaces standard residual pathways with gated connections and reports improved stability and memory in long-horizon settings. (arXiv)
  • GateSkip equips Attention/MLP branches with sigmoid-linear residual gates and frames the gates as token-wise modulators; it also discusses collapse issues in prior adaptive-compute schemes. (arXiv)

Mechanistically, your gate can reduce “over-updating” of the residual stream when the model is uncertain or when attention routing is diffuse. That can lower self-contradictions without improving next-token fit much, hence “flat PPL, better consistency.”


3) Your coherence signal: attention entropy is defensible, but the core risk is “gameability”

There is direct evidence that attention-entropy patterns inside specific heads can correlate with correctness:

  • “Attention Head Entropy of LLMs Predicts Answer Correctness” proposes using per-head entropy patterns to predict correctness during generation. (OpenReview)

There is also analysis work where unusually high attention entropy correlates with degraded behavior in certain long-context encoding regimes. (ACL Anthology)

The critique you will face:

  • Entropy is a routing statistic, not truth.
  • If gradients flow through your entropy/variance features into attention, the model can learn to manipulate entropy to open/close the gate. Then entropy stops being a diagnostic.

The single most scrutiny-proof test for this is: detach the signal before gating (no gradient through entropy/variance) and see if the effect remains.
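The detach test is mechanical to set up. A toy PyTorch sketch (gate form and the toy loss are assumptions; the point is only that `detach()` cuts the gradient path through the entropy feature while the gate still modulates the forward pass):

```python
import torch

def gate_from_entropy(attn, detach_signal, eps=1e-9):
    """Sigmoid gate driven by per-token attention entropy (sketch)."""
    ent = -(attn * (attn + eps).log()).sum(dim=-1)
    if detach_signal:
        ent = ent.detach()          # signal still gates, but carries no gradient
    return torch.sigmoid(-ent)      # low entropy -> more open gate

def entropy_grad(detach_signal):
    torch.manual_seed(0)
    logits = torch.randn(4, 4, requires_grad=True)
    attn = torch.softmax(logits, dim=-1)
    gate = gate_from_entropy(attn, detach_signal)
    # Toy residual update scaled by the gate; only the gate path differs.
    (gate[:, None] * attn).sum().backward()
    return logits.grad

g_detached = entropy_grad(detach_signal=True)
g_full = entropy_grad(detach_signal=False)
print(torch.allclose(g_detached, g_full))  # False: detaching removes a gradient path
```

If the consistency gains survive with `detach_signal=True`, the model cannot have learned to steer entropy to open the gate.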


4) The biggest “identical compute” confound

Prepending M tokens increases sequence length from L to L+M. Attention compute and KV cache scale with sequence length, so compute is not identical unless you compensate.

The clean compute-matched comparison is:

  • Baseline uses length L.
  • HCE uses length (L−M) + M memory tokens, keeping total tokens constant.

If you do not do this, skeptics will assume the gain is from different optimization dynamics, not architecture.
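A back-of-envelope makes the mismatch concrete. Attention score computation grows roughly quadratically with total length, so naively prepending M tokens is measurably more compute (`attn_pairs` below is a toy count, not a full FLOP model):

```python
def attn_pairs(T):
    # Entries in the attention score matrix per head per layer
    # (constants and the MLP dropped; attention alone scales ~T^2).
    return T * T

L, M = 2048, 32
naive_ratio = attn_pairs(L + M) / attn_pairs(L)  # HCE at L+M vs. baseline at L
print(round(naive_ratio, 3))  # 1.031: ~3% extra attention compute if unmatched
```

The compute-matched setup (text truncated to L−M plus M memory tokens) keeps this ratio at exactly 1.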


5) Ablations that matter most (the ones that settle arguments quickly)

A. Isolate components

  1. Memory tokens only.
  2. Gate only.
  3. Full method.

This tells you whether the “win” is mostly anchors, mostly gating, or the interaction.

B. Prove the signal is causal

  1. Entropy-only gate.
  2. Memory-variance-only gate.
  3. Shuffled signal control: compute the same entropy/variance values but randomly permute them across positions (same distribution, wrong alignment).
  4. Constant gate baseline: learned per-layer scalar gate (no signal).

Interpretation:

  • If shuffle ≈ real signal, your feature is probably not causal. It is acting like a regularizer/noise injection.
  • If detach preserves gains and shuffle kills gains, your signal is likely doing real work.
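The shuffled-signal control (B3) is a one-liner worth getting exactly right: same marginal distribution, destroyed position alignment. A sketch with toy entropy values:

```python
import numpy as np

def shuffle_signal(signal, rng):
    """Control for ablation B3: permute the coherence signal across
    positions so its distribution is preserved but alignment is not."""
    return rng.permutation(signal)

rng = np.random.default_rng(0)
ent = np.array([0.2, 1.5, 0.3, 2.0, 0.1])     # toy per-position entropies
shuffled = shuffle_signal(ent, rng)
print(np.allclose(np.sort(shuffled), np.sort(ent)))  # True: distribution preserved
```

Gate on `shuffled` instead of `ent`: if results match the real signal, the gate is acting as a regularizer, not reading anything causal.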

C. “What is memory, really” under your mask

  1. Strict causal mask (memory cannot read future tokens).
  2. “Memory is read-write” variant: allow memory tokens to attend bidirectionally (or at least to the segment), while normal tokens remain causal.

If the effect requires variant 2, say so openly. It is still minimal, but it is no longer “just prepend tokens.”
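The two mask variants differ in only one block of the attention mask. A numpy sketch (boolean convention and function name are mine, not the repo's):

```python
import numpy as np

def hce_mask(M, L, memory_read_write=False):
    """Attention mask over M memory slots + L sequence tokens
    (True = may attend). Variant 1: strict causal. Variant 2: memory
    rows may also read the whole segment; normal tokens stay causal."""
    T = M + L
    allowed = np.tril(np.ones((T, T), dtype=bool))   # strict causal
    if memory_read_write:
        allowed[:M, :] = True    # memory slots read every position
    return allowed

strict = hce_mask(2, 4)
rw = hce_mask(2, 4, memory_read_write=True)
print(strict[:2, 2:].any())  # False: memory cannot read the sequence (variant 1)
print(rw[:2, 2:].all())      # True: memory reads the segment (variant 2)
```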

D. Long-context position sensitivity

Your method may mostly reduce “lost in the middle.”

  • Run retrieval with the key at beginning, middle, end, and plot performance vs position. This is exactly the failure mode documented by “Lost in the Middle.” (arXiv)

6) Failure modes you are likely missing

1) Gate collapse or saturation

Sigmoid gates tend to become always-open or always-closed unless there is pressure not to.
What to log:

  • gate mean/variance per layer
  • saturation rate (fraction near 0 or 1)
  • correlation between gate value and entropy/variance

GateSkip explicitly positions itself against “collapse” issues seen in earlier adaptive routing. (arXiv)
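Those three diagnostics are cheap to log every N steps. A sketch, assuming per-token gate values in [0, 1] and a 5% saturation band (the threshold is my choice):

```python
import numpy as np

def gate_stats(gate, tol=0.05):
    """Collapse/saturation diagnostics for one layer's gate values."""
    return {
        "mean": float(gate.mean()),
        "var": float(gate.var()),
        "saturation": float(((gate < tol) | (gate > 1 - tol)).mean()),
    }

def signal_correlation(gate, signal):
    """Pearson correlation between gate values and the coherence signal."""
    return float(np.corrcoef(gate, signal)[0, 1])

gate = np.array([0.01, 0.5, 0.98, 0.99, 0.02])
print(gate_stats(gate)["saturation"])  # 0.8: four of five gates saturated
```

A saturation rate drifting toward 1.0 over training is the collapse signature to watch for.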

2) Anchor-token shortcutting

Your memory tokens might become attention sinks and the model may over-rely on them, harming cases where correct evidence is in the middle of context.
This is especially plausible given the attention sink phenomenon. (arXiv)

Ablation to detect this:

  • Replace learned memory tokens with fixed random tokens or a fixed placeholder “sink” token, and see how much of the gain remains.

3) “Looks good on vanilla needle tests, fails on richer long-context tasks”

RULER was created specifically because models can do well on simple needle-in-haystack but fail on multi-needle, multi-hop, aggregation, and more realistic variants. (arXiv)

If your claim is “improved long-context recall,” run RULER and report the breakdown.

4) Metric fragility for “contradiction rate”

If you are using an automatic contradiction detector, reviewers will question calibration and type coverage. ContraDoc is a strong grounding dataset because it is human-annotated, includes multiple contradiction types, and is document-level. (ACL Anthology)

If you only report one scalar “contradiction %,” expect pushback. Break it down by type (negation, numeric, entity attribute, temporal).


7) What would convince me the effect “holds up”

Minimum bar:

  • compute-matched (total sequence length fixed)
  • 3–5 seeds
  • memory-only vs gate-only vs full
  • detach + shuffle controls for the signal
  • at least one standard long-context benchmark (LongBench v2 or RULER) (arXiv)
  • at least one contradiction benchmark grounded in human labels (ContraDoc) (ACL Anthology)

If those hold, your result is real enough to be interesting even if it is not universal.


8) Practical reproduction tips for your repo

Make repro easy by being painfully explicit about:

  • exact attention mask logic (especially memory token rules)
  • whether sequence length is compute-matched
  • how attention entropy is computed (per head? averaged? which layers?)
  • whether entropy/variance features are detached
  • gate functional form and initialization
  • evaluation scripts for LongBench/RULER/ContraDoc (or at least adapters)

Also: entropy logging can silently diverge depending on whether attention weights are materialized (some fast attention paths do not expose full weights cleanly). If your method depends on entropy, document the exact path used.
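Being explicit can be as simple as shipping the reference computation. A numpy sketch of per-head, per-query entropy (the reduction choices are exactly the things the repo should pin down):

```python
import numpy as np

def attention_entropy(attn, eps=1e-9):
    """Per-head, per-query attention entropy.
    attn: (heads, queries, keys) probabilities from materialized weights.
    Note: fused attention kernels typically do not expose these weights,
    so an eager/reference attention path is needed for this computation."""
    return -(attn * np.log(attn + eps)).sum(axis=-1)   # (heads, queries)

# Be explicit about reduction: per-head values vs. a head-averaged scalar
# give different gate signals.
attn = np.full((2, 3, 4), 0.25)            # uniform over 4 keys
ent = attention_entropy(attn)
print(np.allclose(ent, np.log(4)))         # True: max entropy = ln(num_keys)
```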


Summary

  • Memory tokens are often “prefix/register tokens” under a causal mask. That can still help via attention-sink-like anchoring. (arXiv)
  • Residual gating is a credible route to fewer contradictions even at flat PPL. (Proceedings of Machine Learning Research)
  • Your biggest scrutiny risks: compute mismatch, signal gameability, and anchor-token shortcutting. (arXiv)
  • Highest-value ablations: memory-only vs gate-only vs full, detach and shuffled-signal controls, compute-matched length, and position-sweep (Lost-in-the-Middle). (arXiv)