I’ve been working on an alternative attention mechanism that treats language
as a physical field system instead of using standard O(n²) self-attention.
How it works:
- Tokens are mapped onto a continuous 1D field
- Information propagates along the field as damped waves, via the kernel k(t) = exp(-α·t)·cos(ω·t + φ)
- Each attention head has just 3 learnable physics parameters: frequency ω, damping α, and phase φ
- The field convolution is computed via FFT in O(n log n) (minimal sketch after this list)
- Heads self-organize into different roles (local grammar, medium context, long-range)
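
For concreteness, here's a minimal single-head sketch of the idea in NumPy. It's a stripped-down illustration (one scalar field channel, no cross-head coupling, my own padding choices), not the actual V3.5 code:

```python
import numpy as np

def wave_head(x, alpha, omega, phi):
    """One wave-field head: causally convolve the 1D token field x with a
    damped-cosine kernel k(t) = exp(-alpha*t) * cos(omega*t + phi).
    The convolution is done via FFT, so the cost is O(n log n)."""
    n = len(x)
    t = np.arange(n)
    k = np.exp(-alpha * t) * np.cos(omega * t + phi)  # kernel over past offsets t >= 0
    # Zero-pad to 2n so circular FFT convolution equals linear (causal) convolution
    X = np.fft.rfft(x, 2 * n)
    K = np.fft.rfft(k, 2 * n)
    y = np.fft.irfft(X * K, 2 * n)[:n]                # output at i depends only on x[:i+1]
    return y

# Toy usage: one scalar channel of a 2048-token field
x = np.random.randn(2048)
y = wave_head(x, alpha=0.05, omega=0.3, phi=0.0)
```

Since each kernel is fully determined by three scalars, the per-head parameter cost stays constant regardless of sequence length.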
Results (WikiText-2, 6M params, character tokenizer):
| Model | PPL | Accuracy | Complexity |
|---|---|---|---|
| Standard Transformer | 5.9 | 51.0% | O(n²) |
| Wave Field V3.5 | 6.2 | 50.5% | O(n log n) |
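
Both metrics are per-character: PPL is exp of the mean character cross-entropy, accuracy is top-1 next-character prediction. A quick sketch of the computation, using standard definitions rather than either model's exact eval code:

```python
import numpy as np

def lm_metrics(logits, targets):
    """Character-level perplexity and top-1 next-character accuracy.
    logits: (N, vocab) unnormalized scores; targets: (N,) true next-character ids.
    Standard definitions, shown for reference only."""
    z = logits - logits.max(axis=1, keepdims=True)            # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets].mean()      # mean cross-entropy in nats
    ppl = float(np.exp(nll))                                   # PPL = exp(loss)
    acc = float((logits.argmax(axis=1) == targets).mean())     # top-1 accuracy
    return ppl, acc
```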
At longer sequences the compute savings over O(n²) attention grow: 31x at 2K tokens, 107x at 8K, 367x at 32K.
Known limitations:
- With a BPE tokenizer (8K vocab), there's a significant capacity gap vs. the standard transformer
- This is a model capacity issue at small scale, not an architecture flaw
- Currently scaling to 100M params to see if the gap closes
What’s unique:
- Every bug during development was found through physics-based diagnostics (energy flow, conservation, causality tests), not guessing; a minimal causality check is sketched after this list
- Cross-head field coupling and wave interference for information routing
- Not a Mamba/Hyena variant — different approach entirely
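
To give a flavor of the physics-based diagnostics, here's a stripped-down causality check (names and tolerance are illustrative, and it reuses the `wave_head` sketch above): perturb the field at a future position and assert that earlier outputs don't move.

```python
import numpy as np

def check_causality(head_fn, n=1024, probe=700, tol=1e-8):
    """Diagnostic sketch: outputs at positions < probe must be unchanged when
    the input field is perturbed at position `probe` or later."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n)
    x_pert = x.copy()
    x_pert[probe:] += rng.standard_normal(n - probe)   # perturb only the "future"
    y, y_pert = head_fn(x), head_fn(x_pert)
    leak = np.abs(y[:probe] - y_pert[:probe]).max()    # any change here means future leakage
    assert leak < tol, f"causality violated: leakage {leak:.2e}"
    return leak

# Usage with the wave_head sketch above
check_causality(lambda x: wave_head(x, alpha=0.05, omega=0.3, phi=0.0))
```

The energy and conservation checks follow the same pattern: compute a quantity the physics says should be bounded or conserved and assert it numerically.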
Happy to answer questions about the physics, architecture decisions, or results.