Project Janus: Engineering the "Shape" of Attention (Step 4.5k Update)

EXOROBOURII · November 30, 2025, 9:12pm

We are currently training Janus-Small (40M params) on the TinyStories dataset. This isn’t a standard training run; it’s a validation of Mechanistic Regularization—a technique where we actively steer the model’s internal geometry using a differentiable loss term that penalizes redundant attention heads.

At Step 4,500 (22% of the way to convergence), the telemetry is validating our core hypothesis: that we can force a Transformer to be structurally efficient without sacrificing performance.

The Telemetry (Step 4,500)

The model is currently outperforming our standard baseline in structural metrics while maintaining parity in perplexity.

• Task Loss: 1.53 (Improving, -0.023 in last 100 steps).

• Structural Redundancy (\sigma_a): 0.27 (Stable).

• Context: Standard baselines at this scale typically sit at ~0.46. Janus is operating with ~40% less internal redundancy.

• Steering Pressure: 0.009. We are currently in the “Release” phase of our Trapezoidal Schedule, allowing the model to fine-tune the efficient structures it crystallized earlier.

The Topology: “The Information Funnel”

Our layer-wise analysis confirms that Gradient Steering (scaling pressure by depth) has successfully sculpted the model’s internal information flow. We are seeing a clear “Funnel” topology:

• Input (Layer 0): High Rank (~99), High Redundancy (~0.60). The model retains a broad, robust scan of the input tokens.

• Output (Layer 11): Collapsed Rank (~29), Near-Zero Redundancy (~0.08). The model forces orthogonal decision-making at the final layer.

Qualitative Audit: The “Creative Bias”

We ran a diagnostic task battery to check for “brain damage” (a common side effect of structural pruning). The results were surprising. The model has mastered grammar and causal logic, but exhibits a fascinating bias toward Narrative Coherence over Fact Retrieval.

Prompt: “Lily has a blue hat. Tom has a red hat. Lily is wearing a…”

Janus Output: “Lily is wearing a purple hat. She has a purple flower on her head.”

Instead of simply retrieving the variable (“blue”), the model hallucinated a new color to match the context of the “flower” it invented in the next sentence. It prioritized Thematic Consistency over Recall.

Next Steps

We are letting the run continue to 20,000 steps to observe if this “Funnel” topology holds at full convergence. If the efficiency gap persists, we will be releasing the JanusBlock code and the training recipes.

EXOROBOURII · January 21, 2026, 2:29pm

The continuation of this work can be found here:

Topic		Replies	Views
Endorsement Request (cs.AI / cs.LG) - Project Janus Part II: Mechanistic Validation of Orthogonal Regularization via Sparse Autoencoder Analysis Research	1	16	January 21, 2026
Hugging Face Reads - 01/2021 - Sparsity and Pruning Research	14	7578	June 3, 2025
Thermodynamic Attention: Entropy-Based Memory Eviction for Long-Context Transformers Research	2	25	February 9, 2026
Reproduce attention is all you need Beginners	0	536	June 25, 2022
SwiftTransformer: A New LLM Transformer Design — Building Smarter, Faster, and More Efficient Transformers Research	0	41	October 20, 2025

Project Janus: Engineering the "Shape" of Attention (Step 4.5k Update)

Related topics