LaCT (Large-Chunk Test-Time Training, SwiGLU MLP fast weights) 350M (rank 64): Low-rank Fast-Weight Ablation

A 350M-parameter LaCT model with a low-rank fast-weight parameterization (r64), pretrained on FineWeb-Edu. It is one cell of a multi-cell ablation across 4 architectures × {r32, r64, r256, rfull} (plus extra GDN cells at r512). The ablation studies whether constraining the q/k/v fast-weight projections (or, for LaCT, the SwiGLU MLP) to low rank can match or exceed full-rank performance at the 350M scale.
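To make the rank labels concrete, the sketch below counts the parameters of a single weight matrix when it is stored densely versus factored into two rank-r matrices. It assumes a square 1024 × 1024 projection (matching hidden=1024) and a plain two-factor decomposition. Both are illustrative assumptions; this card does not describe how the low-rank weights are actually laid out in the model.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Parameter count of a full (dense) d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of a rank-r factorization W ~= A @ B,
    with A of shape (d_in, rank) and B of shape (rank, d_out)."""
    return rank * (d_in + d_out)


# Assumption for illustration only: a square projection at the model's hidden size.
hidden = 1024
full = dense_params(hidden, hidden)

for rank in (32, 64, 256):
    factored = low_rank_params(hidden, hidden, rank)
    print(f"r{rank}: {factored:,} params ({factored / full:.1%} of dense {full:,})")
```

Under these assumptions an r64 factorization holds one eighth of the parameters of the dense matrix, which is why total model size differs between rank cells.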

Training

Architecture: LaCT (Large-Chunk Test-Time Training, SwiGLU MLP fast weights)
Rank: r64
Params: ~350M (hidden=1024, layers=24, heads=16)
Dataset: HuggingFaceFW/fineweb-edu (streaming)
Steps: 10000
Effective batch: 256
Sequence length: 8000
Optimizer: AdamW (lr=3e-4, eps=1e-15)
LR schedule: Cosine, 512-step warmup, decay to 10%
Precision: bf16
Activation checkpointing: selective (option 1)
Tokens: ~20.5 B
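
The schedule and token budget above can be sketched as follows. This is a minimal illustration, not the training code: it assumes the warmup is linear, that "decay to 10%" means 10% of the peak learning rate, and that the effective batch is counted in sequences. The function name `lr_at` is hypothetical.

```python
import math


def lr_at(step: int,
          peak_lr: float = 3e-4,
          warmup_steps: int = 512,
          total_steps: int = 10_000,
          final_frac: float = 0.10) -> float:
    """Cosine schedule with warmup (linear warmup is an assumption)."""
    if step < warmup_steps:
        # Ramp linearly up to peak_lr over the warmup steps.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = peak_lr * final_frac
    return min_lr + (peak_lr - min_lr) * cosine


# Token budget: steps x sequences per step x tokens per sequence.
steps, batch_sequences, seq_len = 10_000, 256, 8000
total_tokens = steps * batch_sequences * seq_len

print(f"lr at step 512:   {lr_at(512):.2e}")
print(f"lr at step 10000: {lr_at(10_000):.2e}")
print(f"total tokens:     {total_tokens / 1e9:.2f} B")
```

The product 10000 × 256 × 8000 is 20.48 B, which agrees with the ~20.5 B tokens listed above.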

Eval results

  • FineWeb-Edu val PPL: 13.42
  • LAMBADA acc: 0.297
  • HellaSwag acc_norm: 0.367
  • ARC-Easy acc_norm: 0.470
  • ARC-Challenge acc_norm: 0.254
  • PIQA acc_norm: 0.648
  • WinoGrande acc: 0.520
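
For reading the validation number, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that relationship and converts the reported PPL back into a loss. It assumes the reported value is a per-token perplexity computed with natural logarithms; this card does not spell out the evaluation code.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model assigning probability 1/4 to every token has PPL 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Convert the reported validation PPL back into a mean loss.
val_ppl = 13.42
mean_nll_nats = math.log(val_ppl)             # about 2.60 nats per token
mean_nll_bits = mean_nll_nats / math.log(2)   # the same quantity in bits per token

print(f"uniform PPL: {uniform_ppl:.2f}")
print(f"loss: {mean_nll_nats:.3f} nats/token, {mean_nll_bits:.3f} bits/token")
```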

Notes on the 350M sweep

  • Downstream evals start to discriminate at 350M. At 100M, HellaSwag / LAMBADA were near-chance for most cells; at 350M they discriminate clearly between architectures and ranks.
  • PPL doesn't linearly predict downstream. At matched ~374M, GLA rfull has worse FineWeb-Edu PPL than DeltaNet rfull (14.42 vs 12.55) but wins on every lm-harness task (LAMBADA, HellaSwag, PIQA, ARC-E).
  • GatedDeltaNet dominates at the cost of size. GDN rfull is 526M (head_dim=256 inflates q/k/v) and wins every metric; GDN r256 (432M) is the matched-param comparison and still leads.
  • LaCT is rank-robust at 350M. PPL/LAMBADA stay flat across r64 / r256 / rfull, the cleanest evidence for the "low rank as regularization" hypothesis.
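
To put "near-chance" in context for this cell, the sketch below subtracts a random-guess baseline from each multiple-choice score reported under Eval results. The baselines are assumptions based on the usual number of answer options per task (four for HellaSwag and ARC, where a few ARC questions have a different count, and two for PIQA and WinoGrande). LAMBADA is left out because open-vocabulary word prediction has no comparable guessing baseline.

```python
# Approximate random-guess accuracy per task (assumed from the answer-option count).
CHANCE = {
    "HellaSwag": 0.25,
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "PIQA": 0.50,
    "WinoGrande": 0.50,
}

# Scores for this r64 cell, copied from the Eval results list.
REPORTED = {
    "HellaSwag": 0.367,
    "ARC-Easy": 0.470,
    "ARC-Challenge": 0.254,
    "PIQA": 0.648,
    "WinoGrande": 0.520,
}

# Margin over chance: how far each score sits above random guessing.
margins = {task: round(REPORTED[task] - CHANCE[task], 3) for task in CHANCE}

for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:14s} +{margin:.3f} over chance")
```

Under these baselines, ARC-Challenge and WinoGrande sit close to chance for this cell, so differences between cells on those two tasks are the least informative.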

Run name: lact_350M_r64_bs256_lr3e-4_steps10000

Safetensors: model size 0.4B params, tensor type F32

Dataset used to train nlproj/lact_350M_r64_bs256_lr3e-4_steps10000: HuggingFaceFW/fineweb-edu