HSSM v2 250M

HSSM v2 is a hierarchical state-space language model with sparse Mixture-of-Experts routing for autoregressive text generation. This release contains the FineWeb-Edu pretrained checkpoint published by DevHunterAI.

(Architecture diagram: HSSM_v2_architecture.png)

Model Summary

HSSM v2 combines local depthwise temporal mixing, chunk-level hierarchical state propagation, residual gating, and sparse Mixture-of-Experts feed-forward blocks in a single causal language model.

This release corresponds to the pretrained checkpoint:

  • hssm_v2_250m_fineweb_edu_final.pt

Model scale:

  • Total parameters: 250,040,256 (~250M)
  • Active parameters per token path: 26,534,400 (~26.5M)
  • Architecture: sparse MoE language model with top-1 expert routing in MoE layers

This checkpoint was pretrained on:

  • HuggingFaceFW/fineweb-edu
  • 1.25B tokens

Training note:

  • pretrained in approximately 2 hours on an NVIDIA RTX Pro 6000 Blackwell GPU

Intended Use

This model is intended for:

  • research on hierarchical state-space language models
  • experimentation with sparse expert routing for autoregressive text generation
  • continued fine-tuning on dialogue, instruction, or domain datasets
  • architecture analysis and comparison against transformer and recurrent baselines

This checkpoint is pretrained, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.

Training Dataset

The pretraining data source for this release is:

  • Dataset: HuggingFaceFW/fineweb-edu
  • Usage mode: streaming pretraining pipeline
  • Token budget: 1.25B tokens
  • Domain: educational and general web text

FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
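
As an illustration, a streaming setup of the kind described above can be built with the Hugging Face datasets library. The following is a minimal sketch assuming a GPT-2 tokenizer and the 1.25B-token budget quoted above; it is not the release's actual pretraining pipeline.

from datasets import load_dataset
from transformers import AutoTokenizer

# Stream FineWeb-Edu without materializing the full corpus on disk
# (illustrative sketch, not the actual training script for this release).
dataset = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

token_budget = 1_250_000_000  # 1.25B-token budget quoted above
seen = 0
for example in dataset:
    seen += len(tokenizer(example["text"])["input_ids"])
    if seen >= token_budget:
        break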

Architecture Overview

HSSM v2 is organized as a stacked hierarchical autoregressive architecture with token embeddings, ten HSSM blocks, final normalization, and a tied language modeling head.

Core configuration

  • vocab_size = 50257
  • d_model = 288
  • n_layers = 10
  • d_ff = 512
  • state_rank = 128
  • chunk_size = 8
  • num_experts = 64
  • experts_per_token = 1
  • expert_dim = 2048
  • moe_every = 4
  • tie_embeddings = true
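
For reference, the same configuration can be written down as a small Python object. The class name and field layout below are illustrative assumptions, not the repository's actual configuration type.

from dataclasses import dataclass

@dataclass
class HSSMv2Config:
    # Mirrors the core configuration listed above (illustrative only).
    vocab_size: int = 50257
    d_model: int = 288
    n_layers: int = 10
    d_ff: int = 512
    state_rank: int = 128
    chunk_size: int = 8
    num_experts: int = 64
    experts_per_token: int = 1
    expert_dim: int = 2048
    moe_every: int = 4
    tie_embeddings: bool = True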

Block structure

Each HSSM v2 block follows this pattern:

  1. RMSNorm
  2. HierarchicalStateMixer
  3. residual add
  4. RMSNorm
  5. GatedMLP or SparseMoE
  6. residual add

Every 4th block uses SparseMoE, so with 10 layers this release contains 2 MoE blocks.
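
A minimal sketch of this block pattern is shown below, assuming PyTorch 2.4+ for nn.RMSNorm; the actual HierarchicalStateMixer, GatedMLP, and SparseMoE modules are defined in hssm_v2_gpu_pretrain.py.

import torch.nn as nn

class HSSMBlockSketch(nn.Module):
    # Illustrative reconstruction of steps 1-6 above, not the released code.
    def __init__(self, d_model, mixer, ffn):
        super().__init__()
        self.norm1 = nn.RMSNorm(d_model)
        self.mixer = mixer   # HierarchicalStateMixer
        self.norm2 = nn.RMSNorm(d_model)
        self.ffn = ffn       # GatedMLP, or SparseMoE in every moe_every-th block

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # steps 1-3
        x = x + self.ffn(self.norm2(x))    # steps 4-6
        return x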

HierarchicalStateMixer

The mixer replaces standard attention with a combination of:

  • depthwise Conv1d local temporal mixing
  • chunking with chunk_size=8
  • mean pooling over chunk windows
  • state compression 288 -> 128
  • state expansion 128 -> 288
  • repeat-interleave back to token length
  • gated residual fusion followed by output projection

This gives the model a hybrid inductive bias with local token interaction and chunk-level state propagation.
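
The listed steps roughly correspond to the sketch below. The kernel size, module names, and causal handling of chunk states are assumptions; the actual implementation lives in hssm_v2_gpu_pretrain.py.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalStateMixerSketch(nn.Module):
    # Illustrative reconstruction of the mixing steps listed above.
    # Note: for simplicity this sketch pools over full chunks; the real model
    # must restrict chunk states to past context to stay causal.
    def __init__(self, d_model=288, state_rank=128, chunk_size=8, kernel_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.local = nn.Conv1d(d_model, d_model, kernel_size,
                               padding=kernel_size - 1, groups=d_model)  # depthwise
        self.compress = nn.Linear(d_model, state_rank)  # 288 -> 128
        self.expand = nn.Linear(state_rank, d_model)    # 128 -> 288
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, t, d = x.shape
        local = self.local(x.transpose(1, 2))[..., :t].transpose(1, 2)  # causal local mixing
        pad = (-t) % self.chunk_size
        h = F.pad(local, (0, 0, 0, pad))                                # pad seq to chunk multiple
        chunks = h.reshape(b, -1, self.chunk_size, d).mean(dim=2)       # mean pool per chunk
        state = self.expand(self.compress(chunks))                      # compress / expand
        state = state.repeat_interleave(self.chunk_size, dim=1)[:, :t]  # back to token length
        fused = local + torch.sigmoid(self.gate(local)) * state         # gated residual fusion
        return self.out(fused)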

Sparse MoE

Sparse MoE blocks use:

  • 64 experts
  • top-1 routing per token
  • expert hidden size 2048
  • auxiliary load-balancing loss

Only one expert path is active per token in each MoE layer, which is why the active parameter count is much smaller than the total parameter count.
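
A minimal top-1 routing sketch under the configuration above (64 experts, expert hidden size 2048) is given below; the expert MLP shape and gating details are assumptions, not the released SparseMoE code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoESketch(nn.Module):
    # Each token is dispatched to exactly one expert, chosen by a learned router.
    def __init__(self, d_model=288, num_experts=64, expert_dim=2048):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, expert_dim), nn.GELU(),
                          nn.Linear(expert_dim, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        # During training, an auxiliary load-balancing loss over `probs`
        # encourages roughly uniform expert usage.
        return out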

Output head

After the final RMSNorm, the model projects hidden states to vocabulary logits using a tied LM head that shares weights with the token embedding matrix.
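
In code, the tying amounts to reusing the embedding matrix as the output projection, as in this brief sketch:

import torch.nn as nn

# Weight tying (tie_embeddings = true): the LM head shares its weight
# matrix with the token embedding instead of learning a separate one.
vocab_size, d_model = 50257, 288
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = embed.weight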

Training Details

During pretraining, each token batch flows through the model as follows:

  1. Tokens are embedded into a continuous space.
  2. Local token interactions are modeled with depthwise convolution.
  3. Chunk summaries are compressed into latent states and expanded back across token positions.
  4. Sparse MoE blocks increase capacity with top-1 expert routing.
  5. Final logits are produced for next-token prediction.

Additional training facts for this release:

  • Pretraining tokens: 1.25B
  • Training hardware: NVIDIA RTX Pro 6000 Blackwell
  • Approximate pretraining duration: 2 hours
  • Objective: autoregressive next-token prediction with an auxiliary MoE load-balancing loss (see the sketch below)
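
The objective can be written as a single training step like the sketch below. The (logits, aux_loss) return signature, the aux_weight value, and the pretrain_step helper are illustrative assumptions, not the release's actual training loop.

import torch.nn.functional as F

def pretrain_step(model, input_ids, optimizer, aux_weight=0.01):
    # Hypothetical training step: next-token cross-entropy plus a weighted
    # MoE load-balancing term, as described in the objective above.
    logits, aux_loss = model(input_ids)          # logits: (batch, seq, vocab)
    ce = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                         input_ids[:, 1:].reshape(-1))
    loss = ce + aux_weight * aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()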

Known Limitations

Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:

  • repetitive continuations
  • weak dialogue alignment
  • unstable chat behavior on open-ended prompts
  • sensitivity to tokenizer choice

For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.

Files in This Repository

  • hssm_v2_250m_fineweb_edu_final.pt — pretrained HSSM v2 checkpoint
  • HSSM_v2_architecture.png — architecture image shown in this model card
  • hssm_v2_gpu_pretrain.py — training/model definition reference
  • hssm_pretrained_chat.py — local loading and generation helper

Example Loading (PyTorch)

from hssm_pretrained_chat import load_pretrained, generate_reply

tokenizer, model = load_pretrained(
    "hssm_v2_250m_fineweb_edu_final.pt",
    "gpt2",
    device="cpu",
)

reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=40,
    temperature=0.0,
    top_k=4,
    top_p=0.65,
    repetition_penalty=1.9,
    no_repeat_ngram_size=6,
)

print(reply)

Repository / Author

  • Model name: HSSM v2 250M
  • Publisher: DevHunterAI
  • Checkpoint type: pretrained public release

Citation

If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.
