Axiom-Dense-380M-Base

Axiom-Dense-380M-Base is a decoder-only causal language model trained from scratch for general-purpose next-token prediction on English web text. This is a base pretrained model, not an instruction-tuned chat model.

Model Summary

  • Model type: decoder-only Transformer (causal LM)
  • Parameter count: 385,849,344
  • Context length: 1,024 tokens
  • Vocabulary: 100,277 (tiktoken cl100k_base)
  • Training objective: autoregressive next-token prediction
  • Special handling: tied input/output embeddings (embed.weight tied to lm_head.weight)

Architecture

The model is a dense Transformer stack with grouped-query attention (GQA) and rotary positional embeddings (RoPE).

  • Hidden size: 1024
  • Layers: 24
  • Attention heads: 16
  • KV heads: 8 (GQA)
  • FFN multiplier: 2.6667 × hidden size (FFN width rounded to a hardware-friendly multiple)
  • Normalization: RMSNorm
  • Positional encoding: RoPE (theta=10000)
  • Activation: SwiGLU
  • Dropout: 0.0
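
As a sanity check, the reported parameter count can be reproduced from the dimensions listed above. The sketch below rests on assumptions that are not stated in this card: no bias terms, two RMSNorm weight vectors per layer plus one final RMSNorm, and an FFN width rounded up to a multiple of 256. The rounding granularity in particular is a guess that happens to match the reported total.

```python
import math

# Dimensions taken from the Architecture list.
hidden = 1024
layers = 24
n_heads = 16
n_kv_heads = 8
vocab = 100_277
ffn_mult = 2.6667
round_to = 256  # ASSUMPTION: rounding granularity for the FFN width

head_dim = hidden // n_heads      # 64
kv_dim = n_kv_heads * head_dim    # 512: K/V projections are narrower under GQA
ffn_dim = math.ceil(ffn_mult * hidden / round_to) * round_to  # 2816

# Attention: Q and output projections are hidden x hidden,
# K and V projections are hidden x kv_dim (ASSUMPTION: no biases).
attn = 2 * hidden * hidden + 2 * hidden * kv_dim

# SwiGLU uses three matrices: gate, up, and down.
ffn = 3 * hidden * ffn_dim

# ASSUMPTION: two RMSNorm weight vectors per layer.
norms = 2 * hidden

# Tied embeddings are counted once.
embed = vocab * hidden

total = embed + layers * (attn + ffn + norms) + hidden  # + final RMSNorm
print(f"{total:,}")  # 385,849,344
```

Under these assumptions the total matches the parameter count in the Model Summary exactly; model.py and config.py remain the authoritative definition.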

Implementation details are defined in:

  • model.py (core architecture and generation)
  • config.py (ModelConfig, TrainConfig)

Training Data

  • Source dataset: HuggingFaceFW/fineweb-edu, sample-10BT split
  • Local dataset path during training: data/fineweb-edu-10BT
  • Text field: text
  • Validation split strategy: deterministic hash split with val_fraction=0.001 and split_seed=1337
  • Document boundary treatment: EOS token appended after each document
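
A deterministic hash split of the kind described above can be sketched as follows. This is an illustration only, not the logic in data.py: the choice of SHA-256, the use of a document key string, and the function name is_validation are all assumptions.

```python
import hashlib

VAL_FRACTION = 0.001  # from the card
SPLIT_SEED = 1337     # from the card


def is_validation(doc_key: str,
                  val_fraction: float = VAL_FRACTION,
                  seed: int = SPLIT_SEED) -> bool:
    """Assign a document to the validation split based on a seeded hash.

    HYPOTHETICAL helper: the real split may hash a different field or use a
    different hash function.
    """
    digest = hashlib.sha256(f"{seed}:{doc_key}".encode("utf-8")).digest()
    # Map the first 8 bytes to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < val_fraction


# The same key always lands in the same split, regardless of iteration order.
keys = [f"doc-{i}" for i in range(100_000)]
val_count = sum(is_validation(k) for k in keys)
print(val_count, "of", len(keys), "documents assigned to validation")
```

Because the assignment depends only on the key and the seed, the split is stable across runs and across streaming order, which is the main reason to prefer hashing over random sampling.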

Training Setup

  • Target tokens: 8,000,000,000
  • Effective tokens per optimizer step: 327,680 (batch_size=1, seq_len=1024, grad_accum=320)
  • Computed optimizer steps: 24,414
  • Tokens covered by the training schedule: 7,999,979,520 (24,414 steps × 327,680 tokens)
  • Optimizer: AdamW8bit (fallback to AdamW if unavailable)
  • LR schedule: warmup, constant phase, cosine decay
  • Warmup steps: 2,000
  • LR max/min: 3e-4 / 1e-5
  • Weight decay: 0.1
  • Betas: (0.9, 0.95)
  • Gradient clipping: 1.0
  • Precision: bfloat16
  • Gradient checkpointing: enabled
  • Compile: disabled in provided config
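
The step arithmetic above, together with one possible shape for the learning-rate schedule, can be sketched as below. The token and step figures follow directly from the listed values. The schedule function is an assumption: the card does not state the warmup shape or the length of the constant phase, so linear warmup is assumed and constant_steps is a hypothetical parameter.

```python
import math

target_tokens = 8_000_000_000
batch_size, seq_len, grad_accum = 1, 1024, 320

tokens_per_step = batch_size * seq_len * grad_accum  # 327,680
total_steps = target_tokens // tokens_per_step       # 24,414
planned_tokens = total_steps * tokens_per_step       # 7,999,979,520


def lr_at(step: int,
          max_lr: float = 3e-4,
          min_lr: float = 1e-5,
          warmup_steps: int = 2_000,
          constant_steps: int = 0,  # HYPOTHETICAL: not given in the card
          total: int = total_steps) -> float:
    """Warmup, then a constant phase, then cosine decay to min_lr."""
    if step < warmup_steps:
        # ASSUMPTION: linear warmup.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + constant_steps:
        return max_lr
    decay_steps = max(1, total - warmup_steps - constant_steps)
    progress = min(1.0, (step - warmup_steps - constant_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step, total_steps, planned_tokens)
print(lr_at(0), lr_at(2_000), lr_at(total_steps))
```

The 20,480-token gap between the target and the planned total comes from rounding the step count down to a whole number of optimizer steps.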

Evaluation Snapshot

Validation metrics are logged to eval.csv in this repo at regular checkpoint intervals.

  • Best observed eval loss: 2.7394 at step 15,000
  • Best observed eval perplexity: 15.4780 at step 15,000
  • Final logged eval loss: 2.8972 at step 24,000
  • Final logged eval perplexity: 18.1233 at step 24,000

These are internal development metrics on the project validation split, not a broad benchmark suite.
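
The loss and perplexity figures above are consistent with perplexity being the exponential of the mean per-token cross-entropy in nats. That relationship is an assumption about how eval.csv is produced, but it can be checked against the reported pairs:

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity from mean per-token cross-entropy, assuming natural log."""
    return math.exp(loss_nats)


# (step, reported loss, reported perplexity) as listed in the card.
reported = [
    (15_000, 2.7394, 15.4780),
    (24_000, 2.8972, 18.1233),
]

for step, loss, ppl in reported:
    print(f"step {step}: exp(loss) = {perplexity(loss):.4f}, reported = {ppl:.4f}")
```

Small differences in the last digits are expected because the logged loss is itself rounded to four decimal places.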

Intended Use

  • Continued pretraining
  • Supervised finetuning or instruction tuning
  • Research and experimentation on medium-scale dense LMs
  • Educational use for studying custom Transformer implementations

Out-of-Scope / Not Recommended

  • Safety-critical or high-stakes decisions (medical, legal, financial)
  • Direct deployment as a reliable assistant without task-specific alignment and evaluation
  • Use cases requiring guaranteed factual accuracy

Limitations

  • Base model behavior: may produce repetitive, off-topic, or hallucinatory outputs
  • Not instruction-tuned
  • English-centric training distribution
  • Context window limited to 1,024 tokens
  • Bias/toxicity risks inherited from web-scale text data

Safety and Risk Notes

Potential harms include generation of incorrect, biased, or unsafe text. Downstream users should add:

  • Domain-specific evaluation
  • Prompt and output safety filtering
  • Human oversight for sensitive workflows
  • Red-teaming before production release

Tokenization

  • Tokenizer backend: tiktoken
  • Encoding: cl100k_base
  • Vocab size: 100,277
  • EOS token: the tokenizer's eot_token (<|endoftext|>)
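
The sketch below shows one way documents could be tokenized, terminated with EOS, and packed into fixed-length sequences, in line with the document boundary treatment listed under Training Data. It is not the code in data.py: the function name pack_documents is hypothetical, dropping the trailing partial sequence is an assumption, and a toy character-level encoder stands in for the real tokenizer so the example runs offline.

```python
from typing import Callable, Iterable, Iterator, List


def pack_documents(docs: Iterable[str],
                   encode: Callable[[str], List[int]],
                   eos_id: int,
                   seq_len: int) -> Iterator[List[int]]:
    """Tokenize each document, append EOS, and yield fixed-length chunks.

    HYPOTHETICAL helper. ASSUMPTION: a trailing partial chunk is dropped.
    """
    buffer: List[int] = []
    for text in docs:
        buffer.extend(encode(text))
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


def toy_encode(text: str) -> List[int]:
    """Toy stand-in tokenizer: one token per character (illustration only)."""
    return [ord(c) for c in text]


TOY_EOS = 0
chunks = list(pack_documents(["ab", "cde", "f"], toy_encode, TOY_EOS, seq_len=4))
print(chunks)  # [[97, 98, 0, 99], [100, 101, 0, 102]]
```

With the real tokenizer, the encoder would come from tiktoken.get_encoding("cl100k_base"), with its encode_ordinary method passed as encode and its eot_token attribute passed as eos_id.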

Reproducibility

Core files relevant to reproducibility:

  • train.py (training loop, checkpointing, metrics)
  • data.py (dataset packing/streaming and deterministic split logic)
  • model.py (architecture)
  • config.py (model/training hyperparameters)

Seed configuration:

  • Python / NumPy / PyTorch seed: 1337

Usage

This repository uses custom model and tokenizer code. Load the model with the project code, or with Hugging Face transformers using trust_remote_code=True if the repository is published with matching auto_map files.
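
If the repository is published with auto_map files, loading through transformers might look like the sketch below. The repository ID is taken from the hosting page and is an assumption about where the weights live. The helper name load_axiom is hypothetical. Note that trust_remote_code=True executes Python code from the repository, so review it first.

```python
def load_axiom(repo_id: str = "user-anto/Axiom-Dense-380M-Base"):
    """Load the model via transformers remote-code support.

    HYPOTHETICAL helper. ASSUMPTION: the repo ships auto_map entries that
    register the custom architecture with AutoModelForCausalLM.
    """
    # Imported lazily so this module can be imported without transformers.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
    model.eval()  # inference mode; dropout is 0.0 in this config anyway
    return model
```

If auto_map files are not present, this call will fail, and the model should instead be constructed from model.py and config.py with the checkpoint weights loaded by the project code.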
