FF_3.13

Champion model of the FF-LLM line: a 2.02B-parameter GPT-2-architecture model trained through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.


Model overview

Architecture     GPT-2 (causal LM)
Parameters       2.02B
Hidden size      2048
Layers           38
Attention heads  16
Vocab size       50,257
Tokenizer        GPT-2 BPE
Context length   1024 tokens
Precision        bfloat16 (also fp16/fp32 compatible)
License          Apache 2.0
Author           francescofiamingo1
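
As a sanity check on the table above, the parameter count can be estimated from the listed hyperparameters. This is a minimal sketch that assumes a standard GPT-2 block layout (4·d² attention weights plus 8·d² MLP weights per layer) with tied input/output embeddings, and it ignores biases and LayerNorm parameters; the model's exact layout is not stated here.

```python
# Rough GPT-2 parameter estimate from the "Model overview" hyperparameters.
# Assumptions (not confirmed by the model card): standard GPT-2 block,
# tied input/output embeddings, biases and LayerNorm terms ignored.
hidden = 2048
layers = 38
vocab = 50_257
context = 1024

block_params = 12 * hidden * hidden            # attention (4 d^2) + MLP (8 d^2)
embedding_params = (vocab + context) * hidden  # token + learned position embeddings
total = layers * block_params + embedding_params

print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
# -> 2,017,626,112 parameters (~2.02B)
```

The estimate lands on the stated 2.02B, which suggests the listed hidden size, depth, and vocabulary are mutually consistent.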

Benchmark performance

MMLU (Massive Multitask Language Understanding)

Evaluated with lm-eval-harness v0.4.11, greedy decoding.

Split                     Score
MMLU full (14,042 items)  28.05%
MMLU dev (285 items)      25.61%

Macro-domain breakdown (MMLU full)

Macro                                Subjects     Accuracy
STEM                                 19 subjects  30.70%
Humanities                           13 subjects  26.06%
Social Sciences                      12 subjects  30.03%
Other (medicine, law, professional)  13 subjects  29.32%

106-bench (custom factual benchmark)

Custom 106-prompt benchmark with strict TRUTH-list scoring:

Category           N    Score
arithmetic         5    5/5 (100.0%)
open-ended         1    1/1 (100.0%)
person             25   22/25 (88.0%)
science            25   21/25 (84.0%)
geography          25   15/25 (60.0%)
format compliance  25   15/25 (60.0%)
TOTAL              106  79/106 (74.5%)
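
The TOTAL row can be reproduced from the per-category rows. The sketch below copies the counts from the table and computes both the pooled (micro) accuracy and the unweighted per-category (macro) mean; the function name `summarize` is illustrative only and is not part of the benchmark's own tooling.

```python
# Per-category results copied from the 106-bench table: (correct, total).
results = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}


def summarize(results):
    """Return (correct, total, micro_accuracy, macro_accuracy)."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    micro = correct / total                                    # pooled over all prompts
    macro = sum(c / n for c, n in results.values()) / len(results)  # mean of category rates
    return correct, total, micro, macro


correct, total, micro, macro = summarize(results)
print(f"micro: {correct}/{total} = {micro:.1%}")
print(f"macro: {macro:.1%}")
```

The reported 74.5% is the pooled figure. The unweighted category mean comes out higher (82.0%) because the two small categories score 100%.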

Improvement vs precursors

Model                          MMLU full  Δ vs FF_3.13
FF_3 (base, original release)  n/a        n/a
FF_3.1 (post-SFT)              26.72%     -1.33pp
FF_3.11 (specialized variant)  25.20%     -2.85pp
FF_3.13 (this model)           28.05%     n/a (champion)

Training pipeline: chronological view

The pipeline comprised 7 stages, two of which (DPO and LoRA) were evaluated and rejected. The complete history follows.

Stage 1: Pretraining

Architecture chosen: GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.

Item                         Value
Hardware                     8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total)
Throughput                   ~220,000 tokens/sec sustained (100% GPU utilization, all 8 GPUs in parallel)
Framework                    PyTorch + DeepSpeed ZeRO-2
Precision                    bfloat16
Total pretraining tokens     ~90 billion tokens
Wall-clock pretraining time  ~5 days continuous (90B / 220K tok/s ≈ 4.7 days)
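
The wall-clock estimate in the last row follows directly from the token budget and throughput. A minimal sketch of that arithmetic, assuming the ~220,000 tokens/sec figure is the aggregate across all 8 GPUs (as the Throughput row suggests) and that throughput stayed constant for the whole run:

```python
# Back-of-the-envelope pretraining time from the table values.
tokens = 90e9            # ~90 billion pretraining tokens
throughput = 220_000     # tokens/sec, assumed aggregate over all 8 GPUs
num_gpus = 8

seconds = tokens / throughput
days = seconds / 86_400
gpu_hours = seconds / 3_600 * num_gpus

print(f"{days:.1f} days, ~{gpu_hours:.0f} GPU-hours")
```

Real runs include checkpointing and restarts, which is presumably why the table rounds up to ~5 days.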

Pretraining data composition

The pretraining corpus was assembled from 8 distinct sources, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.

Dataset               Module           Type                   Quality weight
FineWeb general       M1 BASE          Web                    medium
FineWeb 10BT          M1 Extra25       Web high quality       medium
FineWeb EDU           M2 BASE          Educational            high
FineWeb EDU extended  M2 Extra         Educational reasoning  medium
C4 EN                 M1 C4            Web filtered           medium
Wikipedia EN          M1 BASE          Encyclopedic           low
Web Clean custom      M1 BASE / Extra  Web filtered           low
News crawl            M1 BASE          Journalistic           low

Mix proportions (approximate)

  • 60–65% FineWeb (various slices: general, 10BT, EDU, EDU extended)
  • 15–20% C4 EN
  • 5–10% Wikipedia EN
  • 5–10% Web Clean custom
  • ~5% News crawl
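
To make the proportions above concrete, the sketch below turns the stated ranges into normalized sampling weights (using range midpoints) and draws source labels with them. The source keys, the midpoint choice, and the sampler itself are illustrative assumptions; the actual data loader and its quality-tier weighting are not described in this card.

```python
import random

# Approximate ranges from the list above, in percent (low, high).
# Key names are hypothetical labels, not the author's dataset identifiers.
mix_ranges = {
    "fineweb": (60, 65),
    "c4_en": (15, 20),
    "wikipedia_en": (5, 10),
    "web_clean_custom": (5, 10),
    "news_crawl": (5, 5),
}


def midpoint_weights(ranges):
    """Normalize range midpoints so the weights sum to 1."""
    mids = {name: (lo + hi) / 2 for name, (lo, hi) in ranges.items()}
    total = sum(mids.values())
    return {name: mid / total for name, mid in mids.items()}


def sample_sources(weights, k, seed=0):
    """Draw k source labels in proportion to the given weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


weights = midpoint_weights(mix_ranges)
batch = sample_sources(weights, k=10_000)
for name in weights:
    print(f"{name:18s} target={weights[name]:.3f} observed={batch.count(name) / len(batch):.3f}")
```

The midpoints of the stated ranges happen to sum to exactly 100%, so normalization leaves them unchanged.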

This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.

Stage 2: Supervised Fine-Tuning (SFT)

Objective: general instruction-following + factual knowledge alignment.

Data sources (~860K total examples):

  • OpenHermes (cleaned)
  • UltraChat (cleaned)
  • WildChat (cleaned)
  • Numina (math reasoning)
  • OpenThoughts (chain-of-thought)
  • Eurus (multi-task)

Composition: ~760K core + 100K augmentation examples. Sharded under s3://ff-llm-datasets/sft/shards_v2/.

Stage 3: Direct Preference Optimization (DPO), REJECTED

Two DPO experiments were attempted and discarded:

DPO variant                                          Pairs   Result
v1: WizardLM/Alpaca preferences                      38,863  -3pp MMLU → rejected
v2: UltraFeedback (argilla/ultrafeedback-binarized)  60,917  -3pp MMLU → rejected

Lesson: both DPO runs regressed MMLU by about 3pp regardless of hyperparameters, so DPO is not used in the final model.

Stage 4: Distillation v3

Knowledge distillation from larger teacher models on a curated question set.

Item             Value
Total questions  108,779
Source mix       hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%)
S3 path          s3://ff-llm-datasets/distill_v3/

Stage 5: LoRA experiments, REJECTED

Several LoRA fine-tuning runs were attempted for surgical improvements:

LoRA experiment                   Examples     Result
LoRA v4b (synthetic instruction)  6,000        marginal, not promoted
LoRA format-only v1/v3            1,779–2,092  catastrophic forgetting (-3 to -4pp MMLU full)

Lesson: LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. Not used in the final model.

Stage 6: Surgical Fine-Tuning

Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).

Item      Value
Examples  3,000
Path      D:\ff_llm\ff31_surgical.jsonl

Stage 7: Knowledge Repair Training (produces FF_3.13)

The decisive stage that turned FF_3.11 into FF_3.13.

Dataset composition (16,006 total):

Block           Description                                                            Examples
Block A         MMLU-style MCQ (multiple choice questions across diverse subjects)     10,714
Block B         Factual concise (TruthfulQA-like, <100 char answers)                   929
Block D         Numeric microreasoning (arithmetic word problems with step solutions)  3,562
Validation set  held-out for monitoring                                                801

Training configuration:

Item                  Value
Hardware              8× NVIDIA RTX 5090
Framework             DeepSpeed ZeRO-2
Precision             bfloat16
Optimizer             AdamW
Learning rate         2.5e-6 (cosine schedule)
Epochs                3 (early-stopped at step 200/357)
Effective batch size  (configured for 8-GPU DDP)
Wall-clock            ~30 min total
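
For intuition about what learning rate each checkpoint in this run saw, here is a sketch of a plain cosine decay using the values from the table. It assumes the schedule spans all 357 planned steps, has no warmup, and decays to zero; none of these details are stated in the configuration, and `cosine_lr` is a hypothetical helper rather than the training script's own scheduler.

```python
import math


def cosine_lr(step, total_steps=357, peak_lr=2.5e-6, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps.

    Assumptions (not from the source): no warmup, min_lr = 0,
    schedule horizon equal to the 357 planned steps.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 150, 200, 357):
    print(f"step {step:3d}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the run was stopped (step 200) while the learning rate was still well above zero, since the decay is only a little past its midpoint at that step.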

Checkpoint sweep & selection:

Checkpoint        MMLU full  Status
1-epoch ckpt-100  27.47%     not selected
3-epoch ckpt-50   27.21%     not selected
3-epoch ckpt-100  27.86%     not selected
3-epoch ckpt-150  28.05%     CHAMPION → FF_3.13 ✅
3-epoch ckpt-200  28.17%     rejected (+0.12pp marginal, regressed 5/6 weak domains)
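
The sweep above picks the best MMLU score subject to a veto for checkpoints that regress on weak domains. One way to express that rule is sketched below. The `Checkpoint` class, the veto threshold `MAX_WEAK_REGRESSIONS`, and the treatment of unreported regression counts as "no veto" are all assumptions made for illustration; the card only states the outcome, not the exact criterion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    name: str
    mmlu_full: float                               # percent, from the sweep table
    weak_domains_regressed: Optional[int] = None   # out of 6; None = not reported


# Values copied from the sweep table; only ckpt-200 has a reported regression count.
SWEEP = [
    Checkpoint("1-epoch ckpt-100", 27.47),
    Checkpoint("3-epoch ckpt-50", 27.21),
    Checkpoint("3-epoch ckpt-100", 27.86),
    Checkpoint("3-epoch ckpt-150", 28.05),
    Checkpoint("3-epoch ckpt-200", 28.17, weak_domains_regressed=5),
]

MAX_WEAK_REGRESSIONS = 2  # hypothetical veto threshold, not from the source


def select_champion(sweep, max_regressions=MAX_WEAK_REGRESSIONS):
    """Highest MMLU among checkpoints that pass the weak-domain veto."""
    eligible = [
        c for c in sweep
        if c.weak_domains_regressed is None
        or c.weak_domains_regressed <= max_regressions
    ]
    return max(eligible, key=lambda c: c.mmlu_full)


champion = select_champion(SWEEP)
print(champion.name, champion.mmlu_full)
```

With the veto in place the rule selects ckpt-150. Without it (for example `max_regressions=6`), ckpt-200 would win on raw MMLU by 0.12pp.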

Compute infrastructure summary

Resource                  Specification
GPU                       8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total)
GPU utilization           ~100% sustained during training
Throughput (pretraining)  ~220,000 tokens/sec
Distributed training      DeepSpeed ZeRO-2
Numerical precision       bfloat16 (training and inference)
Cloud provider            Vast.ai

Recommended usage

Prompt template (Alpaca-style)

### System:
You are FF-LLM, a helpful assistant.

### Instruction:
<your question>

### Response:

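A small helper can keep the template consistent across calls. The sketch below builds the prompt shown above and strips the echoed prompt from generated text; `build_prompt` and `extract_response` are illustrative names, not functions shipped with the model.

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """Format a question with the Alpaca-style template from this card."""
    return (
        f"### System:\n{system}\n\n"
        f"### Instruction:\n{instruction.strip()}\n\n"
        f"{RESPONSE_MARKER}\n"
    )


def extract_response(generated_text: str) -> str:
    """Return the text after the last response marker (whole text if absent)."""
    _, _, tail = generated_text.rpartition(RESPONSE_MARKER)
    return tail.strip()


prompt = build_prompt("What is the capital of France?")
print(prompt)
print(extract_response(prompt + "Paris"))
```

Because decoder-only generation returns the prompt followed by the continuation, `extract_response` is a convenient way to recover just the answer.
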
Decoding settings

  • Always use greedy decoding (do_sample=False).
  • Sampling has been shown to degrade factual accuracy by ~5pp on this model family.

Quick start (transformers)

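from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "francescofiamingo1/FF_3.13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in bfloat16 on the GPU.
# (Older transformers releases spell this argument torch_dtype.)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Alpaca-style prompt; the model continues the text after "### Response:".
prompt = """### System:
You are FF-LLM, a helpful assistant.

### Instruction:
What is the capital of France?

### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Greedy decoding, per the decoding settings above.
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))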

Limitations and known weaknesses

  • 2B parameters: knowledge ceiling lower than 7B+ models
  • Format compliance moderate: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
  • Entity disambiguation weakness: occasional "anchor entity" over-attribution (e.g., defaulting to Edison for inventor questions)
  • Weak domains (per qualitative analysis): mathematics, literature, music, art
  • Strong domains: biology, geography, basic science, factual short-form QA

Variant lineage

Variant                           Status            Notes
FF_3                              base              initial release
FF_3.1                            published         post-SFT, MMLU 26.72%
FF_3.2                            discontinued      early experiment, not maintained
FF_3.11                           published         specialized variant, MMLU 25.20%, 106-bench 71%
FF_3.13                           current champion  knowledge repair on FF_3.11 base, MMLU 28.05%
FF_3.14                           rejected          full SFT with humanities focus, MMLU flat (no improvement)
SLERP t=0.10 (FF_3.13 + FF_3.11)  candidate backup  MMLU 29.10% (+0.41pp), 106-bench tie
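
The last row refers to SLERP (spherical linear interpolation) between two checkpoints. A minimal pure-Python sketch of the interpolation itself is given below, applied to toy vectors. How the merge was actually performed (per tensor or over flattened weights, which tool, and whether t=0 corresponds to FF_3.13 or FF_3.11) is not stated in this card, so those choices are assumptions here.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two equal-length, non-zero vectors.

    t=0 returns a, t=1 returns b. Falls back to linear interpolation
    when the vectors are nearly parallel.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)           # angle between the two vectors
    if omega < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Toy stand-ins for two flattened weight tensors (hypothetical values).
w_ff313 = [1.0, 0.0]
w_ff311 = [0.0, 1.0]
merged = slerp(w_ff313, w_ff311, t=0.10)   # assumes t=0 means "pure FF_3.13"
print(merged)
```

At t=0.10 the result stays close to the first endpoint while following the arc between the two vectors rather than the straight line, which preserves the norm when both inputs have equal norm.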

Reproducibility

All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:

  • Datasets: s3://ff-llm-datasets/
  • Champion model: s3://ff-llm-datasets/champions/latest/
  • Build scripts: s3://ff-llm-datasets/ff314/build/ (includes Block E/F builders, anti-anchoring tables, philosophy seeds)

For reproduction support, contact the author.


Citation

@misc{ff_3_13_2026,
  author       = {francescofiamingo1},
  title        = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
}

Last updated: 2026-04-18
