FF_3.13

FF_3.13 is the current champion model of the FF-LLM line: a 2.02B-parameter GPT-2-architecture model trained through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.

Model overview


Item	Value
Architecture	GPT-2 (causal LM)
Parameters	2.02B
Hidden size	2048
Layers	38
Attention heads	16
Vocab size	50,257
Tokenizer	GPT-2 BPE
Context length	1024 tokens
Precision	bfloat16 (also fp16/fp32 compatible)
License	Apache 2.0
Author	francescofiamingo1
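
The 2.02B figure can be cross-checked against the other values in the table. The sketch below counts parameters for a standard GPT-2 layout, assuming a 4x MLP expansion, learned position embeddings, biases on every linear layer, and an output head tied to the token embedding. None of these details are stated in this card, so the result is an estimate rather than the exact count of the released checkpoint.

```python
def gpt2_param_count(d_model: int, n_layers: int, vocab_size: int, n_positions: int) -> int:
    """Estimate the parameter count of a GPT-2 style decoder (assumed layout)."""
    # Attention: fused QKV projection plus output projection, each with a bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection to 4 * d_model and down-projection back, each with a bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d_model)
    per_block = attn + mlp + layer_norms
    # Token and position embeddings; the output head is assumed to be tied.
    embeddings = vocab_size * d_model + n_positions * d_model
    final_norm = 2 * d_model
    return n_layers * per_block + embeddings + final_norm


total = gpt2_param_count(d_model=2048, n_layers=38, vocab_size=50257, n_positions=1024)
head_dim = 2048 // 16  # hidden size divided by attention heads
print(f"{total:,} parameters (~{total / 1e9:.2f}B), head dim {head_dim}")
# prints: 2,018,641,920 parameters (~2.02B), head dim 128
```

Under these assumptions the table values are mutually consistent with the stated 2.02B parameters.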

Benchmark performance

MMLU (Massive Multitask Language Understanding)

Evaluated with lm-evaluation-harness (lm-eval) v0.4.11 using greedy decoding.

Split	Score
MMLU full (14,042 items)	28.05%
MMLU dev (285 items)	25.61%

Macro-domain breakdown (MMLU full)

Macro	Subjects	Accuracy
STEM	19 subjects	30.70%
Humanities	13 subjects	26.06%
Social Sciences	12 subjects	30.03%
Other (medicine, law, professional)	13 subjects	29.32%

106-bench (custom factual benchmark)

Custom 106-prompt benchmark with strict TRUTH-list scoring:

Category	N	Score
arithmetic	5	5/5 (100.0%)
open-ended	1	1/1 (100.0%)
person	25	22/25 (88.0%)
science	25	21/25 (84.0%)
geography	25	15/25 (60.0%)
format compliance	25	15/25 (60.0%)
TOTAL	106	79/106 (74.5%)
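
The TOTAL row is a per-prompt (micro) average, so the four 25-prompt categories dominate it. As a quick illustration, the snippet below recomputes the total from the table and also shows the unweighted mean over categories, which is not a number reported by the author and is included only for comparison.

```python
# (correct, total) per category, copied from the table above.
results = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}

correct = sum(c for c, _ in results.values())
total = sum(n for _, n in results.values())

# Micro average: every prompt counts equally (this is the TOTAL row).
micro = 100 * correct / total
# Macro average: every category counts equally, regardless of its size.
macro = sum(100 * c / n for c, n in results.values()) / len(results)

print(f"{correct}/{total} = {micro:.1f}% (micro), {macro:.1f}% (macro)")
# prints: 79/106 = 74.5% (micro), 82.0% (macro)
```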

Improvement vs precursors

Model	MMLU full	Δ vs FF_3.13
FF_3 (base, original release)	—	—
FF_3.1 (post-SFT)	26.72%	-1.33pp
FF_3.11 (specialized variant)	25.20%	-2.85pp
FF_3.13 (this model)	28.05%	— champion

Training pipeline — chronological view

Development went through 7 stages, listed below in chronological order. Two of them (Stage 3, DPO, and Stage 5, LoRA) were rejected and did not contribute to the final model.

Stage 1 — Pretraining

Architecture chosen: GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.

Item	Value
Hardware	8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total)
Throughput	~220,000 tokens/sec sustained (100% GPU utilization, all 8 GPUs in parallel)
Framework	PyTorch + DeepSpeed ZeRO-2
Precision	bfloat16
Total pretraining tokens	~90 billion tokens
Wall-clock pretraining time	~5 days continuous (90B / 220K tok/s ≈ 4.7 days)
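
The wall-clock estimate follows directly from the token count and throughput in the table. A minimal check, assuming throughput stays constant and ignoring restarts, checkpointing, and evaluation pauses:

```python
total_tokens = 90e9        # ~90 billion pretraining tokens
tokens_per_sec = 220_000   # sustained throughput across all GPUs
num_gpus = 8

seconds = total_tokens / tokens_per_sec
days = seconds / 86_400    # seconds per day
per_gpu = tokens_per_sec / num_gpus

print(f"{days:.1f} days, {per_gpu:,.0f} tokens/sec per GPU")
# prints: 4.7 days, 27,500 tokens/sec per GPU
```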

Pretraining data composition

The pretraining corpus was assembled from 8 distinct sources, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.

Dataset	Module	Type	Quality weight
FineWeb general	M1 BASE	Web	medium
FineWeb 10BT	M1 Extra25	Web high quality	medium
FineWeb EDU	M2 BASE	Educational	high
FineWeb EDU extended	M2 Extra	Educational reasoning	medium
C4 EN	M1 C4	Web filtered	medium
Wikipedia EN	M1 BASE	Encyclopedic	low
Web Clean custom	M1 BASE / Extra	Web filtered	low
News crawl	M1 BASE	Journalistic	low

Mix proportions (approximate)

60–65% FineWeb (various slices: general, 10BT, EDU, EDU extended)
15–20% C4 EN
5–10% Wikipedia EN
5–10% Web Clean custom
~5% News crawl

This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.
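
To make the proportions concrete, the sketch below turns the approximate ranges into per-source token budgets. It assumes the percentages are shares of tokens (the card does not say whether they refer to tokens or documents) and uses the midpoint of each range, which is an illustrative choice and not the author's actual sampling configuration.

```python
TOTAL_TOKENS = 90e9  # total pretraining tokens from Stage 1

# Approximate (low, high) percentage ranges from the list above.
ranges = {
    "FineWeb (all slices)": (60, 65),
    "C4 EN": (15, 20),
    "Wikipedia EN": (5, 10),
    "Web Clean custom": (5, 10),
    "News crawl": (5, 5),
}

# Assumption: take the midpoint of each range, then normalize to sum to 1.
midpoints = {name: (lo + hi) / 2 for name, (lo, hi) in ranges.items()}
norm = sum(midpoints.values())
shares = {name: m / norm for name, m in midpoints.items()}
budgets = {name: share * TOTAL_TOKENS for name, share in shares.items()}

for name, tokens in budgets.items():
    print(f"{name:<22} {shares[name]:6.1%}  ~{tokens / 1e9:.2f}B tokens")
```

With these midpoints the shares already sum to 100%, giving roughly 56B tokens of FineWeb and about 4.5B tokens of news text.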

Stage 2 — Supervised Fine-Tuning (SFT)

Objective: general instruction-following + factual knowledge alignment.

Data sources (~860K total examples):

OpenHermes (cleaned)
UltraChat (cleaned)
WildChat (cleaned)
Numina (math reasoning)
OpenThoughts (chain-of-thought)
Eurus (multi-task)

Composition: ~760K core examples plus ~100K augmentation examples (~860K total). The data is sharded under s3://ff-llm-datasets/sft/shards_v2/.

Stage 3 — Direct Preference Optimization (DPO) — REJECTED

Two DPO experiments were attempted and discarded:

DPO variant	Pairs	Result
v1 — WizardLM/Alpaca preferences	38,863	-3pp MMLU → rejected
v2 — UltraFeedback (argilla/ultrafeedback-binarized)	60,917	-3pp MMLU → rejected

Lesson: both DPO runs caused an MMLU regression of about 3pp, regardless of hyperparameters, so DPO is not used in the final model.

Stage 4 — Distillation v3

Knowledge distillation from larger teacher models on a curated question set.

Item	Value
Total questions	108,779
Source mix	hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%)
S3 path	`s3://ff-llm-datasets/distill_v3/`

Stage 5 — LoRA experiments — REJECTED

Several LoRA fine-tuning runs were attempted for targeted improvements:

LoRA experiment	Examples	Result
LoRA v4b (synthetic instruction)	6,000	marginal, not promoted
LoRA format-only v1/v3	1,779–2,092	catastrophic forgetting (-3 to -4pp MMLU full)

Lesson: LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. Not used in the final model.

Stage 6 — Surgical Fine-Tuning

Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).

Item	Value
Examples	3,000
Path	`D:\ff_llm\ff31_surgical.jsonl`
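
The four output disciplines can be expressed as simple string checks. The functions below are hypothetical helpers written for illustration; they are not the author's data builder or benchmark scorer, and the exact acceptance rules (for example, whether a trailing period is allowed) are assumptions.

```python
import re


def is_yes_no(text: str) -> bool:
    """Accept a bare 'yes' or 'no', optionally followed by a period."""
    return text.strip().lower().rstrip(".") in {"yes", "no"}


def is_single_letter(text: str, choices: str = "ABCD") -> bool:
    """Accept exactly one option letter and nothing else."""
    return re.fullmatch(rf"[{choices}]", text.strip()) is not None


def is_numeric_only(text: str) -> bool:
    """Accept an integer or decimal number with no units or words."""
    return re.fullmatch(r"-?\d+(\.\d+)?", text.strip()) is not None


def is_exact_n_list(text: str, n: int) -> bool:
    """Accept a list with exactly n non-empty lines."""
    items = [line for line in text.strip().splitlines() if line.strip()]
    return len(items) == n


print(is_yes_no("Yes."), is_single_letter("B) Paris"), is_numeric_only("42"))
# prints: True False True
```

Checks of this kind make it possible to separate "wrong answer" from "right answer in the wrong format" when reviewing model outputs.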

Stage 7 — Knowledge Repair Training (produces FF_3.13)

This is the decisive stage: it turned FF_3.11 into FF_3.13.

Dataset composition (16,006 total):

Block	Description	Examples
Block A	MMLU-style MCQ (multiple choice questions across diverse subjects)	10,714
Block B	Factual concise (TruthfulQA-like, <100 char answers)	929
Block D	Numeric microreasoning (arithmetic word problems with step solutions)	3,562
Validation set	held-out for monitoring	801

Training configuration:

Item	Value
Hardware	8× NVIDIA RTX 5090
Framework	DeepSpeed ZeRO-2
Precision	bfloat16
Optimizer	AdamW
Learning rate	2.5e-6 (cosine schedule)
Epochs	3 (early-stopped at step 200/357)
Effective batch size	(configured for 8-GPU DDP)
Wall-clock	~30 min total
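
The dataset and step counts above can be checked against each other. The sketch below assumes the 801 validation examples are excluded from training, that each example is one sequence (no packing), and that a final partial batch counts as a step. The card does not state the effective batch size, so the implied value is only a consistency check under these assumptions, not a reported setting.

```python
import math

blocks = {"Block A": 10_714, "Block B": 929, "Block D": 3_562, "Validation": 801}
total_examples = sum(blocks.values())
train_examples = total_examples - blocks["Validation"]

total_steps, epochs = 357, 3
steps_per_epoch = total_steps // epochs

# Examples consumed per optimizer step if the assumptions above hold.
implied_batch = train_examples / steps_per_epoch


def epochs_at(step: int) -> float:
    """Fraction of the 3-epoch schedule completed at a given step, in epochs."""
    return step / steps_per_epoch


print(f"{total_examples} total, {train_examples} train, {steps_per_epoch} steps/epoch")
print(f"implied examples per step: {implied_batch:.1f}")
print(f"ckpt-150 at {epochs_at(150):.2f} epochs, stop at {epochs_at(200):.2f} epochs")
```

The implied value of about 128 examples per step reproduces 119 steps per epoch, and the selected ckpt-150 sits a little past the first epoch.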

Checkpoint sweep & selection:

Checkpoint	MMLU full	Status
1-epoch ckpt-100	27.47%	not selected
3-epoch ckpt-50	27.21%	not selected
3-epoch ckpt-100	27.86%	not selected
3-epoch ckpt-150	28.05%	CHAMPION → FF_3.13 ✅
3-epoch ckpt-200	28.17%	rejected (only +0.12pp over ckpt-150, with regressions in 5 of 6 weak domains)

Compute infrastructure summary

Resource	Specification
GPU	8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total)
GPU utilization	~100% sustained during training
Throughput (pretraining)	~220,000 tokens/sec
Distributed training	DeepSpeed ZeRO-2
Numerical precision	bfloat16 (training and inference)
Cloud provider	Vast.ai

Recommended usage

Prompt template (Alpaca-style)

### System:
You are FF-LLM, a helpful assistant.

### Instruction:
<your question>

### Response:
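
A small helper keeps the template consistent across calls. `build_prompt` and `DEFAULT_SYSTEM` are hypothetical names introduced here for illustration; the layout, including the newline after `### Response:`, follows the quick-start example in this card.

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."


def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """Fill the Alpaca-style template; the model continues after '### Response:'."""
    return (
        f"### System:\n{system}\n\n"
        f"### Instruction:\n{instruction.strip()}\n\n"
        "### Response:\n"
    )


print(build_prompt("What is the capital of France?"))
```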

Decoding settings

Always use greedy decoding (do_sample=False).
Sampling has been shown to degrade factual accuracy by ~5pp on this model family.

Quick start (transformers)

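from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "francescofiamingo1/FF_3.13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in bfloat16 on the GPU (device_map requires the accelerate package).
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Alpaca-style prompt; generation continues after "### Response:".
prompt = """### System:
You are FF-LLM, a helpful assistant.

### Instruction:
What is the capital of France?

### Response:
"""

# Tokenize, then generate greedily (do_sample=False), as recommended above.
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
# The decoded text contains the prompt followed by the model's answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))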
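
Because the decoded output includes the prompt, downstream code usually wants only the answer. The helper below is a hypothetical post-processing sketch: it keeps the text after the first `### Response:` marker and, on the assumption that the model may occasionally continue with another templated turn, cuts at the next `### ` header.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str) -> str:
    """Return only the model's answer from a decoded prompt-plus-completion string."""
    _, found, tail = decoded.partition(RESPONSE_MARKER)
    if not found:
        # No marker present: return the text unchanged apart from trimming.
        return decoded.strip()
    # Stop at the next template header in case generation runs into a new turn.
    answer = tail.split("\n### ", 1)[0]
    return answer.strip()


decoded = (
    "### System:\nYou are FF-LLM, a helpful assistant.\n\n"
    "### Instruction:\nWhat is the capital of France?\n\n"
    "### Response:\nParis\n\n### Instruction:\nAnother question"
)
print(extract_response(decoded))
# prints: Paris
```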

Limitations and known weaknesses

2B parameters — knowledge ceiling lower than 7B+ models
Format compliance moderate: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
Entity disambiguation weakness: occasional "anchor entity" over-attribution (e.g., defaulting to Edison for inventor questions)
Weak domains (per qualitative analysis): mathematics, literature, music, art
Strong domains: biology, geography, basic science, factual short-form QA

Variant lineage

Variant	Status	Notes
FF_3	base	initial release
FF_3.1	published	post-SFT, MMLU 26.72%
FF_3.2	discontinued	early experiment, not maintained
FF_3.11	published	specialized variant, MMLU 25.20%, 106-bench 71%
FF_3.13	current champion	knowledge repair on FF_3.11 base, MMLU 28.05%
FF_3.14	rejected	full SFT with humanities focus, MMLU flat (no improvement)
SLERP t=0.10 (FF_3.13 + FF_3.11)	candidate backup	MMLU 29.10% (+0.41pp), 106-bench tie
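
For readers unfamiliar with the SLERP entry, the sketch below shows spherical linear interpolation on flat lists of weights. It is a generic illustration, not the author's merge script: in practice a merge is applied tensor by tensor, and the convention that t=0 returns the first model (here assumed to be FF_3.13) and t=1 the second is an assumption.

```python
import math


def slerp(a: list[float], b: list[float], t: float, eps: float = 1e-8) -> list[float]:
    """Spherical linear interpolation between two weight vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Angle is undefined for a zero vector: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy example with two orthogonal vectors standing in for two checkpoints.
merged = slerp([1.0, 0.0], [0.0, 1.0], t=0.10)
print([round(v, 4) for v in merged])
```

With a small t such as 0.10 the merged weights stay close to the first model while rotating slightly toward the second.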

Reproducibility

All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:

Datasets: s3://ff-llm-datasets/
Champion model: s3://ff-llm-datasets/champions/latest/
Build scripts: s3://ff-llm-datasets/ff314/build/ (includes Block E/F builders, anti-anchoring tables, philosophy seeds)

For reproduction support, contact the author.

Citation

@misc{ff_3_13_2026,
  author       = {francescofiamingo1},
  title        = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
}

Last updated: 2026-04-18
