FF_3.13
Champion model of the FF-LLM line. A 2.02B GPT-2 architecture model fine-tuned through a multi-stage pipeline (pretraining → SFT → distillation → surgical fine-tuning → knowledge repair) for general-purpose factual question answering.
Model overview
| Item | Value |
|---|---|
| Architecture | GPT-2 (causal LM) |
| Parameters | 2.02B |
| Hidden size | 2048 |
| Layers | 38 |
| Attention heads | 16 |
| Vocab size | 50,257 |
| Tokenizer | GPT-2 BPE |
| Context length | 1024 tokens |
| Precision | bfloat16 (also fp16/fp32 compatible) |
| License | Apache 2.0 |
| Author | francescofiamingo1 |
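As a sanity check on the figures above, the parameter count can be re-derived from the hidden size, layer count, vocabulary size, and context length. This is a minimal sketch that assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, biases everywhere, tied input/output embeddings); the function name `gpt2_param_count` is illustrative and not part of any library.

```python
def gpt2_param_count(n_layer: int, d_model: int, vocab_size: int, n_ctx: int) -> int:
    """Count parameters of a standard GPT-2 stack with tied input/output embeddings."""
    # Attention: fused QKV projection (d x 3d + 3d) plus output projection (d x d + d).
    attention = d_model * 3 * d_model + 3 * d_model + d_model * d_model + d_model
    # MLP: up-projection to 4d and down-projection back to d, both with biases.
    mlp = d_model * 4 * d_model + 4 * d_model + 4 * d_model * d_model + d_model
    # Two LayerNorms per block, each with a weight vector and a bias vector.
    layer_norms = 2 * 2 * d_model
    per_block = attention + mlp + layer_norms
    # Token embedding table plus learned position embedding table.
    embeddings = vocab_size * d_model + n_ctx * d_model
    final_layer_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_layer_norm


total = gpt2_param_count(n_layer=38, d_model=2048, vocab_size=50257, n_ctx=1024)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")  # 2,018,641,920 parameters (~2.02B)
```

Under these assumptions the table values reproduce the stated 2.02B figure.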
Benchmark performance
MMLU (Massive Multitask Language Understanding)
Evaluated with lm-eval-harness v0.4.11, greedy decoding.
| Split | Score |
|---|---|
| MMLU full (14,042 items) | 28.05% |
| MMLU dev (285 items) | 25.61% |
Macro-domain breakdown (MMLU full)
| Macro | Subjects | Accuracy |
|---|---|---|
| STEM | 19 subjects | 30.70% |
| Humanities | 13 subjects | 26.06% |
| Social Sciences | 12 subjects | 30.03% |
| Other (medicine, law, professional) | 13 subjects | 29.32% |
106-bench (custom factual benchmark)
Custom 106-prompt benchmark with strict TRUTH-list scoring:
| Category | N | Score |
|---|---|---|
| arithmetic | 5 | 5/5 (100.0%) |
| open-ended | 1 | 1/1 (100.0%) |
| person | 25 | 22/25 (88.0%) |
| science | 25 | 21/25 (84.0%) |
| geography | 25 | 15/25 (60.0%) |
| format compliance | 25 | 15/25 (60.0%) |
| TOTAL | 106 | 79/106 (74.5%) |
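The TOTAL row is a pooled (micro) average over all 106 prompts, not the mean of the per-category percentages. The short sketch below recomputes both from the table; the dictionary layout is only an illustration and is not the benchmark's actual scoring code.

```python
# (correct, total) per category, copied from the table above.
results = {
    "arithmetic": (5, 5),
    "open-ended": (1, 1),
    "person": (22, 25),
    "science": (21, 25),
    "geography": (15, 25),
    "format compliance": (15, 25),
}

correct = sum(c for c, _ in results.values())
total = sum(n for _, n in results.values())

# Micro average: pool every prompt, which is what the TOTAL row reports.
micro = 100 * correct / total
# Macro average: mean of per-category accuracies, shown only for contrast.
macro = sum(100 * c / n for c, n in results.values()) / len(results)

print(f"micro: {correct}/{total} ({micro:.1f}%)")  # micro: 79/106 (74.5%)
print(f"macro: {macro:.1f}%")                      # macro: 82.0%
```

The gap between the two comes from the small arithmetic and open-ended categories, which count as much as the 25-prompt categories in the macro average.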
Improvement vs precursors
| Model | MMLU full | Δ vs FF_3.13 |
|---|---|---|
| FF_3 (base, original release) | - | - |
| FF_3.1 (post-SFT) | 26.72% | -1.33pp |
| FF_3.11 (specialized variant) | 25.20% | -2.85pp |
| FF_3.13 (this model) | 28.05% | - (champion) |
Training pipeline - chronological view
The model went through 7 distinct training stages; two of them (Stage 3, DPO, and Stage 5, LoRA) were rejected and are not part of the final model. The complete history follows.
Stage 1 - Pretraining
Architecture chosen: GPT-2 (2.02B parameters), trained from scratch on a curated multi-source web + encyclopedic + educational corpus.
| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 (24 GB each = 192 GB VRAM total) |
| Throughput | ~220,000 tokens/sec sustained (100% GPU utilization, all 8 GPUs in parallel) |
| Framework | PyTorch + DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| Total pretraining tokens | ~90 billion tokens |
| Wall-clock pretraining time | ~5 days continuous (90B / 220K tok/s ≈ 4.7 days) |
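The wall-clock estimate in the last row follows directly from the token budget and the sustained throughput. A small worked version of that arithmetic, assuming the throughput stays constant for the whole run:

```python
tokens = 90e9             # ~90 billion pretraining tokens
tokens_per_sec = 220_000  # sustained throughput across all 8 GPUs
num_gpus = 8

seconds = tokens / tokens_per_sec
days = seconds / 86_400            # 86,400 seconds per day
per_gpu = tokens_per_sec / num_gpus

print(f"{days:.1f} days")                # 4.7 days
print(f"{per_gpu:,.0f} tokens/sec/GPU")  # 27,500 tokens/sec/GPU
```

The rounded "~5 days" figure leaves some slack over the 4.7-day ideal.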
Pretraining data composition
The pretraining corpus was assembled from 8 distinct sources, organized in two training modules (M1 BASE / Extra and M2 BASE / Extra) with quality-tiered weighting.
| Dataset | Module | Type | Quality weight |
|---|---|---|---|
| FineWeb general | M1 BASE | Web | medium |
| FineWeb 10BT | M1 Extra25 | Web high quality | medium |
| FineWeb EDU | M2 BASE | Educational | high |
| FineWeb EDU extended | M2 Extra | Educational reasoning | medium |
| C4 EN | M1 C4 | Web filtered | medium |
| Wikipedia EN | M1 BASE | Encyclopedic | low |
| Web Clean custom | M1 BASE / Extra | Web filtered | low |
| News crawl | M1 BASE | Journalistic | low |
Mix proportions (approximate)
- 60–65% FineWeb (various slices: general, 10BT, EDU, EDU extended)
- 15–20% C4 EN
- 5–10% Wikipedia EN
- 5–10% Web Clean custom
- ~5% News crawl
This mix prioritizes educational content (FineWeb EDU = high weight) and high-quality web text, with encyclopedic and journalistic sources providing factual grounding.
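One way to turn the approximate proportions above into sampling weights is to take the midpoint of each range and normalize. This is only a sketch: the midpoints, the `sample_sources` helper, and the per-example sampling strategy are assumptions for illustration, since the exact proportions and the actual data loader are not described here.

```python
import random

# (low, high) percentage ranges from the list above; "~5%" is taken as exactly 5.
mix_ranges = {
    "FineWeb (all slices)": (60, 65),
    "C4 EN": (15, 20),
    "Wikipedia EN": (5, 10),
    "Web Clean custom": (5, 10),
    "News crawl": (5, 5),
}

# Assumption: use the midpoint of each range, then normalize so weights sum to 1.
midpoints = {name: (low + high) / 2 for name, (low, high) in mix_ranges.items()}
norm = sum(midpoints.values())
weights = {name: value / norm for name, value in midpoints.items()}


def sample_sources(n: int, seed: int = 0) -> list:
    """Draw n source names according to the normalized mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)


for name, weight in weights.items():
    print(f"{name:22s} {weight:.3f}")
```

The midpoints happen to sum to exactly 100, so the normalization step leaves them unchanged; it is kept so the code still works if the ranges are adjusted.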
Stage 2 - Supervised Fine-Tuning (SFT)
Objective: general instruction-following + factual knowledge alignment.
Data sources (~860K total examples):
- OpenHermes (cleaned)
- UltraChat (cleaned)
- WildChat (cleaned)
- Numina (math reasoning)
- OpenThoughts (chain-of-thought)
- Eurus (multi-task)
Composition: ~760K core + 100K augmentation examples. Sharded under s3://ff-llm-datasets/sft/shards_v2/.
Stage 3 - Direct Preference Optimization (DPO) - REJECTED
Two DPO experiments were attempted and discarded:
| DPO variant | Pairs | Result |
|---|---|---|
| v1 - WizardLM/Alpaca preferences | 38,863 | -3pp MMLU → rejected |
| v2 - UltraFeedback (argilla/ultrafeedback-binarized) | 60,917 | -3pp MMLU → rejected |
Lesson: DPO consistently caused MMLU regression (~-3pp) regardless of hyperparameters. Not used in the final model.
Stage 4 - Distillation v3
Knowledge distillation from larger teacher models on a curated question set.
| Item | Value |
|---|---|
| Total questions | 108,779 |
| Source mix | hellaswag (37%), openhermes (28%), mmlu (14%), math (12%), gsm8k (7%), arc (2%), truthfulqa (1%) |
| S3 path | s3://ff-llm-datasets/distill_v3/ |
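The source-mix percentages can be converted into rough per-source question counts. Note that the rounded shares add up to 101%, so the counts below are approximations and do not sum exactly to 108,779; the variable names are illustrative.

```python
total_questions = 108_779

# Rounded shares (percent) from the "Source mix" row above.
source_share_pct = {
    "hellaswag": 37,
    "openhermes": 28,
    "mmlu": 14,
    "math": 12,
    "gsm8k": 7,
    "arc": 2,
    "truthfulqa": 1,
}

share_sum = sum(source_share_pct.values())  # 101, because each share is rounded

# Approximate count per source; exact counts are not given in the table.
approx_counts = {
    name: round(total_questions * pct / 100)
    for name, pct in source_share_pct.items()
}

for name, count in approx_counts.items():
    print(f"{name:12s} ~{count:,}")
print(f"shares sum to {share_sum}%")
```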
Stage 5 - LoRA experiments - REJECTED
Multiple LoRA fine-tuning attempts were tried for surgical improvements:
| LoRA experiment | Examples | Result |
|---|---|---|
| LoRA v4b (synthetic instruction) | 6,000 | marginal, not promoted |
| LoRA format-only v1/v3 | 1,779–2,092 | catastrophic forgetting (-3 to -4pp MMLU full) |
Lesson: LoRA at LR ≥ 5e-4 with template-structured data caused the model to overfit to template patterns rather than learn generalizable behavior. Not used in the final model.
Stage 6 - Surgical Fine-Tuning
Targeted fine-tuning on a small curated set focused on output discipline (yes/no answers, single-letter MCQ, exact-N lists, numeric-only).
| Item | Value |
|---|---|
| Examples | 3,000 |
| Path | D:\ff_llm\ff31_surgical.jsonl |
Stage 7 - Knowledge Repair Training (produces FF_3.13)
The decisive stage that turned FF_3.11 into FF_3.13.
Dataset composition (16,006 total):
| Block | Description | Examples |
|---|---|---|
| Block A | MMLU-style MCQ (multiple choice questions across diverse subjects) | 10,714 |
| Block B | Factual concise (TruthfulQA-like, <100 char answers) | 929 |
| Block D | Numeric microreasoning (arithmetic word problems with step solutions) | 3,562 |
| Validation set | held-out for monitoring | 801 |
Training configuration:
| Item | Value |
|---|---|
| Hardware | 8× NVIDIA RTX 5090 |
| Framework | DeepSpeed ZeRO-2 |
| Precision | bfloat16 |
| Optimizer | AdamW |
| Learning rate | 2.5e-6 (cosine schedule) |
| Epochs | 3 (early-stopped at step 200/357) |
| Effective batch size | (configured for 8-GPU DDP) |
| Wall-clock | ~30 min total |
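The step counts and learning-rate schedule in this table can be sketched as follows. Several things here are assumptions rather than stated facts: that the 801 validation examples are excluded from training, that the cosine schedule decays from the peak to zero over all 357 planned steps, and that there is no warmup. The implied effective batch size is inferred from the step count, not taken from the training configuration.

```python
import math

PEAK_LR = 2.5e-6
TOTAL_STEPS = 357   # planned steps for 3 epochs
EPOCHS = 3
TRAIN_EXAMPLES = 10_714 + 929 + 3_562  # Blocks A + B + D (validation set excluded)

steps_per_epoch = TOTAL_STEPS // EPOCHS           # 119
implied_batch = TRAIN_EXAMPLES / steps_per_epoch  # roughly 128 examples per step


def cosine_lr(step: int, peak: float = PEAK_LR, total: int = TOTAL_STEPS) -> float:
    """Cosine decay from `peak` to zero with no warmup (both are assumptions)."""
    progress = min(max(step / total, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}, implied batch size: {implied_batch:.1f}")
for step in (0, 50, 100, 150, 200, 357):
    print(f"step {step:3d}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions, an effective batch size of 128 would give exactly 119 steps per epoch, which matches the 357-step plan.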
Checkpoint sweep & selection:
| Checkpoint | MMLU full | Status |
|---|---|---|
| 1-epoch ckpt-100 | 27.47% | not selected |
| 3-epoch ckpt-50 | 27.21% | not selected |
| 3-epoch ckpt-100 | 27.86% | not selected |
| 3-epoch ckpt-150 | 28.05% | CHAMPION (FF_3.13) |
| 3-epoch ckpt-200 | 28.17% | rejected (+0.12pp marginal, regressed 5/6 weak domains) |
Compute infrastructure summary
| Resource | Specification |
|---|---|
| GPU | 8× NVIDIA RTX 5090 (24 GB VRAM each, 192 GB total) |
| GPU utilization | ~100% sustained during training |
| Throughput (pretraining) | ~220,000 tokens/sec |
| Distributed training | DeepSpeed ZeRO-2 |
| Numerical precision | bfloat16 (training and inference) |
| Cloud provider | Vast.ai |
Recommended usage
Prompt template (Alpaca-style)
### System:
You are FF-LLM, a helpful assistant.
### Instruction:
<your question>
### Response:
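The template above can be filled programmatically. The helpers below are a minimal sketch: `build_prompt` and `extract_response` are hypothetical names, and cutting the answer at the next `###` marker is a design assumption that would truncate any answer that legitimately contains `###`.

```python
DEFAULT_SYSTEM = "You are FF-LLM, a helpful assistant."


def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """Fill the Alpaca-style template; the prompt ends right after '### Response:'."""
    return (
        f"### System:\n{system}\n"
        f"### Instruction:\n{instruction.strip()}\n"
        "### Response:\n"
    )


def extract_response(generated_text: str) -> str:
    """Return the text after the last '### Response:' marker.

    If the model keeps generating and starts a new '###' section, the
    answer is cut there (an assumption, not documented model behavior).
    """
    answer = generated_text.rsplit("### Response:", 1)[-1]
    return answer.split("###", 1)[0].strip()


prompt = build_prompt("What is the capital of France?")
print(prompt)
```

`extract_response` is meant to be applied to the decoded output of `model.generate`, which includes the prompt itself when the full sequence is decoded.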
Decoding settings
- Always use greedy decoding (do_sample=False).
- Sampling has been shown to degrade factual accuracy by ~5pp on this model family.
Quick start (transformers)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "francescofiamingo1/FF_3.13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")
prompt = """### System:
You are FF-LLM, a helpful assistant.
### Instruction:
What is the capital of France?
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Limitations and known weaknesses
- 2B parameters: knowledge ceiling lower than 7B+ models
- Format compliance moderate: 60% on strict format-discipline bench (yes/no, exact-N, single-letter)
- Entity disambiguation weakness: occasional "anchor entity" over-attribution (e.g., defaulting to Edison for inventor questions)
- Weak domains (per qualitative analysis): mathematics, literature, music, art
- Strong domains: biology, geography, basic science, factual short-form QA
Variant lineage
| Variant | Status | Notes |
|---|---|---|
| FF_3 | base | initial release |
| FF_3.1 | published | post-SFT, MMLU 26.72% |
| FF_3.2 | discontinued | early experiment, not maintained |
| FF_3.11 | published | specialized variant, MMLU 25.20%, 106-bench 71% |
| FF_3.13 | current champion | knowledge repair on FF_3.11 base, MMLU 28.05% |
| FF_3.14 | rejected | full SFT with humanities focus, MMLU flat (no improvement) |
| SLERP t=0.10 (FF_3.13 + FF_3.11) | candidate backup | MMLU 29.10% (+0.41pp), 106-bench tie |
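The SLERP row refers to spherical linear interpolation between two sets of weights. The sketch below shows the general technique on plain Python lists; it is not the merge script used for this candidate. How the interpolation was applied (per tensor or over the whole model) and which endpoint corresponds to t=0 are not stated here, so the convention below (t=0 returns the first vector) is an assumption.

```python
import math


def slerp(t: float, v0: list, v1: list, eps: float = 1e-8) -> list:
    """Spherical linear interpolation between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel (or antiparallel) vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    scale0 = math.sin((1 - t) * omega) / sin_omega
    scale1 = math.sin(t * omega) / sin_omega
    return [scale0 * a + scale1 * b for a, b in zip(v0, v1)]


# Toy example with t=0.10: the result stays close to the first vector.
merged = slerp(0.10, [1.0, 0.0], [0.0, 1.0])
print([round(x, 4) for x in merged])
```

For a real merge, a function like this would typically be applied tensor by tensor to the flattened weights of the two checkpoints.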
Reproducibility
All training data shards, scripts, and intermediate checkpoints are tracked in cloud storage:
- Datasets: s3://ff-llm-datasets/
- Champion model: s3://ff-llm-datasets/champions/latest/
- Build scripts: s3://ff-llm-datasets/ff314/build/ (includes Block E/F builders, anti-anchoring tables, philosophy seeds)
For reproduction support, contact the author.
Citation
@misc{ff_3_13_2026,
author = {francescofiamingo1},
title = {FF_3.13: a 2B GPT-2 model with knowledge repair fine-tuning},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/francescofiamingo1/FF_3.13}}
}
Last updated: 2026-04-18