If you’re unsure about what to fine-tune, I think starting with either the LLM Course or the smol course will help you avoid confusion.
1) Start by choosing your training objective (it determines the “right” preprocessing)
Causal LM (CLM, decoder-only; “next token prediction”)
- You typically tokenize → concatenate → chunk into fixed-length blocks (`block_size`) for efficient training.
- This is the pattern used in HF’s canonical CLM example script (`run_clm.py`). (GitHub)
Chat / instruction SFT (still CLM under the hood, but formatted as messages)
- Your biggest risk is formatting + special tokens + label masking, not raw tokenization.
- The safe default is to use chat templates correctly (details below). (Hugging Face)
MLM (BERT-style)
- Tokenization is similar, but masking is usually applied by a data collator at batch time.
2) Core tools in the Hugging Face stack (and what each is for)
datasets (I/O + transformations)
- Load data from files or the Hub; transform with `map()`, `filter()`, `shuffle()`; stream big corpora.
- If your dataset is too large to store locally, load in streaming mode to get an `IterableDataset`. (Hugging Face)
transformers tokenizers (text → token IDs)
- Prefer Fast tokenizers (Rust-backed) for speed and consistent behavior. (Hugging Face)
Optional: large-scale data pipelines (dedup/filtering)
- For web-scale preprocessing (filtering, dedup, etc.), HF’s DataTrove provides reference pipelines (e.g., the FineWeb processing script). (GitHub)
3) Data cleaning & quality filtering (what matters most before tokenization)
This step often dominates downstream model quality.
Minimum “always do it” cleaning
- Normalize whitespace / remove null bytes / fix obvious encoding issues.
- Drop pathological samples (extremely short, extremely long, repetitive junk).
- Remove markup if your source is HTML.
Deduplicate (especially for pretraining / continued pretraining)
Duplicate data wastes compute and can leak evaluation examples into training.
- FineWeb explicitly documents a pipeline of cleaning + dedup, and points to a working script for the full process. (Hugging Face)
- The DataTrove repository includes an example script used to create FineWeb. (GitHub)
If you’re not operating at web scale, even exact-match dedup (hash the normalized text) gives a meaningful win.
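A minimal sketch of that exact-match dedup (the normalization here, collapsing whitespace and lowercasing, is an assumption; adjust it to your corpus, and the `exact_dedup` helper is just an illustrative name):

```python
import hashlib

def normalize(text: str) -> str:
    """Cheap normalization before hashing: collapse whitespace, lowercase."""
    return " ".join(text.split()).lower()

def exact_dedup(texts):
    """Keep only the first occurrence of each normalized text."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```

For a `datasets.Dataset`, the same hash-and-check logic can live inside a `filter()` predicate instead of building a new list.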
4) Tokenizer strategy: reuse vs train a new one
Fine-tuning an existing model
Use the model’s tokenizer as-is. Changing vocab has knock-on effects and usually isn’t worth it.
Pretraining from scratch (or new language/domain where the tokenizer is a bad fit)
Train a tokenizer on a representative slice of your corpus.
- HF’s LLM course shows `train_new_from_iterator()` as a practical approach (works with fast tokenizers). (Hugging Face)
- The Transformers tokenizer docs explain fast vs slow tokenizers and expected capabilities. (Hugging Face)
- HF also published a late-2025 overview of tokenization for LLMs (useful for updated mental models and API direction). (Hugging Face)
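A sketch of the `train_new_from_iterator()` pattern. The batching helper is the part worth getting right (the trainer wants an iterator of text batches, not one giant string); the in-memory `corpus` list and the `vocab_size` value are placeholder assumptions, and the actual training call is shown commented because it needs a fast tokenizer loaded from the Hub:

```python
# Assumes an in-memory list of texts; for real corpora, iterate a
# datasets.Dataset (or a streaming IterableDataset) instead.
corpus = ["domain text one", "domain text two", "domain text three"]

def batch_iterator(texts, batch_size=1000):
    """Yield lists of texts, the shape train_new_from_iterator() expects."""
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

# With a fast tokenizer loaded as `old_tok`,
# e.g. old_tok = AutoTokenizer.from_pretrained("gpt2"):
# new_tok = old_tok.train_new_from_iterator(batch_iterator(corpus), vocab_size=32000)
# new_tok.save_pretrained("./my-tokenizer")
```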
5) Tokenize efficiently with datasets.map() (speed + reproducibility)
Use batch mapping (batched=True)
Batch mapping is explicitly designed to speed up tokenization because tokenizers run faster on batches. (Hugging Face)
```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("json", data_files={"train": "train.jsonl"})
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def tokenize(batch):
    return tok(batch["text"], truncation=False)

tokenized = ds["train"].map(
    tokenize,
    batched=True,
    remove_columns=ds["train"].column_names,
)
```
Caching and saving artifacts
- HF Datasets uses caching; if caching is disabled, your transforms can be recomputed and then deleted at session end unless you explicitly save the result. (Hugging Face)
```python
tokenized.save_to_disk("./tokenized_train")
```
When map() slows down near the end
This is a common report in real workflows (often due to I/O, cache writes, or skewed example sizes). A typical mitigation is to shard, reduce output columns, and ensure fast local storage. (Hugging Face Forums)
6) CLM preprocessing: packing (concatenate + chunk) and boundary handling
The standard “group_texts” approach
The canonical CLM recipe is: tokenize → concatenate → slice into block_size chunks (often with labels = input_ids). This is the approach discussed around run_clm.py. (GitHub)
Boundary pitfall: “Should I insert EOS between documents?”
This is a frequently debated detail; there’s a dedicated issue asking whether run_clm.py should separate documents with a special token. (GitHub)
Practical guidance
- If your samples are independent documents, append an EOS to each doc before concatenation to prevent unnatural “doc bleed”.
- If your data is already a continuous stream (e.g., book text split into lines), you may choose not to.
Block-size pitfall: remainder handling
A known failure mode is producing chunks that aren’t exactly block_size, causing training errors. There’s an issue specifically about group_texts needing to drop incorrect-length sequences. (GitHub)
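Putting the pieces together, a simplified sketch of the `group_texts` pattern (it follows the shape used in `run_clm.py`, but trimmed down; the remainder drop addresses the incorrect-length issue mentioned above):

```python
def group_texts(examples, block_size=1024):
    """Concatenate tokenized examples and slice into fixed-size blocks.

    `examples` is a batch of tokenized columns, e.g.
    {"input_ids": [[...], [...]], "attention_mask": [[...], [...]]}.
    The tail that doesn't fill a full block is dropped, so every chunk
    is exactly block_size tokens long.
    """
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size  # drop remainder
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()  # standard CLM: labels = inputs
    return result
```

If you decide to separate documents with EOS, append the EOS id to each example’s `input_ids` in the tokenize step, before this function runs.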
7) Chat / instruction SFT: use chat templates correctly (most important for your case)
Recommended default: apply_chat_template(..., tokenize=True)
Transformers explicitly warns that chat templates generally already include the special tokens; templating into text and then tokenizing “normally” can insert special tokens twice and degrade performance. (Hugging Face)
```python
def chat_to_features(example, tokenizer):
    # example["messages"] = [{"role": "system"/"user"/"assistant", "content": "..."}]
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
    )
```
If you do template → tokenize in two steps
Set add_special_tokens=False when tokenizing the rendered string, exactly as the docs recommend. (Hugging Face)
This issue shows a concrete example where templating then encoding results in duplicated BOS. (GitHub)
8) Labels and loss masking (assistant-only / completion-only training)
If you want loss only on the assistant output (common in instruction tuning):
- TRL documents `DataCollatorForCompletionOnlyLM` and states it works only when `packing=False`. (Hugging Face)
- There’s also an explicit TRL issue asking if you can combine packing with completion-only training (short answer: not directly “as-is”). (GitHub)
Practical recommendation
- Start with correctness: completion-only + no packing (simple, reliable).
- Only introduce packing after you have tests that confirm label masking does not cross sample boundaries.
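To make the masking concrete, here is a hand-rolled illustration of what completion-only training does to the labels. This is not TRL’s implementation (its collator locates the response template in the token stream for you); `mask_prompt_labels` and `response_start` are hypothetical names, and `-100` is the index PyTorch’s cross-entropy ignores:

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def mask_prompt_labels(input_ids, response_start):
    """Copy input_ids to labels, masking everything before the assistant
    response so only completion tokens contribute to the loss."""
    labels = list(input_ids)
    for i in range(min(response_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

A test like “every label before the response marker is `-100`, every label after equals the input id” is exactly the kind of check to have in place before you turn packing on.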
9) Large datasets: when to stream instead of materialize
If the corpus is too large for local disk/RAM, use streaming:
- `streaming=True` yields an `IterableDataset` you can iterate without downloading everything. (Hugging Face)
- Be aware: streaming has different performance characteristics, and there are ongoing questions/issues about throughput and how it compares to map-style datasets. (GitHub)
A common production pattern is:
- stream + light filtering →
- write cleaned shards (e.g., parquet/jsonl) →
- train on the stable shards with map-style datasets for speed.
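The “write cleaned shards” step can be sketched in plain Python like this (the `write_shards` helper is an illustrative name; in practice `examples` would be your filtered streaming `IterableDataset`, the shard size would be far larger, and you might prefer Parquet over JSONL):

```python
import itertools
import json
import pathlib

def write_shards(examples, out_dir, shard_size=10_000):
    """Write an iterable of dict examples to fixed-size JSONL shards."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    it = iter(examples)
    for shard_idx in itertools.count():
        batch = list(itertools.islice(it, shard_size))
        if not batch:
            break  # source exhausted
        path = out / f"shard-{shard_idx:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for ex in batch:
                f.write(json.dumps(ex) + "\n")
        paths.append(path)
    return paths
```

The resulting shards can then be loaded as a normal map-style dataset with `load_dataset("json", data_files=[...])` for the actual training run.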
10) A “best-practice checklist” (what tends to work well)
Tokenization & formatting
- Use fast tokenizers (`use_fast=True`). (Hugging Face)
- Use `Dataset.map(..., batched=True)` for tokenization speed. (Hugging Face)
- For chat SFT: prefer `apply_chat_template(tokenize=True)`; if not, set `add_special_tokens=False`. (Hugging Face)
CLM packing
- Ensure chunking outputs exactly `block_size` (drop the remainder). (GitHub)
- Decide and document whether you insert EOS between documents (and keep it consistent). (GitHub)
Dataset ops & reproducibility
- Remove unused columns early (`remove_columns=...`) to reduce I/O and cache size. (Hugging Face)
- If caching is disabled, call `save_to_disk()` or you’ll lose results at session end. (Hugging Face)
Scaling
- Stream very large corpora, and materialize only cleaned/filtered shards you intend to train on. (Hugging Face)
- For web-scale, follow a pipeline-style approach with filtering + dedup (FineWeb + DataTrove are good reference points). (Hugging Face)
Recommended “reading order” (fast path)
- Batch mapping (`datasets.map` with `batched=True`). (Hugging Face)
- Chat templating (and the special-token pitfall). (Hugging Face)
- Completion-only SFT constraints in TRL (packing vs masking). (Hugging Face)
- Streaming docs for big data. (Hugging Face)
- FineWeb/DataTrove pipeline as a reference for real-world filtering/dedup. (Hugging Face)