How do I preprocess or tokenize data for training language models?

If you’re unsure what to fine-tune in the first place, I think starting with Hugging Face’s LLM Course or the smol course will help you avoid confusion.


1) Start by choosing your training objective (it determines the “right” preprocessing)

Causal LM (CLM, decoder-only; “next token prediction”)

  • You typically tokenize → concatenate → chunk into fixed-length blocks (block_size) for efficient training.
  • This is the pattern used in HF’s canonical CLM example script (run_clm.py). (GitHub)

Chat / instruction SFT (still CLM under the hood, but formatted as messages)

  • Your biggest risk is formatting + special tokens + label masking, not raw tokenization.
  • The safe default is to use chat templates correctly (details below). (Hugging Face)

MLM (BERT-style)

  • Tokenization is similar, but masking is usually applied by a data collator at batch time.
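As an illustration of what the collator does at batch time (a simplified re-implementation of BERT-style 80/10/10 masking, not HF’s actual collator; `mask_token_id` and `vocab_size` are stand-ins for your tokenizer’s values):

```python
import random

IGNORE_INDEX = -100  # labels at this value are excluded from the loss

def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15, seed=0):
    # For ~15% of positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    # Labels hold the original token only at selected positions.
    rng = random.Random(seed)
    labels = [IGNORE_INDEX] * len(input_ids)
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token_id
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token
    return masked, labels
```

Because masking is random per batch, the same example is masked differently across epochs, which is why it belongs in the collator rather than in preprocessing.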

2) Core tools in the Hugging Face stack (and what each is for)

🤗 Datasets (I/O + transformations)

  • Load data from files or Hub, transform with map(), filter, shuffle, stream big corpora.
  • If your dataset is too large to store locally, load in streaming mode to get an IterableDataset. (Hugging Face)

🤗 Transformers tokenizers (text → token IDs)

  • Prefer Fast tokenizers (Rust-backed) for speed and consistent behavior. (Hugging Face)

Optional: large-scale data pipelines (dedup/filtering)

  • For web-scale preprocessing (filtering, dedup, etc.), HF’s DataTrove provides reference pipelines (e.g., the FineWeb processing script). (GitHub)

3) Data cleaning & quality filtering (what matters most before tokenization)

This step often dominates downstream model quality.

Minimum “always do it” cleaning

  • Normalize whitespace / remove null bytes / fix obvious encoding issues.
  • Drop pathological samples (extremely short, extremely long, repetitive junk).
  • Remove markup if your source is HTML.
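The “always do it” cleaning above can be sketched as a single per-sample function (thresholds like `min_chars`/`max_chars` are illustrative defaults, not recommendations from any library):

```python
import re

def clean_text(text, min_chars=32, max_chars=100_000):
    """Return a cleaned string, or None if the sample should be dropped."""
    text = text.replace("\x00", "")           # strip null bytes
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse long runs of blank lines
    text = text.strip()
    if not (min_chars <= len(text) <= max_chars):
        return None                           # drop pathological samples
    return text
```

In a 🤗 Datasets pipeline this maps naturally onto a `map()` followed by a `filter(lambda x: x["text"] is not None)`.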

Deduplicate (especially for pretraining / continued pretraining)

Duplicate data wastes compute and can leak evaluation examples into training.

  • FineWeb explicitly documents a pipeline of cleaning + dedup, and points to a working script for the full process. (Hugging Face)
  • The DataTrove repository includes an example script used to create FineWeb. (GitHub)

If you’re not operating at web scale, even exact-match dedup (hash the normalized text) gives a meaningful win.
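Exact-match dedup at that scale fits in a few lines; this sketch hashes a whitespace/case-normalized form of each document (the normalization choice is an assumption, tune it to your data):

```python
import hashlib
import re

def normalize(text):
    # Cheap canonical form so trivial whitespace/case variants collide.
    return re.sub(r"\s+", " ", text).strip().lower()

def dedup_exact(texts):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```

Hashing (rather than storing normalized strings in the set) keeps memory bounded even for large corpora.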


4) Tokenizer strategy: reuse vs train a new one

Fine-tuning an existing model

Use the model’s tokenizer as-is. Changing the vocabulary invalidates the model’s embedding matrix and has other knock-on effects, and it usually isn’t worth it.

Pretraining from scratch (or new language/domain where the tokenizer is a bad fit)

Train a tokenizer on a representative slice of your corpus.

  • HF’s LLM course shows train_new_from_iterator() as a practical approach (works with fast tokenizers). (Hugging Face)
  • The Transformers tokenizer docs explain fast vs slow tokenizers and expected capabilities. (Hugging Face)
  • HF also published a late-2025 overview of tokenization for LLMs (useful for updated mental models and API direction). (Hugging Face)
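A sketch of the `train_new_from_iterator()` approach (the `gpt2` base and `vocab_size=32000` are illustrative choices, not recommendations; the generator keeps memory flat on large corpora):

```python
def batch_iterator(texts, batch_size=1000):
    # Yield lists of raw strings so the trainer streams the corpus
    # instead of materializing it all at once.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

def train_domain_tokenizer(texts, base="gpt2", vocab_size=32000):
    # Requires transformers and a *fast* base tokenizer.
    from transformers import AutoTokenizer

    old_tok = AutoTokenizer.from_pretrained(base, use_fast=True)
    return old_tok.train_new_from_iterator(
        batch_iterator(texts), vocab_size=vocab_size
    )
```

The new tokenizer keeps the base tokenizer’s algorithm and special-token conventions but learns a vocabulary from your corpus.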

5) Tokenize efficiently with datasets.map() (speed + reproducibility)

Use batch mapping (batched=True)

Batch mapping is explicitly designed to speed up tokenization because tokenizers run faster on batches. (Hugging Face)

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("json", data_files={"train": "train.jsonl"})
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def tokenize(batch):
    # No truncation here: for CLM we concatenate and chunk into blocks later.
    return tok(batch["text"], truncation=False)

tokenized = ds["train"].map(
    tokenize,
    batched=True,
    remove_columns=ds["train"].column_names,
)

Caching and saving artifacts

  • HF Datasets caches transform results; if caching is disabled, your transforms can be recomputed and then deleted at session end unless you explicitly save the result. (Hugging Face)

tokenized.save_to_disk("./tokenized_train")

When map() slows down near the end

This is a common report in real workflows (often due to I/O, cache writes, or skewed example sizes). A typical mitigation is to shard, reduce output columns, and ensure fast local storage. (Hugging Face Forums)


6) CLM preprocessing: packing (concatenate + chunk) and boundary handling

The standard “group_texts” approach

The canonical CLM recipe is: tokenize → concatenate → slice into block_size chunks (often with labels = input_ids). This is the approach discussed around run_clm.py. (GitHub)
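A minimal version of that recipe, in the spirit of `run_clm.py` (this sketch drops the remainder shorter than `block_size`, which also sidesteps the incorrect-length failure mode discussed further down):

```python
def group_texts(examples, block_size=1024):
    """Concatenate tokenized docs and slice into fixed-size blocks.

    `examples` is a dict of lists-of-token-id-lists, as produced by a
    batched tokenize map (input_ids, attention_mask, ...).
    """
    concatenated = {k: sum(examples[k], []) for k in examples}
    # Drop the tail shorter than block_size so every chunk is exactly block_size.
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = [ids[:] for ids in result["input_ids"]]  # CLM: labels = input_ids
    return result
```

You’d typically run this as a second `map(group_texts, batched=True)` over the tokenized dataset.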

Boundary pitfall: “Should I insert EOS between documents?”

This is a frequently debated detail; there’s a dedicated issue asking whether run_clm.py should separate documents with a special token. (GitHub)

Practical guidance

  • If your samples are independent documents, append an EOS to each doc before concatenation to prevent unnatural “doc bleed”.
  • If your data is already a continuous stream (e.g., book text split into lines), you may choose not to.
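Appending EOS is a one-line `map()` between tokenization and chunking; this sketch hard-codes GPT-2’s EOS id for self-containment, but in practice you should read `tokenizer.eos_token_id`:

```python
EOS_ID = 50256  # gpt2's eos_token_id; use tokenizer.eos_token_id in real code

def append_eos(batch):
    # Run after tokenization and before concatenation/chunking, so every
    # independent document ends with EOS in the packed token stream.
    batch["input_ids"] = [ids + [EOS_ID] for ids in batch["input_ids"]]
    batch["attention_mask"] = [mask + [1] for mask in batch["attention_mask"]]
    return batch
```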

Block-size pitfall: remainder handling

A known failure mode is producing chunks that aren’t exactly block_size, causing training errors. There’s an issue specifically about group_texts needing to drop incorrect-length sequences. (GitHub)


7) Chat / instruction SFT: use chat templates correctly (most important for your case)

Recommended default: apply_chat_template(..., tokenize=True)

Transformers explicitly warns that chat templates generally already include the special tokens; templating into text and then tokenizing “normally” can insert special tokens twice and degrade performance. (Hugging Face)

def chat_to_features(example, tokenizer):
    # example["messages"] = [{"role": "system"/"user"/"assistant", "content": "..."}]
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
    )

If you do template → tokenize in two steps

Set add_special_tokens=False when tokenizing the rendered string, exactly as the docs recommend. (Hugging Face)

This issue shows a concrete example where templating then encoding results in duplicated BOS. (GitHub)


8) Labels and loss masking (assistant-only / completion-only training)

If you want loss only on the assistant output (common in instruction tuning):

  • TRL documents DataCollatorForCompletionOnlyLM and states it works only when packing=False. (Hugging Face)
  • There’s also an explicit TRL issue asking if you can combine packing with completion-only training (short answer: not directly “as-is”). (GitHub)
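What completion-only collation does to the labels can be sketched like this (an illustrative re-implementation of the idea, not TRL’s code; the response template is the tokenized marker that precedes the assistant turn):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def completion_only_labels(input_ids, response_template_ids,
                           ignore_index=IGNORE_INDEX):
    """Mask everything up to and including the first response template,
    so loss is computed only on the assistant's tokens."""
    n, m = len(input_ids), len(response_template_ids)
    for start in range(n - m + 1):
        if input_ids[start : start + m] == response_template_ids:
            boundary = start + m
            return [ignore_index] * boundary + list(input_ids[boundary:])
    return [ignore_index] * n  # template not found: sample contributes no loss
```

This also illustrates why packing is tricky here: once samples are concatenated, a single “find the template, mask before it” pass no longer respects sample boundaries.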

Practical recommendation

  • Start with correctness: completion-only + no packing (simple, reliable).
  • Only introduce packing after you have tests that confirm label masking does not cross sample boundaries.

9) Large datasets: when to stream instead of materialize

If the corpus is too large for local disk/RAM, use streaming:

  • streaming=True yields an IterableDataset you can iterate without downloading everything. (Hugging Face)
  • Be aware: streaming has different performance characteristics, and there are ongoing questions/issues about throughput and how it compares to map-style datasets. (GitHub)

A common production pattern is:

  1. stream + light filtering →
  2. write cleaned shards (e.g., parquet/jsonl) →
  3. train on the stable shards with map-style datasets for speed.
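The shard-writing step of that pattern can be sketched generically (the `write_shard` callback is a placeholder for whatever writes one parquet/jsonl file per shard):

```python
import itertools

def write_shards(records, shard_size, write_shard):
    """Consume any iterable (e.g. a filtered IterableDataset) and emit
    fixed-size shards via the write_shard(shard_index, records) callback."""
    it = iter(records)
    n_shards = 0
    while True:
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        write_shard(n_shards, shard)
        n_shards += 1
    return n_shards
```

Because `islice` pulls lazily from the stream, only one shard is ever held in memory at a time.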

10) A “best-practice checklist” (what tends to work well)

Tokenization & formatting

  • Use fast tokenizers (use_fast=True). (Hugging Face)
  • Use Dataset.map(..., batched=True) for tokenization speed. (Hugging Face)
  • For chat SFT: prefer apply_chat_template(tokenize=True); if not, set add_special_tokens=False. (Hugging Face)

CLM packing

  • Ensure chunking outputs exactly block_size (drop remainder). (GitHub)
  • Decide and document whether you insert EOS between documents (and keep it consistent). (GitHub)

Dataset ops & reproducibility

  • Remove unused columns early (remove_columns=...) to reduce I/O and cache size. (Hugging Face)
  • If caching is disabled, call save_to_disk(), or you’ll lose results at session end. (Hugging Face)

Scaling

  • Stream very large corpora, and materialize only cleaned/filtered shards you intend to train on. (Hugging Face)
  • For web-scale, follow a pipeline-style approach with filtering + dedup (FineWeb + DataTrove are good reference points). (Hugging Face)

Recommended “reading order” (fast path)

  1. Batch mapping (datasets.map with batched=True). (Hugging Face)
  2. Chat templating (and the special-token pitfall). (Hugging Face)
  3. Completion-only SFT constraints in TRL (packing vs masking). (Hugging Face)
  4. Streaming docs for big data. (Hugging Face)
  5. FineWeb/DataTrove pipeline as a reference for real-world filtering/dedup. (Hugging Face)