How do I preprocess or tokenize data for training language models?

If you’re unsure what to fine-tune in the first place, I think starting with Hugging Face’s LLM Course or the smol course will help you avoid confusion.


1) Start by choosing your training objective (it determines the “right” preprocessing)

Causal LM (CLM, decoder-only; “next token prediction”)

  • You typically tokenize → concatenate → chunk into fixed-length blocks (block_size) for efficient training.
  • This is the pattern used in HF’s canonical CLM example script (run_clm.py). (GitHub)

Chat / instruction SFT (still CLM under the hood, but formatted as messages)

  • Your biggest risk is formatting + special tokens + label masking, not raw tokenization.
  • The safe default is to use chat templates correctly (details below). (Hugging Face)

MLM (BERT-style)

  • Tokenization is similar, but masking is usually applied by a data collator at batch time.
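As an illustration of what the collator does at batch time (a simplified re-implementation of BERT-style 80/10/10 masking, not HF’s actual collator; `mask_token_id` and `vocab_size` are stand-ins for your tokenizer’s values):

```python
import random

IGNORE_INDEX = -100  # labels at this value are excluded from the loss

def mlm_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15, seed=0):
    # For ~15% of positions: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
    # Labels hold the original token only at selected positions.
    rng = random.Random(seed)
    labels = [IGNORE_INDEX] * len(input_ids)
    masked = list(input_ids)
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token_id
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token
    return masked, labels
```

Because masking is random per batch, the same example is masked differently across epochs, which is why it belongs in the collator rather than in preprocessing.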

2) Core tools in the Hugging Face stack (and what each is for)

🤗 Datasets (I/O + transformations)

  • Load data from files or Hub, transform with map(), filter, shuffle, stream big corpora.
  • If your dataset is too large to store locally, load in streaming mode to get an IterableDataset. (Hugging Face)

🤗 Transformers tokenizers (text → token IDs)

  • Prefer Fast tokenizers (Rust-backed) for speed and consistent behavior. (Hugging Face)

Optional: large-scale data pipelines (dedup/filtering)

  • For web-scale preprocessing (filtering, dedup, etc.), HF’s DataTrove provides reference pipelines (e.g., the FineWeb processing script). (GitHub)

3) Data cleaning & quality filtering (what matters most before tokenization)

This step often dominates downstream model quality.

Minimum “always do it” cleaning

  • Normalize whitespace / remove null bytes / fix obvious encoding issues.
  • Drop pathological samples (extremely short, extremely long, repetitive junk).
  • Remove markup if your source is HTML.
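The “always do it” cleaning above can be sketched as a single per-sample function (thresholds like `min_chars`/`max_chars` are illustrative defaults, not recommendations from any library):

```python
import re

def clean_text(text, min_chars=32, max_chars=100_000):
    """Return a cleaned string, or None if the sample should be dropped."""
    text = text.replace("\x00", "")           # strip null bytes
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse long runs of blank lines
    text = text.strip()
    if not (min_chars <= len(text) <= max_chars):
        return None                           # drop pathological samples
    return text
```

In a 🤗 Datasets pipeline this maps naturally onto a `map()` followed by a `filter(lambda x: x["text"] is not None)`.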

Deduplicate (especially for pretraining / continued pretraining)

Duplicate data wastes compute and can leak evaluation examples into training.

  • FineWeb explicitly documents a pipeline of cleaning + dedup, and points to a working script for the full process. (Hugging Face)
  • The DataTrove repository includes an example script used to create FineWeb. (GitHub)

If you’re not operating at web scale, even exact-match dedup (hash the normalized text) gives a meaningful win.
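Exact-match dedup at that scale fits in a few lines; this sketch hashes a whitespace/case-normalized form of each document (the normalization choice is an assumption, tune it to your data):

```python
import hashlib
import re

def normalize(text):
    # Cheap canonical form so trivial whitespace/case variants collide.
    return re.sub(r"\s+", " ", text).strip().lower()

def dedup_exact(texts):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for t in texts:
        h = hashlib.sha256(normalize(t).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(t)
    return kept
```

Hashing (rather than storing normalized strings in the set) keeps memory bounded even for large corpora.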


4) Tokenizer strategy: reuse vs train a new one

Fine-tuning an existing model

Use the model’s tokenizer as-is. Changing the vocabulary invalidates the model’s embedding matrix and has other knock-on effects, and it usually isn’t worth it.

Pretraining from scratch (or new language/domain where the tokenizer is a bad fit)

Train a tokenizer on a representative slice of your corpus.

  • HF’s LLM course shows train_new_from_iterator() as a practical approach (works with fast tokenizers). (Hugging Face)
  • The Transformers tokenizer docs explain fast vs slow tokenizers and expected capabilities. (Hugging Face)
  • HF also published a late-2025 overview of tokenization for LLMs (useful for updated mental models and API direction). (Hugging Face)
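A sketch of the `train_new_from_iterator()` approach (the `gpt2` base and `vocab_size=32000` are illustrative choices, not recommendations; the generator keeps memory flat on large corpora):

```python
def batch_iterator(texts, batch_size=1000):
    # Yield lists of raw strings so the trainer streams the corpus
    # instead of materializing it all at once.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

def train_domain_tokenizer(texts, base="gpt2", vocab_size=32000):
    # Requires transformers and a *fast* base tokenizer.
    from transformers import AutoTokenizer

    old_tok = AutoTokenizer.from_pretrained(base, use_fast=True)
    return old_tok.train_new_from_iterator(
        batch_iterator(texts), vocab_size=vocab_size
    )
```

The new tokenizer keeps the base tokenizer’s algorithm and special-token conventions but learns a vocabulary from your corpus.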

5) Tokenize efficiently with datasets.map() (speed + reproducibility)

Use batch mapping (batched=True)

Batch mapping is explicitly designed to speed up tokenization because tokenizers run faster on batches. (Hugging Face)

from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("json", data_files={"train": "train.jsonl"})
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=True)

def tokenize(batch):
    # No truncation here: for CLM we concatenate and chunk into blocks later.
    return tok(batch["text"], truncation=False)

tokenized = ds["train"].map(
    tokenize,
    batched=True,
    remove_columns=ds["train"].column_names,
)

Caching and saving artifacts

  • HF Datasets caches transform results; if caching is disabled, your transforms can be recomputed and then deleted at session end unless you explicitly save the result. (Hugging Face)

tokenized.save_to_disk("./tokenized_train")

When map() slows down near the end

This is a common report in real workflows (often due to I/O, cache writes, or skewed example sizes). A typical mitigation is to shard, reduce output columns, and ensure fast local storage. (Hugging Face Forums)


6) CLM preprocessing: packing (concatenate + chunk) and boundary handling

The standard “group_texts” approach

The canonical CLM recipe is: tokenize → concatenate → slice into block_size chunks (often with labels = input_ids). This is the approach discussed around run_clm.py. (GitHub)
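A minimal version of that recipe, in the spirit of `run_clm.py` (this sketch drops the remainder shorter than `block_size`, which also sidesteps the incorrect-length failure mode discussed further down):

```python
def group_texts(examples, block_size=1024):
    """Concatenate tokenized docs and slice into fixed-size blocks.

    `examples` is a dict of lists-of-token-id-lists, as produced by a
    batched tokenize map (input_ids, attention_mask, ...).
    """
    concatenated = {k: sum(examples[k], []) for k in examples}
    # Drop the tail shorter than block_size so every chunk is exactly block_size.
    total_len = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i : i + block_size] for i in range(0, total_len, block_size)]
        for k, v in concatenated.items()
    }
    result["labels"] = [ids[:] for ids in result["input_ids"]]  # CLM: labels = input_ids
    return result
```

You’d typically run this as a second `map(group_texts, batched=True)` over the tokenized dataset.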

Boundary pitfall: “Should I insert EOS between documents?”

This is a frequently debated detail; there’s a dedicated issue asking whether run_clm.py should separate documents with a special token. (GitHub)

Practical guidance

  • If your samples are independent documents, append an EOS to each doc before concatenation to prevent unnatural “doc bleed”.
  • If your data is already a continuous stream (e.g., book text split into lines), you may choose not to.
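Appending EOS is a one-line `map()` between tokenization and chunking; this sketch hard-codes GPT-2’s EOS id for self-containment, but in practice you should read `tokenizer.eos_token_id`:

```python
EOS_ID = 50256  # gpt2's eos_token_id; use tokenizer.eos_token_id in real code

def append_eos(batch):
    # Run after tokenization and before concatenation/chunking, so every
    # independent document ends with EOS in the packed token stream.
    batch["input_ids"] = [ids + [EOS_ID] for ids in batch["input_ids"]]
    batch["attention_mask"] = [mask + [1] for mask in batch["attention_mask"]]
    return batch
```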

Block-size pitfall: remainder handling

A known failure mode is producing chunks that aren’t exactly block_size, causing training errors. There’s an issue specifically about group_texts needing to drop incorrect-length sequences. (GitHub)


7) Chat / instruction SFT: use chat templates correctly (most important for your case)

Recommended default: apply_chat_template(..., tokenize=True)

Transformers explicitly warns that chat templates generally already include the special tokens; templating into text and then tokenizing “normally” can insert special tokens twice and degrade performance. (Hugging Face)

def chat_to_features(example, tokenizer):
    # example["messages"] = [{"role": "system"/"user"/"assistant", "content": "..."}]
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
    )

If you do template → tokenize in two steps

Set add_special_tokens=False when tokenizing the rendered string, exactly as the docs recommend. (Hugging Face)

This issue shows a concrete example where templating then encoding results in duplicated BOS. (GitHub)


8) Labels and loss masking (assistant-only / completion-only training)

If you want loss only on the assistant output (common in instruction tuning):

  • TRL documents DataCollatorForCompletionOnlyLM and states it works only when packing=False. (Hugging Face)
  • There’s also an explicit TRL issue asking if you can combine packing with completion-only training (short answer: not directly “as-is”). (GitHub)
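What completion-only collation does to the labels can be sketched like this (an illustrative re-implementation of the idea, not TRL’s code; the response template is the tokenized marker that precedes the assistant turn):

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def completion_only_labels(input_ids, response_template_ids,
                           ignore_index=IGNORE_INDEX):
    """Mask everything up to and including the first response template,
    so loss is computed only on the assistant's tokens."""
    n, m = len(input_ids), len(response_template_ids)
    for start in range(n - m + 1):
        if input_ids[start : start + m] == response_template_ids:
            boundary = start + m
            return [ignore_index] * boundary + list(input_ids[boundary:])
    return [ignore_index] * n  # template not found: sample contributes no loss
```

This also illustrates why packing is tricky here: once samples are concatenated, a single “find the template, mask before it” pass no longer respects sample boundaries.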

Practical recommendation

  • Start with correctness: completion-only + no packing (simple, reliable).
  • Only introduce packing after you have tests that confirm label masking does not cross sample boundaries.

9) Large datasets: when to stream instead of materialize

If the corpus is too large for local disk/RAM, use streaming:

  • streaming=True yields an IterableDataset you can iterate without downloading everything. (Hugging Face)
  • Be aware: streaming has different performance characteristics, and there are ongoing questions/issues about throughput and how it compares to map-style datasets. (GitHub)

A common production pattern is:

  1. stream + light filtering →
  2. write cleaned shards (e.g., parquet/jsonl) →
  3. train on the stable shards with map-style datasets for speed.
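The shard-writing step of that pattern can be sketched generically (the `write_shard` callback is a placeholder for whatever writes one parquet/jsonl file per shard):

```python
import itertools

def write_shards(records, shard_size, write_shard):
    """Consume any iterable (e.g. a filtered IterableDataset) and emit
    fixed-size shards via the write_shard(shard_index, records) callback."""
    it = iter(records)
    n_shards = 0
    while True:
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        write_shard(n_shards, shard)
        n_shards += 1
    return n_shards
```

Because `islice` pulls lazily from the stream, only one shard is ever held in memory at a time.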

10) A “best-practice checklist” (what tends to work well)

Tokenization & formatting

  • Use fast tokenizers (use_fast=True). (Hugging Face)
  • Use Dataset.map(..., batched=True) for tokenization speed. (Hugging Face)
  • For chat SFT: prefer apply_chat_template(tokenize=True); if not, set add_special_tokens=False. (Hugging Face)

CLM packing

  • Ensure chunking outputs exactly block_size (drop remainder). (GitHub)
  • Decide and document whether you insert EOS between documents (and keep it consistent). (GitHub)

Dataset ops & reproducibility

  • Remove unused columns early (remove_columns=...) to reduce I/O and cache size. (Hugging Face)
  • If caching is disabled, call save_to_disk(), or you’ll lose results at session end. (Hugging Face)

Scaling

  • Stream very large corpora, and materialize only cleaned/filtered shards you intend to train on. (Hugging Face)
  • For web-scale, follow a pipeline-style approach with filtering + dedup (FineWeb + DataTrove are good reference points). (Hugging Face)

Recommended “reading order” (fast path)

  1. Batch mapping (datasets.map with batched=True). (Hugging Face)
  2. Chat templating (and the special-token pitfall). (Hugging Face)
  3. Completion-only SFT constraints in TRL (packing vs masking). (Hugging Face)
  4. Streaming docs for big data. (Hugging Face)
  5. FineWeb/DataTrove pipeline as a reference for real-world filtering/dedup. (Hugging Face)