Why is my token classification model performing poorly on custom data?

Based solely on that question, I can only offer a broad, general explanation of the usual causes:


Why token-classification models often look “bad” on custom data

Token classification (NER, POS tagging, slot filling) means predicting a label for each token the model sees. With modern Transformers, “token” usually means the subword pieces produced by the tokenizer (WordPiece/BPE), not your original “words.” The most common failures happen when your labels are attached to words but training and evaluation happen on subword tokens. Hugging Face’s own guide explicitly highlights the required steps: map tokens back to words with word_ids(), set -100 on special tokens, and label only the first subtoken of each word (or handle subtokens very carefully). (Hugging Face)

Below are the most common reasons accuracy (or F1) ends up low, grouped by category, with symptoms + fixes.


1) Label ↔ token misalignment (most common cause)

Background: word labels vs subword tokens

A word like unbelievable might be tokenized into ["un", "##bel", "##ievable"]. If your dataset has one label per word, you must decide what labels the subtokens get. The standard recipe used in many reference scripts is:

  • Special tokens ([CLS], [SEP]) → label -100 (ignored)
  • Only the first subtoken of each word gets the word label
  • Remaining subtokens → -100 (ignored)

This is exactly what the HF token classification guide recommends. (Hugging Face)
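
A minimal sketch of that recipe, assuming a fast tokenizer and a dataset whose examples have tokens (word lists) and ner_tags (per-word label ids); the column names are placeholders for your own schema:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # any fast tokenizer

def tokenize_and_align_labels(examples):
    # examples["tokens"]: list of word lists; examples["ner_tags"]: list of per-word label ids
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # inputs are already split into words
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)  # maps each subtoken to its word (or None)
        previous_word_id = None
        labels = []
        for word_id in word_ids:
            if word_id is None:                # special tokens ([CLS], [SEP]) are ignored
                labels.append(-100)
            elif word_id != previous_word_id:  # first subtoken of a word keeps the word label
                labels.append(word_labels[word_id])
            else:                              # remaining subtokens are ignored in the loss
                labels.append(-100)
            previous_word_id = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized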
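Apply it with something like dataset.map(tokenize_and_align_labels, batched=True, remove_columns=dataset.column_names), assuming a datasets.Dataset with those two columns.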

Typical symptoms

  • Predictions appear shifted (entities start one token too early/late).
  • Model outputs mostly O, even on obvious entities.
  • Training loss decreases, but evaluation stays poor.
  • “Weird” behavior around punctuation, hyphenated words, IDs, emails, URLs.

Common alignment mistakes

  • Not passing is_split_into_words=True when your inputs are already split into words, so the tokenizer treats your list as a batch of separate texts instead of one pre-tokenized sequence.
  • Not using the fast-tokenizer mapping (word_ids()) and instead assuming “one input token = one word.”
  • Forgetting to set -100 on special tokens and/or padding.

A long-running practical discussion is “how do I convert word-level labels to WordPiece labels?”, because this is exactly where many custom datasets break. (Hugging Face Forums)

How to diagnose (fast)

Pick ~10 random examples and print:

  • original words
  • model tokens (convert_ids_to_tokens)
  • word_ids() per token
  • aligned labels per token

If any word-to-token mapping is off, your score can collapse.
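
A quick diagnostic along those lines, reusing tokenize_and_align_labels from above; dataset, label_names, and the column names are placeholders for your own data:

```python
import random

def inspect_alignment(dataset, tokenizer, n=10, label_names=None):
    # Print words, subtokens, word_ids and aligned labels side by side for n random examples.
    for idx in random.sample(range(len(dataset)), n):
        example = dataset[idx]
        enc = tokenize_and_align_labels(
            {"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]}
        )
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        word_ids = enc.word_ids(batch_index=0)
        labels = enc["labels"][0]
        print(f"\n--- example {idx} ---")
        print("words:", example["tokens"])
        for tok, wid, lab in zip(tokens, word_ids, labels):
            shown = label_names[lab] if (label_names is not None and lab != -100) else lab
            print(f"{tok:>15}  word_id={str(wid):>5}  label={shown}")
```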


2) Padding and ignored tokens handled incorrectly (-100)

Background

Token classification training usually uses CrossEntropyLoss(ignore_index=-100) so certain token positions do not contribute to loss (special tokens, padding, often non-first subtokens). HF explicitly calls this out in docs and code patterns. (Hugging Face)

Symptoms

  • Low accuracy/F1 that doesn’t improve much with training.
  • The model learns to predict whatever label was (incorrectly) assigned to padding positions, because those positions dominate the loss.
  • Metrics change drastically with batch size or max length.

Fixes

  • Ensure label tensors are padded with -100 where inputs are padded.
  • Use DataCollatorForTokenClassification (or equivalent) after you have constructed correct labels for each example; the data collator pads, it doesn’t invent labels (see the sketch after this list). (Hugging Face Forums)
  • Verify your metric computation ignores positions where label is -100 (the HF course shows the correct pattern). (Hugging Face)
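
A small sanity check of that setup, assuming tokenized_dataset is the output of the map call from section 1 (with the original string columns removed) and the same tokenizer:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # same tokenizer used for alignment
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Collate a small batch and confirm the label padding is -100, not a real class id.
batch = data_collator([tokenized_dataset[i] for i in range(4)])
print(batch["input_ids"].shape, batch["labels"].shape)
assert (batch["labels"][batch["input_ids"] == tokenizer.pad_token_id] == -100).all()
```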

3) Metric mismatch: “accuracy” isn’t measuring what you think

Background

For NER-like tasks, the most meaningful metric is often entity-level F1 (span-based), not raw token accuracy. HF’s course uses seqeval-style evaluation (decode predicted tag sequences after removing -100). (Hugging Face)

Common evaluation pitfalls

  • Reporting token accuracy when the dataset is dominated by O (accuracy becomes misleading).
  • Including -100 positions in evaluation.
  • Evaluating subtokens as if they were words.

A specific, known trap: label_all_tokens + seqeval

If you label every subtoken (e.g., repeating B-ORG on every subtoken), seqeval can treat each subtoken as a separate entity and the reported entity metrics can become “completely fudged.” This is documented in a Transformers issue. (GitHub)

Practical advice: start with “label first subtoken only” (others -100), get a stable baseline, and only experiment with labeling all subtokens once everything else is correct. (Hugging Face)
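
A sketch of that evaluation pattern, close to what the HF course shows; label_names is a placeholder for your own id-to-tag list, and the seqeval metric is loaded through the evaluate library (both packages need to be installed):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]  # placeholder: use your own label list

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Drop every position labelled -100 (special tokens, padding, non-first subtokens)
    # and convert ids back to tag strings before scoring at the entity level.
    true_labels = [
        [label_names[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```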


4) Inference/post-processing issues (pipeline aggregation and subwords)

Even if the model is trained correctly, decoding can make it look wrong.

Background

The token-classification pipeline uses heuristics to decide whether a token is a “subword,” then applies an aggregation_strategy to merge pieces into entities. There are documented cases where this heuristic causes incorrect merging. (GitHub)

Symptoms

  • You see ## fragments or entities split oddly.
  • Offline evaluation seems okay, but pipeline outputs look wrong.
  • Aggregation doesn’t combine subwords as expected.

Fixes

  • Evaluate directly from logits (your own decoding) to confirm the model itself is fine.
  • In the pipeline, experiment with aggregation_strategy=“none” (as a diagnostic), then “simple” and the more aggressive strategies; the parameter is documented in the pipeline source (see the sketch after this list). (GitHub)
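
A quick way to compare strategies side by side; “my-ner-model” stands in for your fine-tuned checkpoint path or Hub id:

```python
from transformers import pipeline

raw = pipeline("token-classification", model="my-ner-model", aggregation_strategy="none")
merged = pipeline("token-classification", model="my-ner-model", aggregation_strategy="simple")

text = "Angela Merkel visited the European Central Bank in Frankfurt."
print(raw(text))     # one entry per (sub)token: what the model really predicted
print(merged(text))  # grouped spans: what the aggregation heuristic makes of those predictions
```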

5) Truncation / long documents (your entities may be cut off)

Background

Many custom datasets contain longer sequences (logs, legal clauses, clinical notes). If you truncate to max_length, you can drop entities or split them across windows.

Handling long texts correctly often requires return_overflowing_tokens=True plus a stride (sliding window) and correct label remapping per window. There’s an issue and forum discussion specifically about how to tokenize + align labels under overflow/stride. (GitHub)

Symptoms

  • Entities near the end of sentences/documents are never predicted.
  • Metrics improve when you shorten inputs.
  • Many examples hit exactly the max length.

Fix

Measure “% of samples truncated.” If high, implement windowing with stride and align labels per window using overflow mappings. (GitHub)
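
A sketch of both steps, assuming word-split inputs and the tokenizer from section 1 (train_dataset and the 512 limit are placeholders); the stride/overflow arguments are standard fast-tokenizer options, but the per-window labels still have to be rebuilt with the same word_ids() / -100 recipe:

```python
def truncation_rate(dataset, tokenizer, max_length=512):
    # Fraction of examples whose tokenized length exceeds max_length.
    truncated = 0
    for example in dataset:
        enc = tokenizer(example["tokens"], is_split_into_words=True)
        if len(enc["input_ids"]) > max_length:
            truncated += 1
    return truncated / len(dataset)

print(f"{truncation_rate(train_dataset, tokenizer):.1%} of samples would be truncated")

# If the rate is high, window the inputs instead of hard-truncating:
windows = tokenizer(
    train_dataset[0]["tokens"],
    is_split_into_words=True,
    truncation=True,
    max_length=512,
    stride=128,                       # overlap between consecutive windows
    return_overflowing_tokens=True,   # emit extra windows instead of dropping tokens
)
# windows.word_ids(batch_index=j) and windows["overflow_to_sample_mapping"] let you
# rebuild per-window labels with the same first-subtoken / -100 rule as before.
```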


6) Label scheme problems (BIO/BIOES mistakes, invalid transitions)

Background

NER is often labeled with BIO/BIOES schemes that impose constraints (e.g., in strict BIO, I-ORG may only follow B-ORG or I-ORG, never B-PER or O). If your dataset has inconsistent tagging or invalid transitions, training becomes noisier and evaluation penalizes you.

Practical reports show models producing invalid BIO sequences, which is why tagging constraints matter in practice. (CEUR-WS)

Symptoms

  • Many errors are boundary mistakes (B vs I).
  • Confusion between entity types is less common than span fragmentation.
  • Your gold data contains invalid BIO transitions.

Fixes

  • Validate your dataset for BIO legality with a simple script (see the sketch after this list).
  • Consider constrained decoding or CRF-style approaches if boundaries are central (often a second-stage improvement).
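
A minimal legality check of that kind, assuming string BIO tags (extend it if you use BIOES):

```python
def invalid_bio_transitions(tags):
    # Return positions where an I-X tag is not preceded by B-X or I-X of the same type.
    bad = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                bad.append(i)
        previous = tag
    return bad

# Example: I-ORG follows B-PER, which is illegal in strict BIO.
print(invalid_bio_transitions(["B-PER", "I-ORG", "O"]))   # -> [1]
print(invalid_bio_transitions(["B-ORG", "I-ORG", "O"]))   # -> []
```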

7) Class imbalance (too many O, too few entities)

Background

Custom datasets frequently have extreme imbalance (most tokens are O). This can lead the model to predict O everywhere and still look “good” under some metrics, while scoring poorly on entity-level F1.

There are recurring requests/discussions about weighted loss for token classification to address imbalance. (GitHub)

Symptoms

  • High O prediction rate.
  • Some entity types have near-zero recall.
  • Per-class F1 shows only frequent types working.

Fixes

  • Report per-type metrics (not just overall).
  • Oversample sentences containing entities.
  • Use class-weighted loss or focal loss carefully, and only after pipeline correctness is confirmed (a weighted-loss sketch follows this list). (GitHub)
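
One hedged way to implement the class-weighted loss is to subclass Trainer and override compute_loss; the override signature has shifted across transformers versions, so **kwargs is used for compatibility, and the weight values below are placeholders you would derive from your own label frequencies:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # 1D tensor, one weight per label id

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device),
            ignore_index=-100,   # keep ignoring special/pad/non-first-subtoken positions
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Placeholder weights: down-weight "O", up-weight rare entity classes.
class_weights = torch.tensor([0.2, 1.0, 1.0, 2.0, 2.0])
```

Pass class_weights=class_weights when constructing WeightedLossTrainer, and derive the weights from your actual label counts rather than guessing.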

8) Domain shift: pretrained model doesn’t match your text

If your custom data differs significantly from the base model’s pretraining domain (jargon, noisy OCR, special formatting, new entity surface forms), performance can be genuinely limited.

A standard high-impact technique is additional pretraining on unlabeled in-domain or task-adjacent text (DAPT/TAPT). (ACL Anthology)

Symptoms

  • Tokenizer breaks many words into many pieces.
  • Entities are domain-specific and rarely seen in general corpora.
  • Model misses patterns that humans find obvious in-domain.

Fix

Do domain/task-adaptive pretraining on unlabeled text from your distribution, then fine-tune again. (ACL Anthology)
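
A compressed sketch of that idea as continued masked-language-model training, assuming a BERT-style checkpoint and a plain-text file of unlabeled in-domain text; the file name and hyperparameters are placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "bert-base-cased"                       # the backbone you later fine-tune for NER
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# "domain.txt" is a placeholder for your unlabeled in-domain corpus.
raw = load_dataset("text", data_files={"train": "domain.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-dapt", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
# Afterwards, load the adapted checkpoint with AutoModelForTokenClassification
# and fine-tune on your labeled data as before.
```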


A practical debugging order that usually finds the cause quickly

1) Check alignment visually (10 examples)

Follow the exact HF-recommended mapping (word_ids(), -100, first subtoken only). (Hugging Face)

2) Confirm evaluation is correct

  • Remove -100
  • Use entity-level metrics (seqeval-style)
    HF’s course shows the canonical approach. (Hugging Face)

3) Overfit a tiny subset (50–200 examples)

  • If you cannot get near-perfect training performance on such a subset, it’s almost always a pipeline bug: alignment, labels, padding, mapping (a minimal sketch follows).
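
A sketch of that check, reusing names from the earlier sketches (tokenized_dataset, label_names, data_collator, compute_metrics); all of them are assumptions about your setup:

```python
from transformers import AutoModelForTokenClassification, Trainer, TrainingArguments

tiny = tokenized_dataset.select(range(100))          # small slice to deliberately overfit
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_names)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="overfit-check", num_train_epochs=30,
                           learning_rate=5e-5, logging_steps=10),
    train_dataset=tiny,
    eval_dataset=tiny,                               # evaluate on the same slice on purpose
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())   # entity F1 should approach 1.0; if not, suspect the data pipeline
```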

4) Compare direct decoding vs pipeline

If direct decoding is good but pipeline output looks bad, it’s aggregation/heuristics. (GitHub)

5) Check truncation rate

If many samples are truncated, implement stride/overflow label alignment. (GitHub)

6) Only then tune training knobs

Learning rate, epochs, warmup, batch size, and so on matter, but they rarely fix a broken alignment/evaluation pipeline.


Quick “most likely fixes” summary

  • Fix label alignment using word_ids() and set -100 appropriately. (Hugging Face)
  • Stop using token accuracy alone; compute entity-level F1 and ensure -100 is excluded. (Hugging Face)
  • Avoid label_all_tokens until everything else works (and if you use it, handle BIO correctly). (GitHub)
  • Don’t trust pipeline output by default; validate aggregation strategies. (GitHub)
  • Handle long documents with stride/overflow alignment if truncation is common. (GitHub)
  • If it’s truly domain shift, do domain/task-adaptive pretraining. (ACL Anthology)