KalpaTokenizer 128k

KalpaTokenizer 128k is a byte-fallback BPE tokenizer trained from scratch for English, Hindi in Devanagari, Hindi-Latin/Hinglish, and English-Hindi YouTube text.
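To illustrate what "byte-fallback" means in general, here is a minimal sketch. It assumes the common SentencePiece-style convention of one `<0xNN>` token per UTF-8 byte; whether this tokenizer names its byte tokens exactly that way is an assumption. The toy vocabulary and the helper names are hypothetical and have nothing to do with the real 128k vocabulary.

```python
def byte_fallback_tokens(piece: str) -> list[str]:
    """Spell a piece out as one <0xNN> token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def encode_with_fallback(pieces: list[str], vocab: set[str]) -> list[str]:
    """Keep pieces that are in the vocabulary; replace the rest with byte tokens."""
    out: list[str] = []
    for piece in pieces:
        if piece in vocab:
            out.append(piece)
        else:
            out.extend(byte_fallback_tokens(piece))
    return out


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"Kal", "market"}
tokens = encode_with_fallback(["Kal", "market", "ॐ"], toy_vocab)
print(tokens)  # ['Kal', 'market', '<0xE0>', '<0xA5>', '<0x90>']
```

Because any string can be written as UTF-8 bytes, a tokenizer built this way can represent unseen characters without mapping them to the unknown token.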

Training Data

The tokenizer was trained on a 10.076B-word corpus generated with seed 20260530 and target source fractions of 30% / 30% / 30% / 5% / 5%.

| Source | Target fraction | Actual words |
| --- | --- | --- |
| fineweb2_en | 30% | 3,078,414,514 |
| fineweb2_hin_deva | 30% | 3,007,721,942 |
| fineweb2_hin_latn | 30% | 2,989,086,847 |
| yt_enhi_deduped | 5% | 504,245,644 |
| yt_subtitles_enhi | 5% | 496,491,725 |
| Total | 100% | 10,075,960,672 |
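The realized fractions can be recomputed from the word counts in the table. The short script below is only a sanity check on the published numbers, not part of the training pipeline; the variable names are illustrative.

```python
# Word counts and target fractions copied from the table above.
words = {
    "fineweb2_en": 3_078_414_514,
    "fineweb2_hin_deva": 3_007_721_942,
    "fineweb2_hin_latn": 2_989_086_847,
    "yt_enhi_deduped": 504_245_644,
    "yt_subtitles_enhi": 496_491_725,
}
targets = {
    "fineweb2_en": 0.30,
    "fineweb2_hin_deva": 0.30,
    "fineweb2_hin_latn": 0.30,
    "yt_enhi_deduped": 0.05,
    "yt_subtitles_enhi": 0.05,
}

total = sum(words.values())
actual = {name: count / total for name, count in words.items()}

for name in words:
    print(f"{name:20s} target {targets[name]:6.2%}  actual {actual[name]:6.2%}")
print(f"{'total words':20s} {total:,}")
```

The counts sum to the stated total, and every source lands within one percentage point of its target fraction.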

Sources were sampled with a 10,000-row shuffle buffer. yt_enhi_deduped was loaded in mapped mode because its HF streaming dataset has only one data shard.
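For readers unfamiliar with shuffle buffers, the following is a generic, self-contained sketch of the technique. It is not the code used to build this corpus: the function name is hypothetical, and the buffer is shrunk to 5 rows so the effect is easy to see.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(rows: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size rows."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: list[T] = []
    for row in rows:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(row)
            continue
        # Buffer is full: emit a random resident row and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = row
    # Stream exhausted: flush the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=20260530))
print(shuffled)
```

A buffer like this only mixes rows locally: a row can be emitted early by at most `buffer_size - 1` positions. Streaming loaders typically compensate by also shuffling shard order, which is not possible when a dataset has a single shard. That is presumably why `yt_enhi_deduped` was loaded in mapped mode instead.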

Tokenizer

  • Vocabulary size: 128,000
  • Model max length: 65,536
  • EOS token: <|endoftext|>
  • PAD token: <|endoftext|>
  • UNK token: <unk>
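Because PAD and EOS share the same token, padding positions cannot be identified by comparing ids against the pad id without also hiding genuine end-of-text tokens. A minimal sketch of one common workaround follows: derive the loss mask from the attention mask. It assumes right padding, uses a made-up token id, and uses -100 as the ignore value, which is the convention in PyTorch-style cross-entropy losses. The helper name is hypothetical.

```python
def pad_batch(
    sequences: list[list[int]], pad_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad sequences to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 0  # hypothetical id; read the real one from the tokenizer
batch = [[11, 12, 13, EOS_ID], [21, EOS_ID]]
input_ids, attention_mask = pad_batch(batch, pad_id=EOS_ID)

# Mask labels by position, not by token id, so the real EOS stays in the loss.
labels = [
    [tok if keep == 1 else -100 for tok, keep in zip(ids, mask)]
    for ids, mask in zip(input_ids, attention_mask)
]
print(labels)  # [[11, 12, 13, 0], [21, 0, -100, -100]]
```

In the second sequence the first `0` is a real end-of-text token and stays in the labels, while the two trailing pads are ignored.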

Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kalpalabs/kalpatokenizer-128k")
ids = tokenizer.encode("Kal market me bahut bheed thi.", add_special_tokens=False)
text = tokenizer.decode(ids)