KalpaTokenizer 128k
KalpaTokenizer 128k is a byte-fallback BPE tokenizer trained from scratch for English, Hindi in Devanagari script, Hindi in Latin script (Hinglish), and English-Hindi YouTube text.
Training Data
The tokenizer was trained on a 10.076B-word corpus generated with seed 20260530 and target source fractions of 30% / 30% / 30% / 5% / 5%.
| Source | Target fraction | Actual words |
|---|---|---|
| fineweb2_en | 30% | 3,078,414,514 |
| fineweb2_hin_deva | 30% | 3,007,721,942 |
| fineweb2_hin_latn | 30% | 2,989,086,847 |
| yt_enhi_deduped | 5% | 504,245,644 |
| yt_subtitles_enhi | 5% | 496,491,725 |
| Total | 100% | 10,075,960,672 |
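
As a quick check on the table above, the realized source fractions can be recomputed from the word counts. This is a minimal sketch; the dictionary and variable names are illustrative and are not taken from the training code.

```python
# Word counts copied from the table above.
actual_words = {
    "fineweb2_en": 3_078_414_514,
    "fineweb2_hin_deva": 3_007_721_942,
    "fineweb2_hin_latn": 2_989_086_847,
    "yt_enhi_deduped": 504_245_644,
    "yt_subtitles_enhi": 496_491_725,
}

# Target fractions as stated for the corpus.
target_fraction = {
    "fineweb2_en": 0.30,
    "fineweb2_hin_deva": 0.30,
    "fineweb2_hin_latn": 0.30,
    "yt_enhi_deduped": 0.05,
    "yt_subtitles_enhi": 0.05,
}

total_words = sum(actual_words.values())
realized_fraction = {
    name: count / total_words for name, count in actual_words.items()
}

for name in actual_words:
    print(
        f"{name:20s} target={target_fraction[name]:.2%} "
        f"realized={realized_fraction[name]:.2%}"
    )
print(f"total words: {total_words:,}")
```

The per-source counts sum to the reported total, and each realized fraction lands within one percentage point of its target.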
Sources were sampled with a 10,000-row shuffle buffer. yt_enhi_deduped was loaded in mapped (non-streaming) mode because its Hugging Face streaming dataset has only one data shard.
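
For readers unfamiliar with shuffle buffers, the idea can be sketched in plain Python. This is a generic illustration of the technique, not the code used to build the corpus: the function name `shuffle_buffer` is hypothetical, and reusing the corpus seed here is an assumption made only for the example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shuffle_buffer(rows: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size rows."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for row in rows:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(row)
            continue
        # Buffer is full: emit a random buffered row and store the new row in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = row
    # Stream exhausted: emit the remaining rows in random order.
    rng.shuffle(buffer)
    yield from buffer


# Stand-in for a streamed source; real rows would be text documents.
shuffled = list(shuffle_buffer(range(100_000), buffer_size=10_000, seed=20260530))
```

A buffer of this kind only mixes rows locally: a row can be emitted late, but never before the buffer has filled, so the result is less random than a full shuffle of a dataset held in memory.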
Tokenizer
- Vocabulary size: 128,000
- Model max length: 65,536
- EOS token: <|endoftext|>
- PAD token: <|endoftext|>
- UNK token: <unk>
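
Because the PAD and EOS tokens are the same string, padded positions cannot be told apart from a genuine end-of-text token by ID alone; an attention mask carries that distinction. The sketch below shows the idea in plain Python. The helper `pad_batch` is hypothetical, and the token IDs (including `EOS_ID = 0`) are placeholders, not the tokenizer's real IDs.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    sequences: Sequence[Sequence[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences to equal length and build a matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token (including a real EOS), 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 0  # placeholder ID, assumed for illustration only
PAD_ID = EOS_ID  # PAD shares the EOS token

batch = [[11, 12, 13, EOS_ID], [21, EOS_ID]]
input_ids, attention_mask = pad_batch(batch, pad_id=PAD_ID)
```

In the second row the real EOS and the padding share the same ID, and only the mask separates them. Masking a training loss by comparing against the PAD ID would therefore also hide the real EOS.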
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("kalpalabs/kalpatokenizer-128k")
ids = tokenizer.encode("Kal market me bahut bheed thi.", add_special_tokens=False)
text = tokenizer.decode(ids)