muyo/Olmo-3-1025-7B-tokenizer-bos

Tokenizer derived from allenai/Olmo-3-1025-7B with a single modification: the added token at id 100256 (originally <|extra_id_0|>) has been renamed to <|beginoftext|> and registered as the BOS token. The original tokenizer used <|endoftext|> for both BOS and EOS; this version separates them so that diffusion / non-autoregressive training pipelines can mark sequence starts without ambiguity.

  • bos_token = <|beginoftext|> (id 100256)
  • eos_token = <|endoftext|> (id 100257)
  • Vocab size unchanged; model embeddings stay compatible because only one token's surface text changes.
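
As a rough illustration of why a rename leaves embeddings compatible, the sketch below models the vocabulary as a plain id-to-text table. The rename_token helper and the two-entry table are hypothetical and used only for illustration; they are not how the tokenizer files were actually edited. Only the ids and token strings come from this card.

```python
# Toy model of an id -> surface-text table; NOT the real vocabulary.
BOS_ID = 100256  # id stated in this card
EOS_ID = 100257  # id stated in this card


def rename_token(id_to_token, token_id, new_text):
    """Return a copy of the table with one entry's surface text replaced.

    Hypothetical helper: ids are never added, removed, or shifted.
    """
    if new_text in id_to_token.values():
        raise ValueError(f"{new_text!r} is already in the vocabulary")
    renamed = dict(id_to_token)
    renamed[token_id] = new_text
    return renamed


original = {BOS_ID: "<|extra_id_0|>", EOS_ID: "<|endoftext|>"}
renamed = rename_token(original, BOS_ID, "<|beginoftext|>")

# Same ids and same table size, so an embedding matrix indexed by id
# needs no resizing or row remapping.
assert renamed.keys() == original.keys()
assert len(renamed) == len(original)
assert renamed[BOS_ID] == "<|beginoftext|>"
assert renamed[EOS_ID] == original[EOS_ID]
```

The embedding row at index 100256 is reused as-is; whether its pretrained values are useful as a BOS representation is a separate question that this sketch does not address.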

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("muyo/Olmo-3-1025-7B-tokenizer-bos")
assert tok.bos_token_id == 100256
assert tok.eos_token_id != tok.bos_token_id

ids = tok("hello world", add_special_tokens=True)["input_ids"]
# First id is 100256 (<|beginoftext|>); the TemplateProcessing post-processor
# in tokenizer.json prepends it.
assert ids[0] == tok.bos_token_id
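
A distinct BOS id also makes sequence starts easy to find in a packed token stream. The sketch below is a minimal pure-Python illustration, assuming each document is wrapped as BOS + tokens + EOS before packing. That layout, the pack and sequence_starts helpers, and the small token ids are assumptions made for illustration; this card does not specify a packing scheme.

```python
BOS_ID = 100256  # <|beginoftext|>, per this card
EOS_ID = 100257  # <|endoftext|>, per this card


def pack(docs):
    """Concatenate documents into one stream as BOS + tokens + EOS each.

    Hypothetical packing layout, chosen only for this example.
    """
    stream = []
    for doc in docs:
        stream.append(BOS_ID)
        stream.extend(doc)
        stream.append(EOS_ID)
    return stream


def sequence_starts(stream):
    """Return the indices where a sequence begins.

    This is unambiguous only because BOS and EOS have different ids.
    """
    return [i for i, tok_id in enumerate(stream) if tok_id == BOS_ID]


docs = [[11, 12, 13], [21, 22]]  # made-up token ids
stream = pack(docs)
starts = sequence_starts(stream)
print(starts)
```

With a single shared id for both roles, the same scan would also match every sequence end, so a consumer would need extra bookkeeping to tell starts from ends.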