muyo/Olmo-3-1025-7B-tokenizer-bos

Tokenizer derived from allenai/Olmo-3-1025-7B with a single modification: the added token at id 100256 (originally <|extra_id_0|>) has been renamed to <|beginoftext|> and registered as the BOS token. The original tokenizer used <|endoftext|> for both BOS and EOS; this version separates them so that diffusion / non-autoregressive training pipelines can mark sequence starts without ambiguity.

bos_token = <|beginoftext|> (id 100256)
eos_token = <|endoftext|> (id 100257)
The vocabulary size is unchanged, so model embeddings stay compatible; only the surface text of one token changes.
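As a rough illustration of why a distinct BOS helps, here is a minimal sketch of packing several tokenized documents into one flat stream and recovering document boundaries from the BOS positions alone. The helpers `pack` and `segment_ids` are hypothetical names and not part of this repository or of transformers. The content token ids are arbitrary placeholders, not real tokenizations. Only the two special-token ids come from the description above.

```python
BOS_ID = 100256  # <|beginoftext|>, as listed above
EOS_ID = 100257  # <|endoftext|>, as listed above


def pack(docs, bos_id=BOS_ID, eos_id=EOS_ID):
    """Concatenate documents into one stream, wrapping each as BOS ... EOS."""
    stream = []
    for doc in docs:
        stream.append(bos_id)
        stream.extend(doc)
        stream.append(eos_id)
    return stream


def segment_ids(stream, bos_id=BOS_ID):
    """Label every position with the index of its document.

    A new document starts at each BOS, so no EOS lookup is needed.
    """
    current = -1
    labels = []
    for token in stream:
        if token == bos_id:
            current += 1
        labels.append(current)
    return labels


# Placeholder ids for illustration only; the second document is empty.
docs = [[15339, 1917], [], [40]]
stream = pack(docs)
segs = segment_ids(stream)
```

If BOS and EOS shared one id, the same scan could not tell whether a boundary token closes one document or opens the next, which is most visible around empty documents.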

Usage

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("muyo/Olmo-3-1025-7B-tokenizer-bos")
assert tok.bos_token_id == 100256
assert tok.eos_token_id != tok.bos_token_id

ids = tok("hello world", add_special_tokens=True)["input_ids"]
# The first id is 100256 (<|beginoftext|>); the TemplateProcessing
# post-processor in tokenizer.json prepends it.
assert ids[0] == tok.bos_token_id
