myX-Tokenizer: A Specialized Unigram Tokenizer for Myanmar NLP

myX-Tokenizer is a high-efficiency tokenizer specifically engineered for the Burmese (Myanmar) language, with extended support for English and Pali. It is designed to overcome the "fertility" issues common in multilingual tokenizers (like mBERT and XLM-R), where Burmese text is often over-fragmented into nearly character-level subwords.

Developed by Khant Sint Heinn (Kalix Louis) and released by DatarrX, this tokenizer provides a balanced representation of the Burmese language and is optimized for modern large language models (LLMs) and embedding models.

Technical Specifications

Algorithm: Unigram (SentencePiece)
Vocabulary Size: 128,000
Normalization: NMT NFKC Case-Folding (nmt_nfkc_cf)
Byte Fallback: Enabled (to prevent Unknown <unk> tokens)
Split Digits: False (Treats numbers as atomic units or logical groups)
Special Tokens: Includes 20+ specialized control tokens for ChatML, RAG, Tool Use, and Reasoning.
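
The specifications above correspond to standard SentencePiece trainer options. The following is a minimal sketch of how such a configuration could be expressed, not the author's actual training script. The corpus path, the output prefix, and the choice to register control tokens through user_defined_symbols are assumptions made for illustration, and only three of the control tokens named in this card are listed.

```python
def build_trainer_options(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options matching the listed specifications."""
    return {
        "input": corpus_path,            # hypothetical one-sentence-per-line corpus
        "model_prefix": model_prefix,    # hypothetical output prefix
        "model_type": "unigram",
        "vocab_size": 128000,
        "normalization_rule_name": "nmt_nfkc_cf",
        "byte_fallback": True,
        "split_digits": False,
        # Assumption: control tokens are registered as user-defined symbols.
        "user_defined_symbols": ["<|im_start|>", "<|im_end|>", "<|tool_call|>"],
    }


def train(options: dict) -> None:
    """Run training; imported lazily so the options can be inspected without the library."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**options)


options = build_trainer_options("corpus.txt", "myx_sketch")
print(options["model_type"], options["vocab_size"])
```

Calling train(options) would require a corpus large enough to support a 128,000-piece vocabulary.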

Data Composition & Focus

The tokenizer was trained on a corpus of more than 40 million lines, weighted across domains as follows to ensure cross-domain robustness:

Language/Domain	Percentage	Sources
Burmese (General)	~70%	Wikipedia, News (BBC Burmese), Literature, Written/Spoken Text
English	~20%	High-quality English general sentences
Pali	~10%	Tipitaka Dataset (Religious and Historical texts)
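
As a rough illustration of what these weights mean in practice, the sketch below splits a total line count across the three domains. The 40,000,000 figure is the lower bound quoted above and the percentages are the approximate values from the table, so the resulting counts are indicative only. The function name and the dictionary keys are hypothetical.

```python
def line_budget(total_lines: int, weights: dict) -> dict:
    """Split a total line count across domains in proportion to their weights."""
    total_weight = sum(weights.values())
    return {
        name: round(total_lines * weight / total_weight)
        for name, weight in weights.items()
    }


# Approximate percentages from the table above.
weights = {"burmese": 70, "english": 20, "pali": 10}
budget = line_budget(40_000_000, weights)
print(budget)
```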

Training Nuances

While the tokenizer handles both formal and informal Burmese in Unicode well, users should note the following limitations:

Modern Slang/Non-standard Spelling: Quality may decrease for highly colloquial internet slang or intentional misspellings.
Encoding Issues: Specifically optimized for Unicode. Performance on Zawgyi-encoded text or texts with severe Unicode sequence errors may be suboptimal.

Efficiency Benchmark

The primary goal of myX-Tokenizer is to lower the token-to-character ratio for Myanmar text. By reducing fertility (the average number of tokens produced per unit of text), a downstream model can fit significantly more semantic content into the same context window than it could with a general multilingual tokenizer.
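
The two quantities discussed here, average tokens per text and characters per token, can be measured for any tokenizer with a few lines of code. The sketch below is a minimal, tokenizer-agnostic helper. The function name token_stats is hypothetical, and a whitespace splitter stands in for a real tokenizer so that the example runs without a model file.

```python
from typing import Callable, Iterable, List


def token_stats(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> dict:
    """Return average tokens per text and average characters per token."""
    n_texts = n_tokens = n_chars = 0
    for text in texts:
        n_texts += 1
        n_tokens += len(tokenize(text))
        n_chars += len(text)  # counts Unicode code points, not grapheme clusters
    if n_texts == 0 or n_tokens == 0:
        raise ValueError("need at least one text that yields tokens")
    return {
        "avg_tokens_per_text": n_tokens / n_texts,
        "chars_per_token": n_chars / n_tokens,
    }


# Stand-in tokenizer (whitespace split) so the sketch runs without a model file.
sample = ["a bc def", "gh"]
stats = token_stats(sample, str.split)
print(stats)
```

With the real tokenizer loaded through sentencepiece, passing sp.encode_as_pieces as the tokenize argument should give comparable statistics for myX-Tokenizer.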

Quantitative Evaluation

To ensure an unbiased evaluation, we benchmarked the tokenizers using the jojo-ai-mst/Myanmar-Agricutlure-1K dataset (specifically the "Output" column with 1,053 rows). Note: This dataset was held out during the training phase of myX-Tokenizer to serve as a clean test set.

Tokenizer	Avg. Tokens per Sentence	Relative Efficiency
myX-Tokenizer	34.2	Most Efficient
XLM-R	64.05	-
mBERT	109.59	-
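
The averages in the table above can be expressed as ratios relative to myX-Tokenizer with simple arithmetic. The sketch below only restates the reported numbers and introduces no new measurements.

```python
# Average tokens per sentence, copied from the table above.
avg_tokens = {"myX-Tokenizer": 34.2, "XLM-R": 64.05, "mBERT": 109.59}

baseline = avg_tokens["myX-Tokenizer"]
# How many times more tokens each tokenizer needs than the baseline.
ratios = {name: avg / baseline for name, avg in avg_tokens.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the tokens of myX-Tokenizer")
```

On these figures, XLM-R uses roughly 1.87 times and mBERT roughly 3.20 times as many tokens as myX-Tokenizer for the same sentences.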

Qualitative Comparison: Segmentation Showcase

The following example shows how each tokenizer segments a complex Burmese sentence. The multilingual tokenizers split Burmese words at arbitrary character boundaries, whereas myX-Tokenizer keeps most morphological and semantic units intact.

Test Sentence:

"လူမျိုးတစ်မျိုး၏ စာပေယဉ်ကျေးမှု တိမ်ကောပျောက်ကွယ်ခြင်းသည် ထိုလူမျိုး ကမ္ဘာ့မြေပုံပေါ်မှ ပျောက်ကွယ်သွားခြင်းပင် ဖြစ်သည်။"

myX-Tokenizer (17 tokens): ['▁လူမျိုး', 'တစ်မျိုး', '၏', '▁စာပေ', 'ယဉ်ကျေးမှု', '▁တိမ်ကော', 'ပျောက်ကွယ်', 'ခြင်းသည်', '▁ထို', 'လူမျိုး', '▁ကမ္ဘာ့', 'မြေပုံ', 'ပေါ်မှ', '▁ပျောက်ကွယ်', 'သွားခြင်း', 'ပင်', '▁ဖြစ်သည်။'] (Observation: Highly semantic and readable units.)
XLM-R (36 tokens): ['▁', 'လူမျိုး', 'တစ်', 'မျိုး', '၏', '▁စာ', 'ပေ', 'ယ', 'ဉ', '်', 'ကျ', 'ေး', 'မှု', '▁', 'တိ', 'မ်', 'ကော', 'ပျောက်', 'ကွယ်', 'ခြင်း', 'သည်', '▁ထို', 'လူမျိုး', '▁ကမ္ဘာ့', 'မြေ', 'ပုံ', 'ပေါ်', 'မှ', '▁', 'ပျောက်', 'ကွယ်', 'သွား', 'ခြင်း', 'ပင်', '▁', 'ဖြစ်သည်။'] (Observation: Excessive fragmentation of simple words like 'ယဉ်ကျေးမှု'.)
mBERT (68 tokens): ['လ', '##ူ', '##မ', '##ျိုး', '##တ', '##စ်', '##မ', '##ျိုး', '၏', 'စ', '##ာ', '##ပ', '##ေ', '##ယ', '##ဉ', '##်', '##က', '##ျ', '##ေး', '##မှု', 'တ', '##ိ', '##မ', '##်', '##က', '##ော', '##ပ', '##ျ', '##ောက်', '##က', '##ွ', '##ယ်', '##ခြင်း', '##သည်', 'ထ', '##ို', '##လ', '##ူ', '##မ', '##ျိုး', 'က', '##မ', '##္', '##ဘ', '##ာ', '##့', '##မ', '##ြ', '##ေ', '##ပ', '##ုံ', '##ပ', '##ေါ်', '##မှ', 'ပ', '##ျ', '##ောက်', '##က', '##ွ', '##ယ်', '##သ', '##ွ', '##ား', '##ခြင်း', '##ပ', '##င်', 'ဖြစ်သည်', '။'] (Observation: Severe over-fragmentation; almost character-level, leading to high computational cost and loss of context.)

Usage

For the best results in your Myanmar NLP projects, we recommend using the tokenizer.model file directly with the sentencepiece library.

Python (SentencePiece)

import sentencepiece as spm

# Load the model
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# Encode
text = "မြန်မာနိုင်ငံ၏ သတင်းထူးများ နှင့် Artificial Intelligence နည်းပညာ"
tokens = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
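
Because the specification lists nmt_nfkc_cf normalization, which includes case folding, decoding the IDs should not be expected to reproduce mixed-case input exactly: Latin letters will most likely come back lowercased. The sketch below is one way to check this; the helper name roundtrip is hypothetical, and the check runs only if tokenizer.model is present in the working directory.

```python
import os


def roundtrip(processor, text: str) -> str:
    """Encode text to IDs and decode it back with the same processor."""
    return processor.decode_ids(processor.encode_as_ids(text))


if os.path.exists("tokenizer.model"):  # skipped when the model file is absent
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
    original = "Artificial Intelligence နည်းပညာ"
    restored = roundtrip(sp, original)
    print(restored)
    print("exact match:", restored == original)
```

If case must be preserved downstream, keep the original string alongside the token IDs rather than relying on decoding.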

Integrated Special Tokens

The tokenizer includes pre-defined IDs for modern AI workflows:

Chat: <|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>
Reasoning: <|thought|>, <|reflection|>
RAG/Tools: <|context_start|>, <|tool_call|>, <|tool_response|>
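
This card lists the control tokens but does not define a prompt template. The sketch below assumes the commonly used ChatML layout built around <|im_start|> and <|im_end|>; the function name format_chatml and the message structure are hypothetical, and the intended template for this tokenizer may differ.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render role/content messages in the common ChatML layout (assumed template)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([{"role": "user", "content": "မင်္ဂလာပါ"}])
print(prompt)
```

Whether each control token maps to a single ID can be checked on the loaded model with sp.piece_to_id, for example sp.piece_to_id("<|im_start|>").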

Quality Report (BBC News Benchmark)

Metric	Ultra Result
Total Characters	40,000,000+
Density (Chars/Token)	~2.5 - 3.1
Vocab Coverage (%)	100% (Byte Fallback active)

Development & Distribution

Developed by: Khant Sint Heinn (Kalix Louis)
Published by: DatarrX (Myanmar Open Source NGO)
License: Apache License 2.0
Training Datasets:
- kalixlouiis/HFcourse-english-burmese-parallel-corpus
- kalixlouiis/Myanmar-English-general-text-translation
- kalixlouiis/myanmar-written-spoken-text-pairs
- kalixlouiis/myanmar-linguistic-ambiguitie-001
- kalixlouiis/general-burmese-sentences
- DatarrX/Burmese-English-Code-Mixed-Corpus
- DatarrX/tipitaka-dataset
- agentlans/high-quality-english-sentences
- DatarrX/myX-Burmese-Morpho-Synthetic

Citation

If you utilize this tokenizer in your research or applications, please use the following citation:

@software{khantsintheinn2026myxtokenizer,
  author = {Khant Sint Heinn},
  title = {myX-Tokenizer: A Specialized Unigram Tokenizer for Myanmar NLP},
  year = {2026},
  publisher = {DatarrX},
  url = {https://huggingface.co/DatarrX/myX-Tokenizer}
}

About the Author

Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

Connect with the Author:
GitHub | Hugging Face | Kaggle

Disclaimer: This project is an ongoing effort to improve Myanmar language support in AI. We welcome feedback and contributions via the Community tab.
