myX-Tokenizer: A Specialized Unigram Tokenizer for Myanmar NLP
myX-Tokenizer is a high-efficiency tokenizer specifically engineered for the Burmese (Myanmar) language, with extended support for English and Pali. It is designed to overcome the "fertility" issues common in multilingual tokenizers (like mBERT and XLM-R), where Burmese text is often over-fragmented into nearly character-level subwords.
Developed by Khant Sint Heinn (Kalix Louis) and released under DatarrX, this tokenizer provides a balanced representation of the Burmese language, optimized for modern Large Language Models (LLMs) and Embedding models.
Technical Specifications
- Algorithm: Unigram (SentencePiece)
- Vocabulary Size: 128,000
- Normalization: NMT NFKC Case-Folding (nmt_nfkc_cf)
- Byte Fallback: Enabled (to prevent unknown <unk> tokens)
- Split Digits: False (treats numbers as atomic units or logical groups)
- Special Tokens: Includes 20+ specialized control tokens for ChatML, RAG, Tool Use, and Reasoning.
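The specifications above map onto SentencePiece trainer options roughly as follows. This is a minimal sketch, not the author's actual training script: the corpus path, the model prefix, and the choice to register control tokens as user-defined symbols are assumptions made for illustration, and only three of the 20+ special tokens are shown.

```python
def build_trainer_kwargs(corpus_path, model_prefix):
    """Return SentencePiece trainer options mirroring the listed specifications."""
    return {
        "input": corpus_path,          # hypothetical path: one sentence per line
        "model_prefix": model_prefix,  # hypothetical output prefix
        "model_type": "unigram",
        "vocab_size": 128000,
        "normalization_rule_name": "nmt_nfkc_cf",
        "byte_fallback": True,         # unseen characters decompose into byte pieces
        "split_digits": False,         # digits are not forced into single-digit pieces
        # Assumption: control tokens are registered as user-defined symbols.
        "user_defined_symbols": ["<|im_start|>", "<|im_end|>", "<|tool_call|>"],
    }


def train(corpus_path, model_prefix="myx_sketch"):
    """Train a tokenizer with the options above (needs `pip install sentencepiece`)."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))
```

Calling `train("corpus.txt")` would write `myx_sketch.model` and `myx_sketch.vocab`. A vocabulary of 128,000 pieces needs a large corpus; on a small sample the trainer will reject the requested size.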
Data Composition & Focus
The tokenizer was trained on a corpus of more than 40 million lines, weighted across languages and domains to improve cross-domain robustness:
| Language/Domain | Percentage | Sources |
|---|---|---|
| Burmese (General) | ~70% | Wikipedia, News (BBC Burmese), Literature, Written/Spoken Text |
| English | ~20% | High-quality English general sentences |
| Pali | ~10% | Tipitaka Dataset (Religious and Historical texts) |
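To make the proportions concrete, the sketch below splits a line budget according to the approximate shares in the table. The function name and the rule for handling the rounding remainder are illustrative assumptions; the actual sampling procedure used for training is not described here.

```python
# Approximate shares from the table, as integer percentages to avoid float rounding.
MIX_PERCENT = {"burmese": 70, "english": 20, "pali": 10}


def lines_per_source(total_lines, mix=MIX_PERCENT):
    """Split a line budget across sources according to the target shares."""
    counts = {name: total_lines * pct // 100 for name, pct in mix.items()}
    # Give any rounding remainder to the largest source so counts sum to the budget.
    largest = max(mix, key=mix.get)
    counts[largest] += total_lines - sum(counts.values())
    return counts


print(lines_per_source(40_000_000))
```

For a 40-million-line budget this gives 28 million Burmese, 8 million English, and 4 million Pali lines.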
Training Nuances
While the tokenizer handles both formal and informal Unicode Burmese well, users should note the following limitations:
- Modern Slang/Non-standard Spelling: Quality may decrease for highly colloquial internet slang or intentional misspellings.
- Encoding Issues: The tokenizer is optimized for Unicode. Performance on Zawgyi-encoded text, or on text with severe Unicode sequence errors, may be suboptimal.
Efficiency Benchmark
The primary goal of myX-Tokenizer is to lower the number of tokens needed per character of Myanmar text. With a lower fertility (fewer tokens for the same text), a downstream model can fit significantly more content into the same context window than it could with a general multilingual tokenizer.
Quantitative Evaluation
To ensure an unbiased evaluation, we benchmarked the tokenizers using the jojo-ai-mst/Myanmar-Agricutlure-1K dataset (specifically the "Output" column with 1,053 rows). Note: This dataset was held out during the training phase of myX-Tokenizer to serve as a clean test set.
| Tokenizer | Avg Tokens per Sentence | Efficiency Score (Fertility) |
|---|---|---|
| myX-Tokenizer | 34.2 | Most Efficient |
| XLM-R | 64.05 | - |
| mBERT | 109.59 | - |
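The table's metric can be computed as sketched below. The helper names are illustrative and this is not the author's evaluation script: `tokenize` stands for any callable that maps a string to a list of tokens, such as `sp.encode_as_pieces`. The second helper expresses the reported averages relative to myX-Tokenizer.

```python
def avg_tokens_per_sentence(sentences, tokenize):
    """Mean number of tokens that `tokenize` produces per sentence."""
    counts = [len(tokenize(s)) for s in sentences]
    return sum(counts) / len(counts)


def relative_cost(avg_other, avg_baseline):
    """How many times more tokens another tokenizer needs than the baseline."""
    return avg_other / avg_baseline


# Averages as reported in the table above.
reported = {"myX-Tokenizer": 34.2, "XLM-R": 64.05, "mBERT": 109.59}
for name, avg in reported.items():
    print(f"{name}: {relative_cost(avg, reported['myX-Tokenizer']):.2f}x")
```

On the reported figures, XLM-R needs about 1.87 times and mBERT about 3.20 times as many tokens as myX-Tokenizer for the same text.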
Qualitative Comparison: Segmentation Showcase
The following example demonstrates how each model segments a complex Burmese sentence. While multilingual models struggle with Burmese character boundaries, myX-Tokenizer preserves morphological and semantic integrity.
Test Sentence:
"လူမျိုးတစ်မျိုး၏ စာပေယဉ်ကျေးမှု တိမ်ကောပျောက်ကွယ်ခြင်းသည် ထိုလူမျိုး ကမ္ဘာ့မြေပုံပေါ်မှ ပျောက်ကွယ်သွားခြင်းပင် ဖြစ်သည်။"
myX-Tokenizer (17 tokens):
['▁လူမျိုး', 'တစ်မျိုး', '၏', '▁စာပေ', 'ယဉ်ကျေးမှု', '▁တိမ်ကော', 'ပျောက်ကွယ်', 'ခြင်းသည်', '▁ထို', 'လူမျိုး', '▁ကမ္ဘာ့', 'မြေပုံ', 'ပေါ်မှ', '▁ပျောက်ကွယ်', 'သွားခြင်း', 'ပင်', '▁ဖြစ်သည်။']
(Observation: Highly semantic and readable units.)
XLM-R (36 tokens):
['▁', 'လူမျိုး', 'တစ်', 'မျိုး', '၏', '▁စာ', 'ပေ', 'ယ', 'ဉ', '်', 'ကျ', 'ေး', 'မှု', '▁', 'တိ', 'မ်', 'ကော', 'ပျောက်', 'ကွယ်', 'ခြင်း', 'သည်', '▁ထို', 'လူမျိုး', '▁ကမ္ဘာ့', 'မြေ', 'ပုံ', 'ပေါ်', 'မှ', '▁', 'ပျောက်', 'ကွယ်', 'သွား', 'ခြင်း', 'ပင်', '▁', 'ဖြစ်သည်။']
(Observation: Excessive fragmentation of simple words like 'ယဉ်ကျေးမှု'.)
mBERT (68 tokens):
['လ', '##ူ', '##မ', '##ျိုး', '##တ', '##စ်', '##မ', '##ျိုး', '၏', 'စ', '##ာ', '##ပ', '##ေ', '##ယ', '##ဉ', '##်', '##က', '##ျ', '##ေး', '##မှု', 'တ', '##ိ', '##မ', '##်', '##က', '##ော', '##ပ', '##ျ', '##ောက်', '##က', '##ွ', '##ယ်', '##ခြင်း', '##သည်', 'ထ', '##ို', '##လ', '##ူ', '##မ', '##ျိုး', 'က', '##မ', '##္', '##ဘ', '##ာ', '##့', '##မ', '##ြ', '##ေ', '##ပ', '##ုံ', '##ပ', '##ေါ်', '##မှ', 'ပ', '##ျ', '##ောက်', '##က', '##ွ', '##ယ်', '##သ', '##ွ', '##ား', '##ခြင်း', '##ပ', '##င်', 'ဖြစ်သည်', '။']
(Observation: Severe over-fragmentation; almost character-level, leading to high computational cost and loss of context.)
Usage
For the best results in your Myanmar NLP projects, we recommend using the tokenizer.model file directly with the sentencepiece library.
Python (SentencePiece)
import sentencepiece as spm
# Load the model
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
# Encode
text = "မြန်မာနိုင်ငံ၏ သတင်းထူးများ နှင့် Artificial Intelligence နည်းပညာ"
tokens = sp.encode_as_pieces(text)
ids = sp.encode_as_ids(text)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
Integrated Special Tokens
The tokenizer includes pre-defined IDs for modern AI workflows:
- Chat: <|im_start|>, <|im_end|>, <|system|>, <|user|>, <|assistant|>
- Reasoning: <|thought|>, <|reflection|>
- RAG/Tools: <|context_start|>, <|tool_call|>, <|tool_response|>
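As an illustration of how the chat tokens might be used, the sketch below renders a conversation in the common ChatML layout. The template is an assumption: it uses only <|im_start|> and <|im_end|>, and a model trained with this tokenizer may expect a different arrangement, for example one built on the <|system|>, <|user|>, and <|assistant|> tokens.

```python
def format_chatml(messages):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML-style prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    # Open an assistant turn so generation continues from this point.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "မင်္ဂလာပါ"},
])
print(prompt)
```

Because the control tokens are part of the vocabulary, each marker should encode to a single ID rather than being split into subword pieces.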
Quality Report (BBC News Benchmark)
| Metric | Ultra Result |
|---|---|
| Total Characters | 40,000,000+ |
| Density (Chars/Token) | ~2.5 - 3.1 |
| Vocab Coverage (%) | 100% (Byte Fallback active) |
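The density figure can be measured as sketched below. The helper names are illustrative, `tokenize` is any callable that returns a list of tokens, and `len()` counts Unicode code points; whether the report counted code points or grapheme clusters is an assumption. The second helper derives the token count implied by the reported character total and density range.

```python
def chars_per_token(texts, tokenize):
    """Average number of characters (code points) covered by one token."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens


def implied_token_range(total_chars, low_density, high_density):
    """Token counts implied by a character total and a chars-per-token range."""
    # Higher density means fewer tokens, so it gives the lower bound.
    return total_chars / high_density, total_chars / low_density


fewest, most = implied_token_range(40_000_000, 2.5, 3.1)
print(f"{fewest:,.0f} to {most:,.0f} tokens")
```

For 40 million characters at 2.5 to 3.1 characters per token, this implies roughly 12.9 to 16 million tokens.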
Development & Distribution
- Developed by: Khant Sint Heinn (Kalix Louis)
- Published by: DatarrX (Myanmar Open Source NGO)
- License: Apache License 2.0
- Training Datasets:
- kalixlouiis/HFcourse-english-burmese-parallel-corpus
- kalixlouiis/Myanmar-English-general-text-translation
- kalixlouiis/myanmar-written-spoken-text-pairs
- kalixlouiis/myanmar-linguistic-ambiguitie-001
- kalixlouiis/general-burmese-sentences
- DatarrX/Burmese-English-Code-Mixed-Corpus
- DatarrX/tipitaka-dataset
- agentlans/high-quality-english-sentences
- DatarrX/myX-Burmese-Morpho-Synthetic
Citation
If you utilize this tokenizer in your research or applications, please use the following citation:
@software{khantsintheinn2026myxtokenizer,
author = {Khant Sint Heinn},
title = {myX-Tokenizer: A Specialized Unigram Tokenizer for Myanmar NLP},
year = {2026},
publisher = {DatarrX},
url = {https://huggingface.co/DatarrX/myX-Tokenize}
}
About the Author
Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
Connect with the Author:
GitHub | Hugging Face | Kaggle
Disclaimer: This project is an ongoing effort to improve Myanmar language support in AI. We welcome feedback and contributions via the Community tab.
